[whatwg] Handling of illegal byte-sequences (typically in UTF-8)
Øistein E. Andersen
html5 at xn--istein-9xa.com
Thu Nov 23 18:11:57 PST 2006
> Bytes that are not valid UTF-8 sequences must be interpreted as [...] U+FFFD
> Bytes or sequences of bytes [...] that could not be converted to Unicode characters
> must be converted to U+FFFD
If I read this correctly, section 8.1.4 requires that an illegal UTF-8 sequence like
F2 BF BF (the first three bytes of a four-byte sequence, not followed by
a continuation byte) be converted into exactly three U+FFFD characters (one
for each byte), whereas section 9.2.2 also allows a single replacement character
(and possibly even two) in this case. By the same reading, section 9.2.2 permits
n repetitions of the three-byte sequence to be replaced by any number of U+FFFD
characters between 1 and 3n.
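To make the difference concrete, here is a minimal Python sketch (mine, not taken
from the draft) of two possible treatments of F2 BF BF: one U+FFFD per offending
byte, as I read section 8.1.4, and one U+FFFD per truncated prefix, which is only
one of the outcomes that section 9.2.2 would seem to permit. The names decode_utf8
and _scan are hypothetical.

```python
REPLACEMENT = "\ufffd"

def _second_byte_range(lead):
    # Allowed range for the byte right after the lead byte; the narrower
    # ranges exclude overlong forms, surrogates and values above U+10FFFF.
    if lead == 0xE0:
        return 0xA0, 0xBF
    if lead == 0xED:
        return 0x80, 0x9F
    if lead == 0xF0:
        return 0x90, 0xBF
    if lead == 0xF4:
        return 0x80, 0x8F
    return 0x80, 0xBF

def _scan(data, i):
    """Return (char, length) for a valid sequence starting at data[i],
    or (None, length of the longest valid prefix, at least 1) otherwise."""
    lead = data[i]
    if lead < 0x80:
        return chr(lead), 1
    if 0xC2 <= lead <= 0xDF:
        need = 1
    elif 0xE0 <= lead <= 0xEF:
        need = 2
    elif 0xF0 <= lead <= 0xF4:
        need = 3
    else:
        return None, 1  # stray continuation byte or impossible lead byte
    lo, hi = _second_byte_range(lead)
    n = 1
    for k in range(need):
        j = i + 1 + k
        if j >= len(data) or not (lo <= data[j] <= hi):
            return None, n  # sequence is truncated or malformed here
        n += 1
        lo, hi = 0x80, 0xBF  # later continuation bytes use the full range
    return data[i:i + n].decode("utf-8"), n

def decode_utf8(data, per_byte):
    """Decode bytes, replacing errors with U+FFFD.
    per_byte=True:  skip one byte per error (one U+FFFD for each bad byte).
    per_byte=False: skip the whole truncated prefix (one U+FFFD for it)."""
    out = []
    i = 0
    while i < len(data):
        ch, n = _scan(data, i)
        if ch is None:
            out.append(REPLACEMENT)
            i += 1 if per_byte else n
        else:
            out.append(ch)
            i += n
    return "".join(out)

bad = b"\xf2\xbf\xbf"
print(len(decode_utf8(bad, per_byte=True)))       # 3
print(len(decode_utf8(bad, per_byte=False)))      # 1
print(len(decode_utf8(bad * 3, per_byte=True)))   # 9, i.e. 3n for n = 3
print(len(decode_utf8(bad * 3, per_byte=False)))  # 3
```

Both functions agree on well-formed input; they differ only in how far they
advance after an error, which is exactly the point the two sections leave open.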
I realise that the underspecification in section 9.2.2 may well be intentional, given that
this section is not limited to UTF-8, but the required handling can probably be expressed
in such a way that it applies to any encoding (how easily depends on the handling chosen).
Alternatively, a reference to an authoritative source would serve the purpose in the
particular case of UTF-8, if such a document can be found.
[Currently, an alert reader might infer that the treatment indicated in section 8.1.4
would be preferable also in section 9.2.2, but readers can hardly be expected to
draw that inference merely for the sake of consistency.]
Øistein E. Andersen