[whatwg] Handling of illegal byte-sequences (typically in UTF-8)

Øistein E. Andersen html5 at xn--istein-9xa.com
Thu Nov 23 18:11:57 PST 2006


Section 8.1.4:
> Bytes that are not valid UTF-8 sequences must be interpreted as [...] U+FFFD

Section 9.2.2:
> Bytes or sequences of bytes [...] that could not be converted to Unicode characters
> must be converted to U+FFFD

If I read this correctly, section 8.1.4 requires that an illegal UTF-8 sequence
like F2 BF BF (the first three bytes of a four-byte sequence, not followed by a
continuation byte) be converted into exactly three U+FFFD characters (one for
each byte), whereas section 9.2.2 also allows a single replacement character
(and possibly even two) in this case. More generally, section 9.2.2 permits n
repetitions of the three-byte sequence to be replaced by any number of U+FFFD
characters between 1 and 3n.
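
To make the difference concrete, here is a minimal Python sketch of the two
extremes. It is only an illustration, not an algorithm taken from either
section: the function name decode_utf8 and the collapse_runs flag are
hypothetical. With collapse_runs=False, every byte that does not belong to a
valid sequence yields its own U+FFFD (my reading of section 8.1.4); with
collapse_runs=True, a whole run of such bytes yields a single U+FFFD (the
other end of what section 9.2.2 appears to permit).

```python
REPLACEMENT = "\ufffd"

def decode_utf8(data: bytes, collapse_runs: bool = False) -> str:
    """Decode UTF-8, replacing bytes that are not part of a valid sequence.

    collapse_runs=False: one U+FFFD per offending byte.
    collapse_runs=True:  one U+FFFD per run of consecutive offending bytes.
    (Both the name and the flag are assumptions made for this sketch.)
    """
    out = []
    i = 0
    in_bad_run = False
    while i < len(data):
        # A UTF-8 sequence is 1 to 4 bytes long; try each length in turn
        # and accept the first prefix that decodes strictly.
        for length in range(1, 5):
            try:
                text = data[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                continue
            out.append(text)
            i += length
            in_bad_run = False
            break
        else:
            # No valid sequence starts at this byte: replace it and move on.
            if not (collapse_runs and in_bad_run):
                out.append(REPLACEMENT)
            in_bad_run = True
            i += 1
    return "".join(out)

truncated = b"\xf2\xbf\xbf"  # first three bytes of a four-byte sequence
print(len(decode_utf8(truncated)))                          # 3
print(len(decode_utf8(truncated, collapse_runs=True)))      # 1
print(len(decode_utf8(truncated * 4)))                      # 12, i.e. 3n
print(len(decode_utf8(truncated * 4, collapse_runs=True)))  # 1
```

Any policy between these two (for instance, one U+FFFD per truncated sequence,
giving n) would also seem to satisfy section 9.2.2 as currently worded.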

I realise that the underspecification in section 9.2.2 may well be intentional,
given that this section is not limited to UTF-8, but the required handling can
be expressed (more or less easily, depending on the handling chosen) in such a
way that it applies to any encoding.

Alternatively, a reference to an authoritative source would of course serve the
purpose in the particular case of UTF-8 (if such a document can be found).

[Currently, an alert reader might infer that the treatment indicated in section
8.1.4 would be preferable in section 9.2.2 as well, but readers can hardly be
expected to make that inference for the sake of consistency.]

-- 
Øistein E. Andersen


