[whatwg] Handling of illegal byte-sequences (typically in UTF-8)
Ian Hickson
ian at hixie.ch
Thu Jun 14 18:15:47 PDT 2007

On Fri, 24 Nov 2006, Øistein E. Andersen wrote:
>
> Section 8.1.4:
> > Bytes that are not valid UTF-8 sequences must be interpreted as [...] U+FFFD
>
> Section 9.2.2:
> > Bytes or sequences of bytes [...] that could not be converted to Unicode characters
> > must be converted to U+FFFD
>
> If I read this correctly, section 8.1.4 requires that an illegal UTF-8
> sequence like F2 BF BF (the first three bytes of a four-byte sequence,
> obviously not followed by a continuation byte) be converted into exactly
> three U+FFFD characters (one for each byte), whereas section 9.2.2 also
> allows one single replacement character (and possibly even two) in this
> case (and permits an arbitrary number n of repetitions of the three-byte
> sequence to be replaced by any number of U+FFFD characters between 1 and
> 3n).
>
> I realise that the underspecification in section 9.2.2 may well be
> intentional, given that this section is not limited to UTF-8, but the
> requirement can (more or less easily, depending on the handling chosen)
> be expressed in such a way that it applies to any encoding.
>
> Alternatively, a reference to an authoritative source would of course
> fulfil the purpose in the particular case of UTF-8 (if such a document
> can be found).
>
> [Currently, an alert reader might infer that the treatment indicated in
> section 8.1.4 would be preferable also in section 9.2.2, but such
> inference for consistency can hardly be expected.]

On Fri, 24 Nov 2006, Henri Sivonen wrote:
>
> I'm inclined to think that interop in error situations doesn't need to
> go as deep as defining how many replacement characters (in the range
> 1...number of bytes in a faulty sequence) a character decoder has to
> emit. Apps may want to delegate character decoding to an outside library
> whose authors don't care about the details of HTML5. (For example, it
> appears that Safari is leaving this stuff to ICU.) Chances are that
> there's more value in being able to use a library than in getting a
> specific number of replacement characters on error.
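
As a rough illustration of the delegation point above, the sketch below
hands decoding to Python's built-in UTF-8 codec and checks only the looser
requirement: a faulty sequence yields between 1 and (number of bytes)
replacement characters. The Python codec merely stands in for an outside
library such as ICU, and the function names are hypothetical.

```python
REPLACEMENT = "\ufffd"


def decode_with_library(data: bytes) -> str:
    """Delegate decoding to the codec; how many U+FFFD it emits is its choice."""
    return data.decode("utf-8", errors="replace")


def within_liberal_bounds(faulty: bytes) -> bool:
    """True if `faulty` decoded to between 1 and len(faulty) U+FFFD characters.

    Assumes `faulty` is one ill-formed sequence with no valid text around it.
    """
    text = decode_with_library(faulty)
    return set(text) == {REPLACEMENT} and 1 <= len(text) <= len(faulty)


# F2 BF BF: the first three bytes of a four-byte sequence, cut short.
print(within_liberal_bounds(b"\xf2\xbf\xbf"))
```

A check of this shape passes whichever count the library happens to pick,
which is the property that makes delegation workable.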

On Sat, 25 Nov 2006, Øistein E. Andersen wrote:
>
> I agree. The current slight inconsistency should probably be amended by
> making section 8.1.4 more liberal rather than the other way round.

Done.
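
For readers following along, the two readings discussed above for F2 BF BF
can be sketched as a small hand-written decoder. This is only an
illustration, not text from either section: the names (decode_utf8,
per_byte, _continuation_ranges) are made up for the example, and the byte
ranges are the well-formed UTF-8 ranges from the Unicode Standard.

```python
REPLACEMENT = "\ufffd"
CONT = (0x80, 0xBF)  # ordinary continuation-byte range


def _continuation_ranges(lead: int):
    """Allowed ranges for the bytes following `lead`, or None if `lead`
    cannot start a multi-byte sequence (hypothetical helper)."""
    if 0xC2 <= lead <= 0xDF:
        return [CONT]
    if lead == 0xE0:
        return [(0xA0, 0xBF), CONT]   # excludes overlong forms
    if lead == 0xED:
        return [(0x80, 0x9F), CONT]   # excludes surrogates
    if 0xE1 <= lead <= 0xEF:
        return [CONT, CONT]
    if lead == 0xF0:
        return [(0x90, 0xBF), CONT, CONT]
    if 0xF1 <= lead <= 0xF3:
        return [CONT, CONT, CONT]
    if lead == 0xF4:
        return [(0x80, 0x8F), CONT, CONT]  # caps at U+10FFFF
    return None


def decode_utf8(data: bytes, per_byte: bool) -> str:
    """Decode UTF-8, replacing each faulty sequence with U+FFFD.

    per_byte=True  emits one U+FFFD per byte of the faulty sequence.
    per_byte=False emits a single U+FFFD for the whole faulty sequence.
    """
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:  # ASCII
            out.append(chr(lead))
            i += 1
            continue
        ranges = _continuation_ranges(lead)
        n = 1  # bytes consumed so far, including the lead byte
        ok = ranges is not None
        if ok:
            for lo, hi in ranges:
                if i + n < len(data) and lo <= data[i + n] <= hi:
                    n += 1
                else:
                    ok = False  # truncated or followed by a wrong byte
                    break
        if ok:
            out.append(data[i:i + n].decode("utf-8"))
        else:
            out.append(REPLACEMENT * (n if per_byte else 1))
        i += n
    return "".join(out)


truncated = b"\xf2\xbf\xbf"
print(len(decode_utf8(truncated, per_byte=True)))   # 3
print(len(decode_utf8(truncated, per_byte=False)))  # 1
```

Both results fall within the 1 to 3 range for this sequence; the sketch
only makes the difference between the two policies concrete.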
--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'