[whatwg] Handling of illegal byte-sequences (typically in UTF-8)

Thu Jun 14 18:15:47 PDT 2007

On Fri, 24 Nov 2006, Øistein E. Andersen wrote:
>
> Section 8.1.4:
> > Bytes that are not valid UTF-8 sequences must be interpreted as [...] U+FFFD
> 
> Section 9.2.2:
> > Bytes or sequences of bytes [...] that could not be converted to Unicode characters
> > must be converted to U+FFFD
> 
> If I read this correctly, section 8.1.4 requires that an illegal UTF-8 
> sequence like F2 BF BF (the first three bytes of a four-byte sequence, 
> obviously not followed by a continuation byte) be converted into exactly 
> three U+FFFD characters (one for each byte), whereas section 9.2.2 also 
> allows one single replacement character (and possibly even two) in this 
> case (and permits an arbitrary number n of repetitions of the three-byte 
> sequence to be replaced by any number of U+FFFD characters between 1 and 
> 3n).
> 
> I realise that the underspecification in section 9.2.2 may well be 
> intentional, given that this section is not limited to UTF-8. However, 
> depending on the handling chosen, the requirement could probably be 
> worded in such a way that it applies to any encoding.
> 
> Alternatively, a reference to an authoritative source would of course 
> fulfil the purpose in the particular case of UTF-8 (if such a document 
> can be found).
> 
> [Currently, an alert reader might infer that the treatment indicated in 
> section 8.1.4 would be preferable also in section 9.2.2, but such 
> inference for consistency can hardly be expected.]
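
To illustrate the two readings discussed above, here is a minimal Python sketch. The helper decode_per_byte is hypothetical and comes from neither spec section; it emits one U+FFFD for every byte that does not begin a valid UTF-8 sequence, which is the per-byte reading of section 8.1.4. Python's built-in decoder with errors="replace" is shown only as an example of a library that makes its own choice within the range section 9.2.2 would allow.

```python
REPLACEMENT = "\ufffd"

def decode_per_byte(data: bytes) -> str:
    """Hypothetical helper: one U+FFFD per byte that starts no valid sequence."""
    out = []
    i = 0
    while i < len(data):
        # Try the shortest strictly valid UTF-8 sequence starting at i.
        for width in (1, 2, 3, 4):
            chunk = data[i:i + width]
            try:
                out.append(chunk.decode("utf-8"))  # strict decoding
                i += len(chunk)
                break
            except UnicodeDecodeError:
                continue
        else:
            # No valid sequence starts here: replace this single byte.
            out.append(REPLACEMENT)
            i += 1
    return "".join(out)

truncated = b"\xf2\xbf\xbf"  # first three bytes of a four-byte sequence

per_byte = decode_per_byte(truncated)
print(per_byte.count(REPLACEMENT))  # 3, one per byte

# A library decoder may emit fewer replacement characters for the same input.
library = truncated.decode("utf-8", errors="replace")
print(library.count(REPLACEMENT))
```

Both results consist solely of U+FFFD characters; they differ only in how many are emitted, which is the point under discussion.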

On Fri, 24 Nov 2006, Henri Sivonen wrote:
> 
> I'm inclined to think that interop in error situations doesn't need to 
> go as deep as defining how many replacement characters (in the range 
> 1...number of bytes in a faulty sequence) a character decoder has to 
> emit. Apps may want to delegate character decoding to an outside library 
> whose authors don't care about the details of HTML5. (For example, it 
> appears that Safari is leaving this stuff to ICU.) Chances are that 
> there's more value in being able to use a library than in getting a 
> specific number of replacement characters on error.
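
Delegating to a library, as suggested above, might look like the following sketch. It uses Python's standard codecs module as a stand-in for an outside decoding library (an assumption; the thread itself mentions ICU), and the function name decode_stream is hypothetical. The caller only relies on the number of U+FFFD characters falling between 1 and the number of bytes in the faulty sequence.

```python
import codecs

def decode_stream(chunks):
    """Hypothetical helper: decode byte chunks as UTF-8 via the library decoder."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    parts = [decoder.decode(chunk) for chunk in chunks]
    # Flush the decoder so a truncated trailing sequence is reported.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)

# The faulty sequence F2 BF BF arrives split across two network chunks.
text = decode_stream([b"ab\xf2\xbf", b"\xbf"])
replacements = text.count("\ufffd")

# The exact count is left to the library; only the range is checked.
print(1 <= replacements <= 3)
```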

On Sat, 25 Nov 2006, Øistein E. Andersen wrote:
> 
> I agree. The current slight inconsistency should probably be amended by 
> making section 8.1.4 more liberal rather than the other way round.

Done.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'