[whatwg] Handling of illegal byte-sequences (typically in UTF-8)

Fri Nov 24 02:33:22 PST 2006

On Nov 24, 2006, at 04:11, Øistein E. Andersen wrote:

> Section 8.1.4:
>> Bytes that are not valid UTF-8 sequences must be interpreted as
>> [...] U+FFFD
>
> Section 9.2.2:
>> Bytes or sequences of bytes [...] that could not be converted to
>> Unicode characters must be converted to U+FFFD
>
> If I read this correctly, section 8.1.4 requires that an illegal UTF-8
> sequence like F2 BF BF (the first three bytes of a four-byte sequence,
> obviously not followed by a continuation byte) be converted into exactly
> three U+FFFD characters (one for each byte), whereas section 9.2.2 also
> allows a single replacement character (and possibly even two) in this
> case, and permits n repetitions of the three-byte sequence to be
> replaced by any number of U+FFFD characters between 1 and 3n.

I'm inclined to think that interop in error situations doesn't need to
go as deep as defining how many replacement characters (anywhere from 1
to the number of bytes in the faulty sequence) a character decoder has
to emit. Apps may want to delegate character decoding to an outside
library whose authors don't care about the details of HTML5. (For
example, it appears that Safari is leaving this stuff to ICU.) Chances
are that there's more value in being able to use a library than in
getting a specific number of replacement characters on error.
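
To make the difference concrete, here is a minimal sketch of the two
policies applied to the F2 BF BF example from the quoted message. It
uses Python 3 purely for illustration (nothing in this thread depends on
Python). The handler name "fffd-per-byte" and the function one_per_byte
are made up for this example. The built-in "replace" handler stands in
for a library that makes its own choice about how many U+FFFD to emit.

```python
import codecs

REPLACEMENT = "\ufffd"


def one_per_byte(exc):
    """Hypothetical handler: emit one U+FFFD for every byte in the faulty span."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # Return the replacement text and the position at which decoding resumes.
    return REPLACEMENT * (exc.end - exc.start), exc.end


# The name "fffd-per-byte" is an assumption made for this sketch.
codecs.register_error("fffd-per-byte", one_per_byte)

# F2 BF BF: the first three bytes of a four-byte sequence.
# The trailing "A" (0x41) is not a continuation byte.
data = b"\xf2\xbf\xbfA"

# Library default: the decoder decides how many U+FFFD to emit.
per_sequence = data.decode("utf-8", errors="replace")

# Per-byte policy, as in the strict reading of section 8.1.4.
per_byte = data.decode("utf-8", errors="fffd-per-byte")

print(per_sequence.count(REPLACEMENT), per_byte.count(REPLACEMENT))
```

On a current CPython I would expect this to print "1 3": the built-in
decoder reports the truncated sequence as a single error, while the
custom handler expands it to one replacement character per byte. Both
results fall inside the range that section 9.2.2 allows, but only the
second matches the reading of section 8.1.4 quoted above.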

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/