[whatwg] ISO-8859-* and the C1 control range

Tue Jun 5 09:59:49 PDT 2007

On Jun 5, 2007, at 12:18 AM, Henri Sivonen wrote:

> On May 29, 2007, at 13:13, Henri Sivonen wrote:
>
>> To avoid stepping on the toes of Charmod more than is necessary, I  
>> suggest making it non-conforming for a document to have bytes in  
>> the 0x80…0x9F range when the character encoding is declared to be  
>> one of the ISO-8859 family encodings.
>
> I've been thinking about this. I have a proposal on how to spec  
> this *conceptually* and how to implement this with error reporting.  
> I am assuming here that 1) No one ever intends C1 code points to be  
> present in the decoded stream and 2) we want, as a Charmod  
> correctness fig leaf, to make the C1 bytes non-conforming when  
> ISO-8859-1 or ISO-8859-11 was declared but Windows-1252 or  
> Windows-874 decoding is needed.
>
> Based on the behavior of Minefield and Opera 9.20, the following  
> seems to be the least Charmod-violating and least quirky approach
> that could possibly work:
>
> 1) Decode the byte stream using a decoder for whatever encoding was
> declared, even ISO-8859-1 or ISO-8859-11, according to
> ftp://ftp.unicode.org/Public/MAPPINGS/.
> 2) If a character in the decoded character stream is in the C1 code  
> point range, this is a document conformance violation.
>    2a) If the declared encoding was ISO-8859-1, replace that  
> character with the character that you get by casting the code point  
> into a byte and decoding it as Windows-1252.
>    2b) If the declared encoding was ISO-8859-11, replace that  
> character with the character that you get by casting the code point  
> into a byte and decoding it as Windows-874.
>
>
> [
> The *simplest* and most robust (and maximally Charmod-violating)  
> thing would be:
>
> 1) Decode the byte stream using a decoder for whatever encoding was
> declared, even ISO-8859-1 or ISO-8859-11, according to
> ftp://ftp.unicode.org/Public/MAPPINGS/.
> 2) If a character in the decoded character stream is in the C1 code  
> point range, this is a document conformance violation. Replace that  
> character with the character that you get by casting the code point  
> into a byte and decoding it as Windows-1252.
>
> But this isn't what Minefield, Opera 9.20 and WebKit nightlies do.
> ]
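
For reference, the first quoted proposal (steps 1, 2, 2a and 2b) can be
sketched roughly as follows in Python. This is only an illustration
under stated assumptions: the names decode_with_c1_repair and
C1_FALLBACK are hypothetical, Python's built-in codecs stand in for the
Unicode mapping tables, and a byte that the Windows code page leaves
undefined is kept as the original C1 character, a case the proposal
does not address.

```python
# Hypothetical sketch of the quoted proposal; not code from any browser.
# Declared encodings whose C1 characters get re-decoded (steps 2a and 2b).
C1_FALLBACK = {"iso-8859-1": "cp1252", "iso-8859-11": "cp874"}


def decode_with_c1_repair(data, declared):
    """Return (text, violations) for a byte stream and its declared encoding.

    violations lists the indexes (in the decoded text) of characters that
    fell in the C1 range U+0080..U+009F, i.e. conformance violations.
    """
    # Step 1: decode with the declared encoding, whatever it is.
    text = data.decode(declared)
    fallback = C1_FALLBACK.get(declared.lower())

    out = []
    violations = []
    for index, ch in enumerate(text):
        if 0x80 <= ord(ch) <= 0x9F:
            # Step 2: a C1 code point is a conformance violation.
            violations.append(index)
            if fallback is not None:
                try:
                    # Steps 2a/2b: cast the code point to a byte and
                    # decode it with the matching Windows code page.
                    ch = bytes([ord(ch)]).decode(fallback)
                except UnicodeDecodeError:
                    # Assumption: the byte is undefined in the Windows
                    # code page, so keep the C1 character unchanged.
                    pass
        out.append(ch)
    return "".join(out), violations
```

With this sketch, the bytes 0x93 and 0x94 in a document declared as
ISO-8859-1 come out as the curly quotes U+201C and U+201D, and both
positions are reported as violations. A C1 character under any other
declared encoding is reported but left alone.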

What we actually do in WebKit is always use a windows-1252 decoder  
when ISO-8859-1 is requested. I don't think it's very helpful to make  
all documents that declare an ISO-8859-1 encoding and use characters
in the C1 range nonconforming. It's true that they are counting on  
nonstandard processing of the nominally declared encoding, but I  
don't think that causes a problem in practice, as long as the rule is  
well known. It seems simpler to just make latin1 an alias for winlatin1.
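
A minimal sketch of the alias idea, in Python rather than anything
resembling WebKit's actual code: the label is looked up in an alias
table before a decoder is chosen, so nothing needs to be patched up
after decoding. The names ENCODING_ALIASES and decode_aliased are
hypothetical, and mapping bytes that windows-1252 leaves undefined to
U+FFFD is an assumption of this sketch, not something stated above.

```python
# Hypothetical sketch; the table and function names are illustrative only.
# Labels that should be treated as windows-1252 (Python codec name: cp1252).
ENCODING_ALIASES = {
    "iso-8859-1": "cp1252",
    "latin1": "cp1252",
}


def decode_aliased(data, declared):
    """Decode data, treating latin1 labels as an alias for windows-1252."""
    label = declared.strip().lower()
    codec = ENCODING_ALIASES.get(label, label)
    # Assumption: bytes undefined in the chosen code page become U+FFFD.
    return data.decode(codec, errors="replace")
```

Under this rule a document declared as ISO-8859-1 that contains the
bytes 0x93 and 0x94 simply decodes to curly quotes, and bytes outside
0x80-0x9F decode exactly as they would under ISO-8859-1.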

Regards,
Maciej