[whatwg] ISO-8859-* and the C1 control range

Henri Sivonen hsivonen at iki.fi
Tue Jun 5 00:18:49 PDT 2007


On May 29, 2007, at 13:13, Henri Sivonen wrote:

> To avoid stepping on the toes of Charmod more than is necessary, I  
> suggest making it non-conforming for a document to have bytes in  
> the 0x80…0x9F range when the character encoding is declared to be  
> one of the ISO-8859 family encodings.

I've been thinking about this. I have a proposal on how to spec this  
*conceptually* and how to implement this with error reporting. I am  
assuming here that 1) no one ever intends C1 code points to be  
present in the decoded stream and 2) we want, as a Charmod  
correctness fig leaf, to make the C1 bytes non-conforming when  
ISO-8859-1 or ISO-8859-11 was declared but Windows-1252 or  
Windows-874 decoding is needed.

Based on the behavior of Minefield and Opera 9.20, the following  
seems to be the least Charmod-violating and least quirky approach  
that could possibly work:

1) Decode the byte stream using a decoder for whatever encoding was  
declared, even ISO-8859-1 or ISO-8859-11, according to  
ftp://ftp.unicode.org/Public/MAPPINGS/.
2) If a character in the decoded character stream is in the C1 code  
point range, this is a document conformance violation.
    2a) If the declared encoding was ISO-8859-1, replace that  
character with the character that you get by casting the code point  
into a byte and decoding it as Windows-1252.
    2b) If the declared encoding was ISO-8859-11, replace that  
character with the character that you get by casting the code point  
into a byte and decoding it as Windows-874.
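
To make the steps above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: Python's bundled codecs stand in for the unicode.org mapping tables, the names decode_with_c1_fixup and FALLBACKS are hypothetical, the declared encoding label is matched only by its lowercased canonical spelling, and a C1 character whose byte is unassigned in the Windows code page is left as-is (the steps above do not say what to do in that case).

```python
# Hypothetical helper names; Python's codecs stand in for the mapping tables.
C1_FIRST, C1_LAST = 0x80, 0x9F

# Declared encoding -> Windows code page used to reinterpret C1 characters.
FALLBACKS = {"iso-8859-1": "cp1252", "iso-8859-11": "cp874"}


def decode_with_c1_fixup(data: bytes, declared: str):
    """Return (text, errors); errors lists the offsets of C1 characters."""
    text = data.decode(declared)              # step 1: decode as declared
    fallback = FALLBACKS.get(declared.lower())
    out, errors = [], []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if C1_FIRST <= cp <= C1_LAST:
            errors.append(i)                  # step 2: conformance violation
            if fallback is not None:          # steps 2a and 2b
                try:
                    ch = bytes([cp]).decode(fallback)
                except UnicodeDecodeError:
                    # Assumption: the byte is unassigned in the Windows
                    # code page, so keep the C1 character unchanged.
                    pass
        out.append(ch)
    return "".join(out), errors


if __name__ == "__main__":
    print(decode_with_c1_fixup(b"\x93quoted\x94", "iso-8859-1"))
```

For any other ISO-8859 encoding the sketch reports the error but leaves the C1 character in the stream, which is my reading of step 2 without 2a or 2b applying.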


[
The *simplest* and most robust (and maximally Charmod-violating)  
thing would be:

1) Decode the byte stream using a decoder for whatever encoding was  
declared, even ISO-8859-1 or ISO-8859-11, according to  
ftp://ftp.unicode.org/Public/MAPPINGS/.
2) If a character in the decoded character stream is in the C1 code  
point range, this is a document conformance violation. Replace that  
character with the character that you get by casting the code point  
into a byte and decoding it as Windows-1252.

But this isn't what Minefield, Opera 9.20 and WebKit nightlies do.
]
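
For comparison, the bracketed simplest variant could be sketched as a single translation table applied regardless of the declared encoding. Again this is only a sketch: the names C1_TO_1252 and decode_simple are hypothetical, Python's cp1252 codec stands in for the Windows-1252 mapping table, and C1 code points whose bytes are unassigned in that codec are left untouched as an assumption.

```python
def _build_c1_table():
    """Map each C1 code point to the Windows-1252 character for that byte."""
    table = {}
    for cp in range(0x80, 0xA0):
        try:
            table[cp] = bytes([cp]).decode("cp1252")
        except UnicodeDecodeError:
            # Assumption: unassigned in cp1252, so leave the C1 character.
            pass
    return table


C1_TO_1252 = _build_c1_table()


def decode_simple(data: bytes, declared: str):
    """Return (text, violation) for the simplest variant."""
    text = data.decode(declared)              # step 1: decode as declared
    # Step 2: any C1 character is a conformance violation ...
    violation = any(0x80 <= ord(ch) <= 0x9F for ch in text)
    # ... and is replaced via Windows-1252 whatever encoding was declared.
    return text.translate(C1_TO_1252), violation


if __name__ == "__main__":
    print(decode_simple(b"caf\xe9 \x85", "iso-8859-1"))
```

The only difference from the first approach is that the replacement table no longer depends on the declared encoding.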

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/




