[whatwg] ISO-8859-* and the C1 control range
Maciej Stachowiak
mjs at apple.com
Tue Jun 5 09:59:49 PDT 2007
On Jun 5, 2007, at 12:18 AM, Henri Sivonen wrote:
> On May 29, 2007, at 13:13, Henri Sivonen wrote:
>
>> To avoid stepping on the toes of Charmod more than is necessary, I
>> suggest making it non-conforming for a document to have bytes in
>> the 0x80…0x9F range when the character encoding is declared to be
>> one of the ISO-8859 family encodings.
>
> I've been thinking about this. I have a proposal on how to spec
> this *conceptually* and how to implement this with error reporting.
> I am assuming here that 1) No one ever intends C1 code points to be
> present in the decoded stream and 2) we want, as a Charmod
> correctness fig leaf, to make the C1 bytes non-conforming when
> ISO-8859-1 or ISO-8859-11 was declared but Windows-1252 or
> Windows-874 decoding is needed.
>
> Based on the behavior of Minefield and Opera 9.20, the following
> seems to be the least Charmod violating and least quirky approach
> that could possibly work:
>
> 1) Decode the byte stream using a decoder for whatever encoding was
> declared, even ISO-8859-1 or ISO-8859-11, according to
> ftp://ftp.unicode.org/Public/MAPPINGS/.
> 2) If a character in the decoded character stream is in the C1 code
> point range, this is a document conformance violation.
> 2a) If the declared encoding was ISO-8859-1, replace that
> character with the character that you get by casting the code point
> into a byte and decoding it as Windows-1252.
> 2b) If the declared encoding was ISO-8859-11, replace that
> character with the character that you get by casting the code point
> into a byte and decoding it as Windows-874.
>
>
> [
> The *simplest* and most robust (and maximally Charmod-violating)
> thing would be:
>
> 1) Decode the byte stream using a decoder for whatever encoding was
> declared, even ISO-8859-1 or ISO-8859-11, according to
> ftp://ftp.unicode.org/Public/MAPPINGS/.
> 2) If a character in the decoded character stream is in the C1 code
> point range, this is a document conformance violation. Replace that
> character with the character that you get by casting the code point
> into a byte and decoding it as Windows-1252.
>
> But this isn't what Minefield, Opera 9.20 and WebKit nightlies do.
> ]
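For concreteness, the first quoted proposal (steps 1, 2, 2a, 2b) could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not code from any browser: the function name, the C1_FALLBACK table, and the choice to leave a C1 character untouched when the Windows code page has no mapping for that byte are all made up for the example.

```python
# Hypothetical table: declared ISO label -> Windows code page used for the C1 fix-up.
C1_FALLBACK = {"iso-8859-1": "cp1252", "iso-8859-11": "cp874"}


def decode_with_c1_remap(data: bytes, declared: str):
    """Return (text, character offsets of C1 conformance violations)."""
    label = declared.strip().lower()
    text = data.decode(label)            # step 1: honor the declared encoding
    fallback = C1_FALLBACK.get(label)
    out, violations = [], []
    for offset, ch in enumerate(text):
        if 0x80 <= ord(ch) <= 0x9F:      # step 2: C1 code point, report it
            violations.append(offset)
            if fallback is not None:     # steps 2a / 2b: re-decode that one byte
                try:
                    ch = bytes([ord(ch)]).decode(fallback)
                except UnicodeDecodeError:
                    # Assumption: keep the C1 character when the code page
                    # leaves this byte unassigned (e.g. 0x81 in cp1252).
                    pass
        out.append(ch)
    return "".join(out), violations
```

Called on b"caf\xe9 \x93hi\x94" with "iso-8859-1", this returns the text with curly quotes in place of the two C1 characters, plus their offsets as the reported violations.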
What we actually do in WebKit is always use a windows-1252 decoder
when ISO-8859-1 is requested. I don't think it's very helpful to make
all documents that declare an ISO-8859-1 encoding and use characters
in the C1 range nonconforming. It's true that they are counting on
nonstandard processing of the nominally declared encoding, but I
don't think that causes a problem in practice, as long as the rule is
well known. It seems simpler to just make latin1 an alias for winlatin1.
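As a rough illustration of the alias idea (a Python sketch, not WebKit's actual code), the declared label is simply mapped to a different decoder before any bytes are looked at. The DECODER_ALIASES table and the decode_declared name are hypothetical, and turning the five bytes that Python's cp1252 codec leaves unassigned into U+FFFD is an assumption of this sketch.

```python
# Hypothetical alias table: declared label -> codec actually used for decoding.
DECODER_ALIASES = {"iso-8859-1": "cp1252", "latin1": "cp1252"}


def decode_declared(data: bytes, declared: str) -> str:
    """Decode data, treating latin1 labels as an alias for windows-1252."""
    label = declared.strip().lower()
    codec = DECODER_ALIASES.get(label, label)
    # Assumption: bytes with no cp1252 mapping in Python's codec
    # (0x81, 0x8D, 0x8F, 0x90, 0x9D) become U+FFFD instead of raising.
    return data.decode(codec, errors="replace")
```

With this, decode_declared(b"\x93hi\x94", "ISO-8859-1") gives curly quotes directly, with no separate C1 pass and no per-character fix-up.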
Regards,
Maciej