[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
ian at hixie.ch
Mon Sep 1 22:06:22 PDT 2008
On Wed, 30 Jul 2008, Øistein E. Andersen wrote:
> The current table seems to cover the mappings between different common
> compatible 8-bit encodings as implemented in IE7, yes. The table at
> <http://coq.no/character-tables/mime/en> gives a bit more detail, most
> of which is better kept outside HTML5 itself. However, the following
> observations can be made:
> 1. Opera, Firefox and Safari all handle US-ASCII as Windows-1252.
> IE7, on the other hand, simply ignores the high bit (as it does for
> a few other 7-bit encodings, by the way). Perhaps this
> alias could be dropped from the other browsers.
Ignoring the high bit seems like a dangerous security bug; replacing every
byte that has the high bit set with U+FFFD seems unnecessarily drastic. I've
made the spec go with the Opera/Firefox/Safari behaviour here.
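
For illustration, the three behaviours discussed for content labelled
US-ASCII can be compared on a few bytes. This is only a sketch in Python:
the function names and the sample bytes are made up for the example, and
Python's "cp1252" codec is used as a stand-in for what browsers call
Windows-1252 (it is stricter about the bytes that the table leaves
unassigned).

```python
# Hypothetical sample: "caf\xe9", then "script" wrapped in two high-bit bytes.
SAMPLE = b"caf\xe9 \xbcscript\xbe"


def as_windows_1252(raw: bytes) -> str:
    """Opera/Firefox/Safari behaviour: treat US-ASCII as Windows-1252."""
    return raw.decode("cp1252")


def ignoring_high_bit(raw: bytes) -> str:
    """IE7 behaviour as described in the quoted text: mask to 7 bits."""
    return bytes(b & 0x7F for b in raw).decode("ascii")


def high_bytes_as_fffd(raw: bytes) -> str:
    """Strict alternative: every byte above 0x7F becomes U+FFFD."""
    return raw.decode("ascii", errors="replace")


if __name__ == "__main__":
    for decode in (as_windows_1252, ignoring_high_bit, high_bytes_as_fffd):
        print(f"{decode.__name__}: {decode(SAMPLE)!r}")
```

With the high bit masked, 0xBC and 0xBE turn into "<" and ">", so bytes
that a filter looking for ASCII markup would not notice end up forming a
tag. That is the kind of problem the security concern refers to.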
> 2. Firefox and Opera seem to sniff for text/plain; charset=ISO-8859-1 (as per HTML5),
> whereas Safari seems to do the same for text/plain; charset=ISO-8859-11
> instead [Version 3.1.2 (5525.20.1)]. Bug?
I believe so.
> 3. For certain character sets, different browsers map to different, but visually
> similar Unicode characters. Sometimes, one mapping is old/outdated,
> but this is not always the case.
Not sure what I can do about that.
> 4. Delete (0x7F) and the C1 range (0x80--0x9F) are handled quite inconsistently;
> different browsers do different things for the same encoding, and the same
> browser gives analogous encodings different treatment.
> (For the early ISO-8859-* encodings, the IANA registry points to RFC 1345,
> which effectively maps 0x7F--0x9F to U+7F--U+9F, but does not really
> seem to regard this feature as an essential part of the character set:
> "the charset is often coded with both
> graphical and control character sets. If the coded character set is
> a 96-character set, it is tabled with the relevant GL set (normally
> ISO-IR-6) and with ISO 6429 as C0 and C1."
> As for the Windows-* encodings, Microsoft documentation treats bytes
> in this range as unassigned unless they are mapped to graphical characters,
> whereas Microsoft products return the underlying byte value in this case.)
I think the HTML5 spec does what is necessary here, but it may be that the
encoding specs themselves are still vague.
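
To make the C1 discrepancy concrete, here is a rough Python sketch of the
two views for a single byte. The helper names and the `passthrough` flag
are hypothetical, and Python's "latin-1" and "cp1252" codecs are only
stand-ins for the behaviours described above, not a model of any browser.
Python's "cp1252" happens to reject the unassigned bytes, which matches
the documentation view; `passthrough=True` imitates returning the
underlying byte value instead.

```python
from typing import Optional


def decode_iso_8859_1(byte: int) -> str:
    """ISO-8859-1 with ISO 6429 controls: byte 0xNN maps to U+00NN."""
    return bytes([byte]).decode("latin-1")


def decode_windows_1252(byte: int, passthrough: bool = False) -> Optional[str]:
    """Windows-1252: graphical character where one is assigned.

    For bytes that Python's cp1252 table leaves unassigned
    (0x81, 0x8D, 0x8F, 0x90, 0x9D), return None, or the code point
    equal to the byte value when passthrough is requested.
    """
    try:
        return bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        return chr(byte) if passthrough else None


if __name__ == "__main__":
    for byte in (0x7F, 0x80, 0x81, 0x93):
        print(
            hex(byte),
            repr(decode_iso_8859_1(byte)),
            repr(decode_windows_1252(byte)),
            repr(decode_windows_1252(byte, passthrough=True)),
        )
```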
> 5. IE handles KOI8-U as KOI8-RU, whereas Safari does the opposite. The former
> is probably more reasonable (assuming that letters are more important than
> line-drawing characters), but neither is actually correct given that the encodings
> are, strictly speaking, incompatible. This issue will of course look a bit different
> if it can be shown that documents containing the letter Ў/ў (only in KOI8-RU)
> are frequently mislabelled as KOI8-U.
I guess we'll see what feedback we get on this when testing begins.
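
As a small illustration of why the two labels are not interchangeable,
the Python sketch below decodes the same bytes both ways. Python ships a
"koi8_u" codec but no KOI8-RU codec, so the KOI8-RU side is approximated
by patching two positions on top of KOI8-U. The positions (0xAE for ў,
0xBE for Ў) are an assumption based on commonly published tables, not
something stated in this thread, and any other differences between the
two encodings are ignored. All names are hypothetical.

```python
# Assumed KOI8-RU overrides relative to KOI8-U (not taken from the thread).
KOI8_RU_OVERRIDES = {
    0xAE: "\u045e",  # CYRILLIC SMALL LETTER SHORT U
    0xBE: "\u040e",  # CYRILLIC CAPITAL LETTER SHORT U
}


def decode_koi8_u(raw: bytes) -> str:
    """Decode with Python's built-in KOI8-U codec."""
    return raw.decode("koi8_u")


def decode_as_koi8_ru(raw: bytes) -> str:
    """Approximate KOI8-RU: KOI8-U plus the assumed overrides above."""
    return "".join(
        KOI8_RU_OVERRIDES.get(b) or bytes([b]).decode("koi8_u") for b in raw
    )


if __name__ == "__main__":
    raw = b"\xae\xbe"
    print("labelled KOI8-U: ", repr(decode_koi8_u(raw)))
    print("treated as KOI8-RU:", repr(decode_as_koi8_ru(raw)))
```

Under these assumptions, a page labelled KOI8-U shows line-drawing
characters where a KOI8-RU reading of the same bytes shows letters, which
is the trade-off described in the quoted point.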
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'