[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

Ian Hickson ian at hixie.ch
Mon Sep 1 22:06:22 PDT 2008


On Wed, 30 Jul 2008, Øistein E. Andersen wrote:
> 
> The current table seems to cover the mappings between different common 
> compatible 8-bit encodings as implemented in IE7, yes.  The table at 
> <http://coq.no/character-tables/mime/en> gives a bit more detail, most 
> of which is better kept outside HTML5 itself. However, the following 
> observations can be made:
> 
> 1.  Opera, Firefox and Safari all handle US-ASCII as Windows-1252.
>     IE7, on the other hand, simply ignores the high bit (as it does for
>     a few other 7-bit encodings, by the way).  Perhaps this
>     alias could be dropped from the other browsers.

Ignoring the high bit seems like a dangerous security bug; replacing every 
byte that has the high bit set with U+FFFD seems unnecessarily drastic. I've 
made the spec go with the Opera/Firefox/Safari behaviour here.
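
To make the three options concrete, here is a minimal Python sketch of what 
each would do with bytes labelled US-ASCII. The function names and the sample 
input are hypothetical, chosen only for illustration; they are not taken from 
any browser or from the spec.

```python
def strip_high_bit(data: bytes) -> str:
    """IE7-style handling: mask off bit 7, then treat the result as ASCII."""
    return bytes(b & 0x7F for b in data).decode("ascii")


def replace_high_bytes(data: bytes) -> str:
    """Strict handling: every byte >= 0x80 becomes U+FFFD."""
    return data.decode("ascii", errors="replace")


def as_windows_1252(data: bytes) -> str:
    """Opera/Firefox/Safari-style handling: decode as Windows-1252."""
    return data.decode("cp1252", errors="replace")


# Hypothetical input: 0xBC and 0xBE are '<' and '>' with the high bit set.
sample = b"\xbcscript\xbe"

print(repr(strip_high_bit(sample)))      # markup appears out of non-markup bytes
print(repr(replace_high_bytes(sample)))  # high bytes are lost
print(repr(as_windows_1252(sample)))     # high bytes become graphic characters
```

Masking the high bit turns bytes that a filter would not recognise as markup 
into "<" and ">", which is why that behaviour looks like a security problem.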


> 2.  Firefox and Opera seem to sniff for text/plain; charset=ISO-8859-1 (as per HTML5),
>     whereas Safari seems to do the same for text/plain; charset=ISO-8859-11
>     instead [Version 3.1.2 (5525.20.1)].  Bug?

I believe so.


> 3.  For certain character sets, different browsers map to different, but visually
>     similar Unicode characters.  Sometimes, one mapping is old/outdated,
>     but this is not always the case.

Not sure what I can do about that.


> 4.  Delete (0x7F) and the C1 range (0x80--0x9F) are handled quite inconsistently;
>     different browsers do different things for the same encoding, and the same
>     browser gives analogous encodings different treatment.
> 
>     (For the early ISO-8859-* encodings, the IANA registry points to RFC 1345,
>     which effectively maps 0x7F--0x9F to U+7F--U+9F, but does not really
>     seem to regard this feature as an essential part of the character set:
> 
>         the charset is often coded with both
>         graphical and control character sets.  If the coded character set is
>         a 96-character set, it is tabled with the relevant GL set (normally
>         ISO-IR-6) and with ISO 6429 as C0 and C1
> 
>     As for the Windows-* encodings, Microsoft documentation treats bytes
>     in this range as unassigned unless they are mapped to graphical characters,
>     whereas Microsoft products return the underlying byte value in this case.)

I think the HTML5 spec does what is necessary here, but it may be that the 
encoding specs themselves are still vague.
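
The inconsistency is easy to see with Python's bundled codecs, used here only 
as a convenient stand-in (they are not what any browser ships). In this 
sketch, the helper name decode_range is hypothetical.

```python
def decode_range(encoding: str) -> dict:
    """Decode each byte in 0x7F..0x9F on its own; undefined bytes give U+FFFD."""
    return {
        b: bytes([b]).decode(encoding, errors="replace")
        for b in range(0x7F, 0xA0)
    }


latin1 = decode_range("iso-8859-1")
cp1252 = decode_range("cp1252")

# ISO-8859-1 passes the C1 range straight through as U+0080..U+009F.
print(repr(latin1[0x80]))

# Windows-1252 assigns graphic characters to most of the same bytes.
print(repr(cp1252[0x80]))

# Bytes that Python's cp1252 table leaves unassigned.
unassigned = sorted(b for b, ch in cp1252.items() if ch == "\ufffd")
print([hex(b) for b in unassigned])
```

Python's cp1252 table follows the documentation-style reading, in which the 
few leftover bytes are unassigned, rather than returning the underlying byte 
value.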


> 5. IE handles KOI8-U as KOI8-RU, whereas Safari does the opposite. The former
>     is probably more reasonable (assuming that letters are more important than
>     line-drawing characters), but neither is actually correct given that the encodings
>     are, strictly speaking, incompatible.  This issue will of course look a bit different
>     if it can be shown that documents containing the letter Ў/ў (only in KOI8-RU)
>     are frequently mislabelled as KOI8-U.

I guess we'll see what feedback we get on this when testing begins.
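
For anyone who wants to look at this kind of letters-versus-line-drawing 
conflict directly, here is a rough Python sketch that lists the bytes on which 
two 8-bit codecs disagree. As far as I know Python's standard library has no 
KOI8-RU codec, so the sketch compares KOI8-R with KOI8-U instead; that pair is 
a substitute chosen for illustration, not the pair discussed above, and the 
helper name codec_differences is hypothetical.

```python
def codec_differences(enc_a: str, enc_b: str) -> dict:
    """Map each byte value to (char_a, char_b) where the two codecs differ."""
    diff = {}
    for b in range(0x100):
        raw = bytes([b])
        a = raw.decode(enc_a, errors="replace")
        c = raw.decode(enc_b, errors="replace")
        if a != c:
            diff[b] = (a, c)
    return diff


# KOI8-R vs KOI8-U is a stand-in pair; KOI8-RU is not available here.
diff = codec_differences("koi8_r", "koi8_u")
for byte, (r_char, u_char) in sorted(diff.items()):
    print(f"0x{byte:02X}: KOI8-R {r_char!r}  KOI8-U {u_char!r}")
```

The same function could be pointed at a KOI8-RU table if one is supplied as a 
custom codec.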

Cheers,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

