[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

Ian Hickson ian at hixie.ch
Thu May 22 04:40:45 PDT 2008


On Thu, 13 Mar 2008, Øistein E. Andersen wrote:
> On 5th June 2007, Øistein E. Andersen wrote:
> 
> > (To do this properly, what we really ought to do is look for
> > C1 and undefined characters in all IANA charsets and semi-official
> > mappings to Unicode and check 1) whether the gaps can be filled
> > by borrowing from other encodings, and 2) whether browsers
> > actually do so. [...])
> 
> I have finally got round to looking at superset encodings.
> 
> To do this, I started with Unicode mappings from [UNI] for 8-bit 1-byte
> alphabet encodings and added mappings for other such encodings
> implemented in Opera, Safari or Firefox, mostly from [CSETS], though
> I made one for Windows-Sami-2 from a PDF.  (I then discovered that IE
> had something called Arabic-ASMO, for which no matching specification
> could be found, and subsequently reverse-engineered all of IE's encodings.
> Most of these turned out to be identical to other mappings or to add
> only characters from the PUA, but some real differences were found,
> and those are reported in the text below.)
> 
>     [UNI] <http://unicode.org/Public/MAPPINGS/>
>     [CSETS] <http://crl.nmsu.edu/~mleisher/csets.html>
> 
> All the character repertoires and encoding vectors defined by the mappings
> were then compared pairwise. (Codepoints mapped to C0, space, BS or C1
> were treated as unassigned, and directionality indicators for Arabic and
> Hebrew were ignored.) The result is quite a big and unreadable table
> [FULL], so the repertoires and encodings were clustered, which gave rise to
> the tables in [ENC], which compare charsets with fewer than 27 incompatible
> codepoints, as well as those in [REP], which compare charsets with at most
> 60 characters not found in both repertoires. (The thresholds are arbitrary, but
> more than sufficiently large to ensure that all related charsets will be
> clustered together and at the same time sufficiently small to keep the
> tables at a reasonable size.)
> 
>     [FULL] <http://coq.no/X/charset-table.html>
>     [ENC] <http://coq.no/X/charset-enc.html>
>     [REP] <http://coq.no/X/charset-rep.html>
> 
> A short summary of the most interesting/relevant results (supported by [ENC])
> can be found below.
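
(For readers following along, the pairwise comparison described above can be
roughly sketched in Python. This is only an illustration under assumptions:
it uses Python's built-in codecs as a stand-in for the [UNI] and [CSETS]
mapping files, which may differ from them in detail, and the function names
are hypothetical.)

```python
def mapping(encoding):
    """Return {byte: character or None} for a single-byte encoding.

    Assumption: Python's codec tables stand in for the mapping files.
    Bytes that decode to C0, space or C1 are treated as unassigned (None),
    as are bytes the codec refuses to decode.
    """
    table = {}
    for b in range(256):
        try:
            ch = bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            ch = None
        if ch is not None and (ord(ch) <= 0x20 or 0x80 <= ord(ch) <= 0x9F):
            ch = None
        table[b] = ch
    return table


def incompatible(a, b):
    """Bytes assigned in both mappings, but to different characters."""
    return [x for x in range(256)
            if a[x] is not None and b[x] is not None and a[x] != b[x]]


def is_superset(sup, sub):
    """True if every byte assigned in `sub` maps identically in `sup`."""
    return all(sub[x] is None or sub[x] == sup[x] for x in range(256))


if __name__ == "__main__":
    latin1 = mapping("iso-8859-1")
    cp1252 = mapping("cp1252")
    # Encoding-vector comparison in the spirit of the [ENC] tables.
    print(is_superset(cp1252, latin1), len(incompatible(latin1, cp1252)))
```

Under these assumptions, "A < B" in the summary corresponds to
is_superset(B, A) being true, and the [ENC] threshold corresponds to
len(incompatible(A, B)) < 27.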

This is quite amazing data, thank you.

I'm not sure what to do with it, frankly. Given your familiarity with the 
topic, would you say that what the spec says now is what browsers 
implement? What should we change?

Do you have input on the EUC-JP issue?


> PS: How should colour be added to tables like these in HTML5 with
>     neither of the attributes bgcolor and style?

Use the class attribute with an external stylesheet. (Possibly a data-*
attribute.)



> Note: Similarly, IE apparently handles CS-ISO-2022-JP as distinct from
>       ISO-2022-JP. This is something to keep in mind when looking at
>       multi-byte encodings.

What should we say about this?


> (TC)VN5712-2 < (TC)VN5712-1
> 
> Opera and Firefox seem to have implemented the superset only.

Should we require this mapping?

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

