[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

Tue Jun 9 15:08:23 PDT 2009

On 3 June 2009 at 23:19, Ian Hickson wrote:

> On Tue, 14 Apr 2009, Øistein E. Andersen wrote:
>>
>> HTML5 currently contains a table of encodings aliases,
>> [...]
>> GB2312 and GB_2312-80 technically refer to the *character set* GB 2312-80,
>> [...]. GBK, on the other hand, is an encoding.
>> [...]
>> There is a large number of unregistered charset strings, however, and the
>> other mappings in this table are between encodings.  Unless x-x-big5 is
>> actually supposed to refer to an encoding distinct from Big5, [this mapping]
>> should be removed.
>> [...]
>
> I believe you misunderstand the purpose of this table. The idea is to give
> a mapping of _labels_ to encodings, not encodings to encodings. I've
> clarified the text to this effect.

You seem to have added "specified by a label" to the phrase which now  
reads "an encoding specified by a label given in the first column of  
the following table" without changing the column heading ("Input  
encoding") and without defining what a "label" actually is. The  
reference to "encoding aliasing" is also intact, which seems  
misleading if the table is not supposed to map between encodings.

The concept of "misinterpret[ation] for compatibility" seems  
inappropriate for the mapping from x-x-big5 to Big5 unless the "label"  
x-x-big5 is actually supposed to specify an encoding distinct from Big5.

It is not at all clear to me what you mean by "label". It might be the  
MIME charset string with which the HTML document is labelled, but that  
would require an inordinate number of strings to be specified (e.g.,  
iso-ir-100, latin1 and IBM819 amongst others alongside ISO-8859-1), so  
this cannot possibly be the intended meaning. It might be a normalised  
form of the MIME charset string, using the IANA charset registry to  
map an "alias" to its corresponding "name" (or to the "alias"  
qualified as "preferred MIME name" if there is such an entry), but  
that does not quite seem to work either, since aliases not registered  
in the IANA charset registry would then not be covered by the aliasing  
mechanism (e.g., it would cause content labelled as x-sjis to be  
handled as unaugmented Shift_JIS despite the mapping from Shift_JIS to  
Windows-31J, since x-sjis does not and cannot figure in the IANA  
charset registry).
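
To illustrate the second interpretation, here is a minimal Python sketch of
normalising through the registry first and applying the HTML5 mapping
afterwards. The tables are heavily abridged, and UA_PRIVATE_ALIASES and
resolve() are hypothetical names invented for this example, not taken from
HTML5 or from any browser.

```python
# Abridged stand-in for the IANA charset registry: alias -> name
# (or "preferred MIME name"), keyed by lower-cased charset string.
IANA_ALIASES = {
    "iso-8859-1": "ISO-8859-1",
    "iso-ir-100": "ISO-8859-1",
    "latin1": "ISO-8859-1",
    "ibm819": "ISO-8859-1",
    "shift_jis": "Shift_JIS",
    "windows-31j": "Windows-31J",
}

# Abridged stand-in for the HTML5 table: registered name -> encoding to use.
HTML5_OVERRIDES = {
    "Shift_JIS": "Windows-31J",
}

# Assumption: the user agent also knows some unregistered strings privately.
UA_PRIVATE_ALIASES = {
    "x-sjis": "Shift_JIS",
}


def resolve(label):
    """Resolve a charset string under the 'normalise via IANA' reading."""
    key = label.strip().lower()
    name = IANA_ALIASES.get(key)
    if name is not None:
        # Registered string: the HTML5 mapping is applied to its name.
        return HTML5_OVERRIDES.get(name, name)
    # Unregistered string: normalisation fails, so the HTML5 mapping is
    # never consulted and only the private alias applies.
    return UA_PRIVATE_ALIASES.get(key)


print(resolve("Shift_JIS"))
print(resolve("x-sjis"))
```

Under these assumptions, Shift_JIS resolves to Windows-31J while x-sjis stays
at plain Shift_JIS, which is the inconsistency described above.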

I did indeed believe that the table was supposed to map between  
encodings, and this interpretation still seems to give the correct  
result in practice for non-CJK encodings (unless, of course, content  
labelled TIS-620-2533 should actually be interpreted as TIS-620 rather  
than windows-874).

On 9 June 2009 at 10:55, Anne van Kesteren wrote:

> On Tue, 09 Jun 2009 01:42:57 +0200, Øistein E. Andersen wrote:
>>
>> Shift-JIS and Windows-932 are commonly used names/labels for the
>> encodings that are registered as Shift_JIS and Windows-31J
>> (respectively) in the IANA charset registry. [...]
>
> So should HTML5 mention that Windows-932 maps to Windows-31J? (It  
> does not appear in the IANA registry.)

That is an interesting question. My (apparently wrong) understanding  
was that the table was merely supposed to provide mappings between  
encodings, since such mappings are inappropriate in non-HTML contexts  
and cannot be added to the IANA registry. It might be useful to  
include a set of MIME charset strings which cannot be or have not yet  
been registered (e.g., x-x-big5, x-sjis, windows-932) as well as  
information on how CJK character sets are implemented in practice,  
both of which seem to be necessary for compatibility.

Such information does not fit comfortably in the current table, though.
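
As a rough idea of what a separate list of charset strings could look like,
here is a small Python sketch that maps strings directly to the encoding
actually used. The selection of entries and the names LABELS and
encoding_for_label() are assumptions made for illustration only; the targets
follow the mappings discussed in this thread.

```python
# Hypothetical flat table: lower-cased charset string -> encoding to use.
# Registered and unregistered strings sit side by side.
LABELS = {
    "big5": "Big5",
    "x-x-big5": "Big5",
    "shift_jis": "Windows-31J",
    "shift-jis": "Windows-31J",
    "x-sjis": "Windows-31J",
    "windows-31j": "Windows-31J",
    "windows-932": "Windows-31J",
}


def encoding_for_label(label):
    """Look up a charset string directly; None means 'not listed'."""
    return LABELS.get(label.strip().lower())


for s in ("x-x-big5", "x-sjis", "Windows-932"):
    print(s, "->", encoding_for_label(s))
```

With a single lookup of this kind, an unregistered string such as x-sjis
receives the same treatment as the registered Shift_JIS.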

-- 
Øistein E. Andersen