[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
Øistein E. Andersen
liszt at coq.no
Tue Jun 9 15:08:23 PDT 2009
On 3 June 2009, at 23:19, Ian Hickson wrote:
> On Tue, 14 Apr 2009, Øistein E. Andersen wrote:
>>
>> HTML5 currently contains a table of encodings aliases,
>> [...]
>> GB2312 and GB_2312-80 technically refer to the *character set*
>> GB 2312-80, [...]. GBK, on the other hand, is an encoding.
>> [...]
>> There is a large number of unregistered charset strings, however, and
>> the other mappings in this table are between encodings. Unless
>> x-x-big5 is actually supposed to refer to an encoding distinct from
>> Big5, [this mapping] should be removed.
>> [...]
>
> I believe you misunderstand the purpose of this table. The idea is to
> give a mapping of _labels_ to encodings, not encodings to encodings.
> I've clarified the text to this effect.
You seem to have added "specified by a label" to the phrase which now
reads "an encoding specified by a label given in the first column of
the following table" without changing the column heading ("Input
encoding") and without defining what a "label" actually is. The
reference to "encoding aliasing" is also intact, which seems
misleading if the table is not supposed to map between encodings.
The concept of "misinterpret[ation] for compatibility" seems
inappropriate for the mapping from x-x-big5 to Big5 unless the "label"
x-x-big5 is actually supposed to specify an encoding distinct from Big5.
It is not at all clear to me what you mean by "label". It might be the
MIME charset string with which the HTML document is labelled, but that
would require an inordinate number of strings to be specified (e.g.,
iso-ir-100, latin1 and IBM819 amongst others alongside ISO-8859-1), so
this cannot possibly be the intended meaning. It might be a normalised
form of the MIME charset string, using the IANA charset registry to
map an "alias" to its corresponding "name" (or to the "alias"
qualified as "preferred MIME name" if there is such an entry), but
that does not quite seem to work either, since aliases not registered
in the IANA charset registry would then not be covered by the aliasing
mechanism (e.g., it would cause content labelled as x-sjis to be
handled as unaugmented Shift_JIS despite the mapping from Shift_JIS to
Windows-31J, since x-sjis does not and cannot figure in the IANA
charset registry).
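
To make the second reading concrete, here is a minimal sketch of what "normalise through the IANA registry, then apply the HTML5 table" might look like. The tables are heavily abridged and the names (IANA_ALIASES, HTML5_OVERRIDES, IMPLEMENTATION_ALIASES, resolve) are hypothetical, invented for illustration only; the last table stands in for whatever unregistered strings an implementation happens to recognise on its own.

```python
# Abridged registry view: lower-cased alias -> registered name.
# Only strings mentioned in this thread are listed.
IANA_ALIASES = {
    "iso-8859-1": "ISO-8859-1",
    "iso-ir-100": "ISO-8859-1",
    "latin1": "ISO-8859-1",
    "ibm819": "ISO-8859-1",
    "shift_jis": "Shift_JIS",
}

# The HTML5 table read as an encoding-to-encoding mapping (abridged).
HTML5_OVERRIDES = {
    "Shift_JIS": "Windows-31J",
}

# Assumption: unregistered strings an implementation understands natively.
IMPLEMENTATION_ALIASES = {
    "x-sjis": "Shift_JIS",
}


def resolve(label):
    """Return the encoding used for a MIME charset string, or None."""
    key = label.strip().lower()
    if key in IANA_ALIASES:
        name = IANA_ALIASES[key]
        # Registered strings are normalised first, then overridden.
        return HTML5_OVERRIDES.get(name, name)
    # Unregistered strings never pass through the registry step,
    # so the override table is bypassed entirely.
    return IMPLEMENTATION_ALIASES.get(key)


print(resolve("Shift_JIS"))  # Windows-31J
print(resolve("x-sjis"))     # Shift_JIS
```

Under these assumptions, content labelled Shift_JIS is decoded as Windows-31J while content labelled x-sjis is decoded as plain Shift_JIS, which is precisely the inconsistency described above.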
I did indeed believe that the table was supposed to map between
encodings, and this interpretation still seems to give the correct
result in practice for non-CJK encodings (unless, of course, content
labelled TIS-620-2533 should actually be interpreted as TIS-620 rather
than windows-874).
On 9 June 2009, at 10:55, Anne van Kesteren wrote:
> On Tue, 09 Jun 2009 01:42:57 +0200, Øistein E. Andersen wrote:
>>
>> Shift-JIS and Windows-932 are commonly used names/labels for the
>> encodings that are registered as Shift_JIS and Windows-31J
>> (respectively) in the IANA charset registry. [...]
>
> So should HTML5 mention that Windows-932 maps to Windows-31J? (It
> does not appear in the IANA registry.)
That is an interesting question. My (apparently wrong) understanding
was that the table was merely supposed to provide mappings between
encodings, since such mappings are inappropriate in non-HTML contexts
and cannot be added to the IANA registry. It might be useful to
include a set of MIME charset strings which cannot be or have not yet
been registered (e.g., x-x-big5, x-sjis, windows-932) as well as
information on how CJK character sets are implemented in practice,
both of which seem to be necessary for compatibility.
Such information does not fit comfortably in the current table, though.
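
For illustration only, a separate label table of the kind suggested above might be sketched as follows. The names LABEL_TO_ENCODING and encoding_for_label are hypothetical, the entries are limited to strings mentioned in this thread, and mapping x-sjis straight to Windows-31J is an assumption about the desired behaviour rather than anything the current draft states.

```python
# Hypothetical label table: lower-cased MIME charset string -> encoding
# actually used for decoding. Registered and unregistered strings sit
# side by side, so no registry lookup is needed first.
LABEL_TO_ENCODING = {
    "big5": "Big5",
    "x-x-big5": "Big5",
    "shift_jis": "Windows-31J",
    "x-sjis": "Windows-31J",
    "windows-932": "Windows-31J",
    "windows-31j": "Windows-31J",
}


def encoding_for_label(label):
    """Look up a label case-insensitively; None means 'not listed'."""
    return LABEL_TO_ENCODING.get(label.strip().lower())


print(encoding_for_label("X-SJIS"))
print(encoding_for_label("Windows-932"))
```

A flat table like this treats every string as a label in its own right, which avoids the question of whether an unregistered string has a registered name to be normalised to.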
--
Øistein E. Andersen