[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
Øistein E. Andersen
liszt at coq.no
Fri Jul 17 17:29:07 PDT 2009
On 7 Jul 2009, at 09:25, Ian Hickson wrote:
> On Tue, 9 Jun 2009, Anne van Kesteren wrote:
>> [S]hould HTML5 mention that Windows-932 maps to Windows-31J? (It does
>> not appear in the IANA registry.)
>
> I've added this mapping too, just in case.
> Added x-sjis. What are the other mappings that would be good?
Potentially quite a few... The following do not appear in the IANA
registry and seem to be supported in IE as well as in at least two of
the three browsers Safari, Firefox and Opera.
Aliases for EUC-CN or GB2312-80, ultimately mapping to GBK:
- EUC-CN
- x-euc-cn
- CN-GB
- csGB231280
Alias for EUC-JP:
- X-EUC-JP
Aliases for Big5:
- cn-big5
- x-x-big5 (already in HTML5)
Aliases for Shift_JIS or Windows-31J (which was originally called
Shift_JIS):
- x-sjis (already in HTML5)
Alias for windows-1256:
- cp1256
Name and alias for windows-874 (which does not seem to appear in the
IANA registry):
- windows-874
- DOS-874
In addition, the following legacy Macintosh encodings enjoy universal
support (IE, Safari, Firefox, Opera), but do not appear in the IANA
registry:
- x-mac-icelandic
- x-mac-arabic (somewhat incomplete implementation in IE)
- x-mac-ce (Central-European)
- x-mac-croatian
- x-mac-romanian
- x-mac-cyrillic
- x-mac-ukrainian
- x-mac-greek
- x-mac-turkish
Windows-932 is not supported in IE7 and may not be necessary; others
should probably be added if windows-932 is deemed necessary.
> I've split the table in two to avoid this issue.
It looks much better now. (The terminology is perhaps slightly
inconsistent, but that can be fixed later.)
> Earlier, you wrote:
>>
>> GB2312 and GB_2312-80 technically refer to the *character set* GB
>> 2312-80, [...]. GBK, on the other hand, is an encoding.
>
> As far as I can tell, GB2312 and GB_2312-80 are two different
> encodings
> according to IANA.
Indeed.
The following CJK character sets are listed as encodings in the IANA
registry:
- JIS_C6226-1978
- JIS_C6226-1983
- JIS_X0212-1990
- GB_2312-80
- KS_C_5601-1987
All these character sets are defined as a 94x94 matrix with rows and
columns numbered from 1 to 94 (inclusive). According to RFC1345, a
character is to be encoded as the two-byte sequence (row number + 32),
(column number + 32) in the eponymous encoding. (The two-byte
sequences are thus the same as in an ISO-2022 encoding, but only one
character set is available, and there are no escape sequences or
anything remotely similar.)
In addition, GB_2312, which is really GB_2312-80 with the year
omitted, has been defined as what is properly known as EUC-CN.
JIS_C6226-1978, JIS_C6226-1983 and JIS_X0212-1990 do not seem to be
supported in browsers at all. Both GB_2312-80 and GB_2312 are taken
to mean GBK, which is a superset of EUC-CN. KS_C_5601-1987 is taken
to mean windows-949, a superset of EUC-KR, in Safari, Firefox and
Opera (IE treats it as the union of windows-949 and ISO-2022-KR, which
may or may not be needed for compatibility).
This is all quite confusing, and what is called GB_2312 in IANA really
should be renamed to EUC-CN (keeping GB_2312 as an alias). The HTML5
tables are now technically correct (provided that the encoding names
be interpreted strictly according to the IANA registry).
Very minor detail: The capitalisation of Windows/windows is
inconsistent in the IANA registry; you would have to write, e.g.,
windows-932 and Windows-31J to follow IANA.
Other character encoding issues:
--------------------------------
ASCII-compatibility:
The note in ‘2.1.5 Character encodings’ seems to say that ‘variants of
ISO-2022’ (presumably including common ones like ISO-2022-CN,
ISO-2022KR and ISO-2022-JP) are ASCII-compatible, whereas HZ-GB-2312
is not, and I cannot find anything in Section 2.1.5 that would explain
this difference.
Discouraged encodings:
‘4.2.5.5 Specifying the document's character encoding’ advises against
certain encodings. (Incidentally, this advice probably deserves not
to be ‘hidden’ in a section nominally reserved for character encoding
*declaration* issues.) In particular:
> Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212
> (JIS_X0212-1990), encodings based on ISO-2022, and encodings based
> on EBCDIC.
It is not clear what this means (e.g., the character set
JIS_C6226-1983 in any encoding, or only when encoded alone according
to RFC1345 as described above); the list of discouraged encodings
seems conspicuously short if it is supposed to be complete; and the
lack of rationale makes it difficult to understand why these encodings
are considered particularly harmful (JIS_C6226-1983 v. JIS_C6226-1978
or ISO-2022 v. HZ, to mention but two at least initially puzzling
cases). It might be better to say *why* particular encodings are
better avoided, whether or not the list of discouraged encodings be
presented as definitive.
Minor grammar detail in 4.2.5.5:
> Conformance checkers may advise against authors using legacy
> encodings.
This is ambiguous. It should probably be ‘advise against authors’
using legacy encodings’ or better ‘advise authors against using
legacy encodings’.
--
Øistein E. Andersen
More information about the whatwg
mailing list