[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

Fri Jul 17 17:29:07 PDT 2009

On 7 Jul 2009, at 09:25, Ian Hickson wrote:

> On Tue, 9 Jun 2009, Anne van Kesteren wrote:
>> [S]hould HTML5 mention that Windows-932 maps to Windows-31J? (It does
>> not appear in the IANA registry.)
>
> I've added this mapping too, just in case.

> Added x-sjis. What are the other mappings that would be good?

Potentially quite a few...  The following do not appear in the IANA  
registry and seem to be supported in IE as well as in at least two of  
the three browsers Safari, Firefox and Opera.

Aliases for EUC-CN or GB2312-80, ultimately mapping to GBK:
- EUC-CN
- x-euc-cn
- CN-GB
- csGB231280

Alias for EUC-JP:
- X-EUC-JP

Aliases for Big5:
- cn-big5
- x-x-big5 (already in HTML5)

Aliases for Shift_JIS or Windows-31J (which was originally called  
Shift_JIS):
- x-sjis (already in HTML5)

Alias for windows-1256:
- cp1256

Name and alias for windows-874 (which does not seem to appear in the  
IANA registry):
- windows-874
- DOS-874
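
For illustration only, the alias lists above could be written down as a
case-insensitive lookup table.  This is a minimal sketch, not the HTML5
algorithm: the names EXTRA_ALIASES and resolve_label are hypothetical,
and x-sjis is mapped to Shift_JIS here although Windows-31J would be an
equally defensible target.

```python
# Hypothetical alias table built only from the lists in this message.
# Keys are lower-cased labels; values are the encodings they resolve to.
EXTRA_ALIASES = {
    "euc-cn": "GBK",
    "x-euc-cn": "GBK",
    "cn-gb": "GBK",
    "csgb231280": "GBK",
    "x-euc-jp": "EUC-JP",
    "cn-big5": "Big5",
    "x-x-big5": "Big5",
    "x-sjis": "Shift_JIS",      # assumption: could equally be Windows-31J
    "cp1256": "windows-1256",
    "windows-874": "windows-874",
    "dos-874": "windows-874",
}


def resolve_label(label, registry=EXTRA_ALIASES):
    """Return the encoding for a label, or None if the label is unknown."""
    key = label.strip().lower()  # labels are compared case-insensitively
    return registry.get(key)


print(resolve_label("X-EUC-JP"))
print(resolve_label("DOS-874"))
```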

In addition, the following legacy Macintosh encodings enjoy universal  
support (IE, Safari, Firefox, Opera), but do not appear in the IANA  
registry:
- x-mac-icelandic
- x-mac-arabic (somewhat incomplete implementation in IE)
- x-mac-ce (Central-European)
- x-mac-croatian
- x-mac-romanian
- x-mac-cyrillic
- x-mac-ukrainian
- x-mac-greek
- x-mac-turkish

Windows-932 is not supported in IE7 and may not be necessary; others  
should probably be added if windows-932 is deemed necessary.

> I've split the table in two to avoid this issue.

It looks much better now.  (The terminology is perhaps slightly  
inconsistent, but that can be fixed later.)

> Earlier, you wrote:
>>
>> GB2312 and GB_2312-80 technically refer to the *character set* GB
>> 2312-80, [...]. GBK, on the other hand, is an encoding.
>
> As far as I can tell, GB2312 and GB_2312-80 are two different  
> encodings
> according to IANA.

Indeed.

The following CJK character sets are listed as encodings in the IANA  
registry:
- JIS_C6226-1978
- JIS_C6226-1983
- JIS_X0212-1990
- GB_2312-80
- KS_C_5601-1987

All these character sets are defined as a 94x94 matrix with rows and  
columns numbered from 1 to 94 (inclusive). According to RFC1345, a  
character is to be encoded as the two-byte sequence (row number + 32),  
(column number + 32) in the eponymous encoding. (The two-byte  
sequences are thus the same as in an ISO-2022 encoding, but only one  
character set is available, and there are no escape sequences or  
anything remotely similar.)
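
The arithmetic can be sketched in a few lines of Python.  The function
names rfc1345_bytes and euc_bytes are made up for this example; the
second one assumes the usual EUC convention of setting the high bit of
both bytes, which is not spelled out in RFC1345 itself.

```python
def rfc1345_bytes(row, col):
    """Encode a 94x94 matrix position as (row + 32), (column + 32)."""
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("row and column must be in the range 1..94")
    return bytes((row + 32, col + 32))


def euc_bytes(row, col):
    """Same position in EUC form: both bytes with the high bit set."""
    return bytes(b | 0x80 for b in rfc1345_bytes(row, col))


# Row 16, column 1 of the matrix.
print(rfc1345_bytes(16, 1).hex())   # 3021
print(euc_bytes(16, 1).hex())       # b0a1

# Python's "gb2312" codec is EUC-CN, so it can decode the EUC form of a
# GB 2312-80 position directly.
print(euc_bytes(16, 1).decode("gb2312"))
```

The 7-bit form is only meaningful when the receiver already knows which
character set is in use, which is why it needs no escape sequences.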

In addition, GB2312, which is really GB_2312-80 with the underscore and
year omitted, has been defined as what is properly known as EUC-CN.

JIS_C6226-1978, JIS_C6226-1983 and JIS_X0212-1990 do not seem to be
supported in browsers at all.  Both GB_2312-80 and GB2312 are taken
to mean GBK, which is a superset of EUC-CN.  KS_C_5601-1987 is taken
to mean windows-949, a superset of EUC-KR, in Safari, Firefox and
Opera (IE treats it as the union of windows-949 and ISO-2022-KR, which
may or may not be needed for compatibility).
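
The superset relation between GBK and EUC-CN can be checked with
Python's built-in codecs.  This is only a sketch: it assumes that
Python's "gb2312" codec is a strict EUC-CN decoder, and the helper name
decodes_as is hypothetical.

```python
def decodes_as(data, codec):
    """Return True if the bytes decode without error under the codec."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


# Text limited to GB 2312-80 yields the same bytes under both codecs.
simplified = "中文"
print(simplified.encode("gb2312") == simplified.encode("gbk"))

# A traditional character that GBK adds on top of GB 2312-80.
gbk_only = "們".encode("gbk")
print(decodes_as(gbk_only, "gbk"))      # accepted by the superset
print(decodes_as(gbk_only, "gb2312"))   # rejected by strict EUC-CN
```

A page labelled with the EUC-CN name but containing such bytes only
displays correctly if the label is treated as GBK, which is presumably
why browsers do so.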

This is all quite confusing, and what is called GB2312 in IANA really
should be renamed to EUC-CN (keeping GB2312 as an alias).  The HTML5
tables are now technically correct (provided that the encoding names
are interpreted strictly according to the IANA registry).

Very minor detail:  The capitalisation of Windows/windows is
inconsistent in the IANA registry; you would have to write, e.g.,
windows-932 and Windows-31J to follow IANA.

Other character encoding issues:
--------------------------------

ASCII-compatibility:
The note in ‘2.1.5 Character encodings’ seems to say that ‘variants of
ISO-2022’ (presumably including common ones like ISO-2022-CN,
ISO-2022-KR and ISO-2022-JP) are ASCII-compatible, whereas HZ-GB-2312
is not, and I cannot find anything in Section 2.1.5 that would explain
this difference.

Discouraged encodings:
‘4.2.5.5 Specifying the document's character encoding’ advises against  
certain encodings.  (Incidentally, this advice probably deserves not  
to be ‘hidden’ in a section nominally reserved for character encoding  
*declaration* issues.)  In particular:

> Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212  
> (JIS_X0212-1990), encodings based on ISO-2022, and encodings based  
> on EBCDIC.

It is not clear what this means (e.g., the character set  
JIS_C6226-1983 in any encoding, or only when encoded alone according  
to RFC1345 as described above); the list of discouraged encodings  
seems conspicuously short if it is supposed to be complete; and the  
lack of rationale makes it difficult to understand why these encodings  
are considered particularly harmful (JIS_C6226-1983 v. JIS_C6226-1978  
or ISO-2022 v. HZ, to mention but two at least initially puzzling  
cases).  It might be better to say *why* particular encodings are  
better avoided, whether or not the list of discouraged encodings be  
presented as definitive.

Minor grammar detail in 4.2.5.5:
> Conformance checkers may advise against authors using legacy  
> encodings.

This is ambiguous.  It should probably be ‘advise against authors’  
using legacy encodings’  or better ‘advise authors against using  
legacy encodings’.

-- 
Øistein E. Andersen