[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

Mon Apr 13 17:14:25 PDT 2009

This e-mail is an attempt to give a relatively concise yet reasonably  
complete overview of non-Unicode character sets and encodings for  
`Chinese characters', excluding those which are not supported by at  
least one of the four browsers IE, Safari, Firefox and Opera  
(henceforth `all browsers'), and tentatively avoiding technical  
details which are out of scope for HTML5 unless they are important to  
gain a general understanding of the relevant issues.

To avoid unnecessary confusion, the following three concepts are kept  
distinct:

1) Character set: A collection of characters, typically defined as a  
matrix with 94 rows and 94 columns.  (A character set with more than  
one matrix is said to have multiple planes.)  The ones officially  
registered `for use with escape sequences' (typically in ISO-2022  
encodings, see below) can be found at <http://www.itscj.ipsj.or.jp/ISO-IR/overview.htm 
 >.

2) Encoding: Defines how a given character (typically defined by its  
row and column numbers) from a given character set can be encoded as a  
sequence of bytes.  All the encodings discussed below allow multiple  
character sets to be encoded.  [ISO-2022 encodings use only 7-bit  
bytes and employ escape sequences to switch between different  
character sets. EUC encodings use bytes < 128 for ASCII (or something  
similar) and bytes >= 128 to encode other character sets.]

3) MIME charset string: This is the string used, e.g., in a HTTP  
Content-Type header to indicate the *encoding*.  Many of these can be  
found at <http://www.iana.org/assignments/character-sets>.

Some information about browser support for specific character sets,  
encodings and MIME charset strings can be found at <http://coq.no/character-tables/mime/iso-2022/en 
 >, <http://coq.no/character-tables/mime/euc/en> and <http://coq.no/character-tables/mime/locale-specific/en 
 >.

The notation a < b means that a is a proper subset of b; a and b can  
be either character sets or encodings.

******************************************
* What should HTML 5 say about all this? *
******************************************

This section gives a summary of superset encodings which are either  
universally supported or potentially needed for compatibility.

(Anyone who is going to read the entire e-mail will probably prefer to  
read the sections *Chinese*, *Japanese* and *Korean* at this point and  
return to this section afterwards.)

Superset encodings (stricto sensu)
----------------------------------

HTML5 currently contains a table of encodings aliases, of which the  
following involve Chinese characters:

1) EUC-KR          ->  Windows-949
2) GB2312          ->  GBK
3) GB_2312-80      ->  GBK
4) KS_C_5601-1987  ->  Windows-949
5) x-x-big5        ->  Big5

EUC-KR < Windows-949, and all browsers do 1), so this is reasonable  
and probably needed.

GB2312 and GB_2312-80 technically refer to the *character set* GB  
2312-80, which can be expressed not only in EUC-CN encoding, but also  
in ISO-2022-CN encoding and HZ encoding.  GBK, on the other hand, is  
an encoding.  EUC-CN < GBK.  It would be more correct to remove 2) and  
3) and instead add:
    EUC-CN      ->  GBK

Admittedly, EUC-CN is sometimes called `8-bit GB encoding', and  
registered MIME charset strings include GB_2312-80 and GB_2312-80 as  
distinct entries (but not EUC-CN), so a note to this effect might be  
appropriate.

(Additionally, GBK is slightly ambiguous, so make sure not to  
reference an incomplete or outdated version without pointing out  
necessary amendments/additions.)

Similarly, EUC-KR is sometimes referred to as `eight-bit KS' or  
`KS_C_5601-1987', which Ken Lunde characterises as `incorrect and  
dangerous' in his book /CJKV Information Processing/.  It would be  
more correct to remove 4).

Unlike EUC-CN, EUC-KR is a registered MIME charset string, but  
KS_C_5601-1987 has a distinct entry, so a note might again be  
appropriate.

As for 5), the MIME charset string x-x-big5 does indeed correspond to  
Big5 encoding (or rather an extension thereof) in all browsers but  
Opera.  There is a large number of unregistered charset strings,  
however, and the other mappings in this table are between encodings.   
Unless x-x-big5 is actually supposed to refer to an encoding distinct  
from Big5, 5) should be removed.

Instead (depending on the reference ultimately given for Big5), it may  
be necessary to note that at least certain ETen extensions should be  
regarded as part of Big5.

In addition, Shift_JIS < Windows-31J, and all browsers implement this  
mapping, so the following should be added:
    Shift_JIS       ->  Windows-31J

Further superset encodings (probably not needed)
------------------------------------------------

ISO-2022-CN < ISO-2022-CN-EXT

This is reasonable, but probably not necessary: Firefox does it,  
Safari does not, Opera does not implement the superset, IE does not  
even implement the subset.  Distinguishing between them is pointless.

EUC-CN < GBK < GB18030

The first step is probably sufficient, and the second is potentially  
problematic if an incompatible extension to GBK were to be invented.

ISO-2022-JP < ISO-2022-JP-1 < ISO-2022-JP-2 < ISO-2022-JP-3 < ISO-2022- 
JP-2004

No browser attempts to distinguish between these, which would be  
completely meaningless.  On the other hand, IE only implements  
ISO-2022-JP, and only Firefox implements ISO-2022-JP-2, so these may  
not actually be necessary.

Shift_JIS_X0213-2000 < Shift_JIS-2004

Safari arguably does this, and there is no need to make a distinction  
between them, but no browser seems to implement either in a meaningful  
way at the moment.

Superset *character sets* (universally recognised)
--------------------------------------------------

JIS C 6226-1978 < JIS X 0208-1983 < JIS X 0208-1990/1997

Whenever one of the subsets are referred to in any variety of ISO-2022- 
JP, the superset is used instead.

JIS X 0208-1990/1997 should be understood as including NEC and IBM  
extensions.  This character set is part of all varieties of ISO-2022- 
JP, as well as EUC-JP and Shift-JIS.

KS X 1001:1992 < KS X 1001:1998 < KS X 1001:2002

Only three characters have been added in total.  All but Safari have  
implemented the two characters added in 1998.  This character set is  
part of ISO-2022-KR, EUC-KR and Johab.

Other additions to ISO-2022 encodings (potentially essential)
-------------------------------------------------------------

All varieties of ISO-2022-JP must include the Katakana character set  
which was not officially added to the standard until ISO-2022-3.

The escape sequence for Swedish should be accepted as a synonym for  
JIS-Roman.

(IE furthermore allows to select katakana using shift-in/out.)

All these extensions were originally defined in the older JIS  
encoding, which predates ISO-2022-JP.

8-bit bytes in 7-bit encodings
------------------------------

IE interprets 8-bit bytes (i.e., octets with the high bit set) in 7- 
bit encodings as if they had occurred in an 8-bit encoding of the same  
language, viz:

	HZ-GB-2312   ->   GBK
	ISO-2022-JP  ->   Shift-JIS
	ISO-2022-KR  ->   Windows-949

Other browsers (at least Safari and Opera) sometimes ignore the  
specified MIME charset string and try to detect/sniff the encoding  
instead, which is prone to error and no less `wrong'.

I would suggest other browsers to support the mappings above, which  
should hopefully enable them to trust the MIME charset string.

                               ***

The remainder of this e-mail gives further details about character  
sets (single underline) and encodings (double underline), divided into  
three sections according to the language for which they are intended  
(Chinese, Japanese and Korean).

***********
* Chinese *
***********

Character sets for simplified Chinese characters
------------------------------------------------

GB2312-80 < GB 6345.1-86 < ISO-IR-165:1992

GB2312-80 < GB 8565.2-88 < ISO-IR-165:1992

(It follows that GB 6345.1-86 and GB 8565.2-88 have no conflicting
assignments.)

Most browsers support only GB2312-80.  Safari supports ISO-IR-165:1992  
as well, but the two are kept distinct.

Character sets for traditional Chinese characters
-------------------------------------------------

CNS 11643-1992:
Plane 1 and plane 2 defined in 1986.
Plane 14 added in 1988.
Plane 15 added in 1988.
In 1992, plane 3 was defined as the first part of plane 14,
the remainder of plane 14 was put into plane 4, many of the
characters from plane 15 were added to planes 4--7, other
characters were added to planes 4--7, and planes 14 and 15 were
removed; the result was seven planes, 1--7.

HZ encoding for simplified Chinese
==================================

HZ-GB-2312 supports:
- ASCII
— GB2312-80

IE furthermore allows GB2312-80 encoded as in EUC-CN, as well as GBK  
extensions (8-bit).

ISO-2022 encoding for traditional and simplified Chinese
========================================================

ISO-2022-CN supports:
- ASCII
- GB2312-80
- CNS 11643-1992, planes 1 and 2

ISO-2022-CN-EXT supports in addition:
- ISO-IR-165
- CNS 11643-1992, planes 3--7
- (theoretically, further character sets, but which cannot be
    selected because escape sequences have not been allocated)

IE does not support ISO-2022 for Chinese.
ISO-2022-CN-EXT is implemented in Safari (complete) and Firefox  
(missing ISO-IR-165).

ISO-2022-CN < ISO-2022-CN-EXT

Firefox treats ISO-2022-CN as ISO-2022-CN-EXT, whereas Safari does  
not.  There does not seem to be any reason not to.

EUC encoding for simplified Chinese and extensions thereof
==========================================================

EUC-CN supports:
- ASCII
- GB2312-80

GBK adds in particular all Chinese characters in Unicode 1.1 not  
included in GB2312-80.

GB18030 adds all remaining Unicode characters.

EUC-CN < GBK < GB2312-80

Windows-936 is very similar to GBK and probably the only variant  
implemented in browsers.  Windows-936 includes a few characters in  
addition to GBK; conversely, GBK apparently includes some characters  
not in Windows-936, at least not originally.  GBK should probably  
refer to Windows-936, possibly with later additions (I have yet to see  
an official GBK specification).

All browsers (except Firefox) treat EUC-CN as GBK/Windows-936.

Firefox instead treats EUC-CN as GB18030, keeping GBK/Windows-936 apart.

Only Safari supports Mac-specific additions to EUC-CN called MacOS-S;  
IE and Opera handles this as pure EUC-CN, which is a fairly good fall- 
back mechanism.

EUC encoding for traditional Chinese
====================================

EUC-TW supports:
- ASCII
- CNS 11643-1992, planes 1--7

It may previously have included:
- CNS 11643-1992, planes 14 and 15

DEC Hanyu provides a different (8-bit) encoding for:
- CNS 11643-1992, planes 2--4

All browsers support ASCII and CNS 11643-1992, plane 1 (albeit IE,  
Safari and Firefox each require a different MIME charset string!).

Safari, Firefox and Opera support CNS 11643-1992, plane 2 encoding  
according to EUC-TW; IE instead supports it when encoded as DEC Hanyu.

Opera supports plane 14; Firefox supports planes 3--7.

EUC-TW and DEC Hanyu are not conflicting, so it would be possible to  
support planes 2--4 (or at least plane 2) according to both standards.  
Plane 1 can already be encoded in two different ways according to EUC- 
TW (and Opera supports both), so this does not really add any  
problems.  Similarly, supporting planes 14 and 15 as well as planes  
2--7 is completely unproblematic.  However, the current degree of  
incompatibility between browsers would seem to suggest that EUC-TW is  
not a very popular encoding.

Big5 encoding for traditional Chinese
=====================================

Big5 is (roughly) an encoding that supports:
- ASCII
- CNS 11643-1992, planes 1 and 2

(Historically, Big5 predates CNS 11643-1992)

Extensions include:
- ETen
- MacOS-T
- Hong Kong extensions
- Big5+
- Big5E
- Big5-2003
- Unicode-At-On

All browsers support some ETen extensions; only IE does not support  
them all.

ETen and MacOS-T extensions are compatible, and IE supports both  
(given the MIME charset string referring to MacOS-T), but Safari does  
not and this is almost certainly not needed.

Hong Kong extensions are incompatible with ETen extensions, so a  
separate MIME charset string is needed to activate Hong Kong extensions.

Big5 < Big5+
Big5 < Big5E
Big5 < Big5-2003

However, these three extensions are all incompatible, and at least  
some of them are incompatible with other extensions.

Big5+ and the later, smaller Big5E are not implemented in browsers, as  
far as I can tell.

Firefox adds characters from Big5-2003 and (according to bug reports)  
Unicode-At-On.  I have not found an authoritative Big5-2003  
specification, but handling Big5 as Big5-2003 (adding at least ETen  
extensions if they are not part of Big5-2003 already) might be a good  
idea.

ETen encoding for traditional Chinese
=====================================

ETen is an encoding that supports:
- ASCII
- CNS 11643-1992, planes 1 and 2
- ETen extensions

Only IE supports this particular encoding.

************
* Japanese *
************

Character sets for Japanese characters
--------------------------------------

JIS X 0201 (Katakana)
JIS C 6226-1978
JIS X 0208-1983
JIS X 0208-1990/1997
JIS X 0212-1990
JIS X 0213-2000 Plane 1
JIS X 0213-2000 Plane 2
JIS X 0213-2004 Plane 1

JIS C 6226-1978 < JIS X 0208-1983 < JIS X 0208-1990/1997 < JIS X  
0213-2000 Plane 1 < JIS X 0213-2004 Plane 1

(There are a few incompatible changes, but those should officially be  
regarded as `corrections'.)

Characters from JIS X 0212-1990 were included in JIS X 0213-2000 Plane  
1.

There is also a	Japanese ASCII variant (JIS Roman) with yen and macron  
instead of backslash and tilde.  However, IE makes no distinction  
between ASCII and JIS Roman, but uses a hybrid if either is needed.

IE furthermore shows a yen symbol for &#x5C;.

ISO-2022 encoding for Japanese
==============================

ISO-2022-JP < ISO-2022-JP-1 < ISO-2022-JP-2 < ISO-2022-JP-3 < ISO-2022- 
JP-2004

JIS is a precursor for ISO-2022-JP.

No browser distinguishes between any of these encodings.

The following lists the character sets that can be encoded in  
different variants of ISO-2022 according to the specifications.

ISO-2022-JP:
- ASCII
- JIS Roman
- JIS C 6226-1978
- JIS X 0208-1983

ISO-2022-JP-1 adds:
- JIS X 0212-1990

ISO-2022-JP-2 adds:
- GB 2312-80 (Chinese)
- KS X 1001 (Korean)
- ISO 8859-1 (Western-European)
- ISO 8859-7 (Monotonic Greek)

ISO-2022-JP-3 adds:
- Katakana
- JIS X 0213-2000 Plane 1
- JIS X 0213-2000 Plane 2

ISO-2022-JP-2004 adds:
- JIS X 0213-2004 Plane 1

In practice, the situation is rather different:

The escape sequences reserved for JIS C 6226-1978 and JIS X 0208-1983  
instead selects the superset JIS X 0208-1990/1997, whose escape  
sequence is not recognised.

IE incorrectly selects JIS X 0208-1990/1997 also when the escape  
sequence for JIS X 0212-1990 is used, but the two are completely  
incompatible.  I have no idea whether it is common to use the wrong  
escape sequence in this particular case.

Only Firefox supports the non-Japanese character sets added in  
ISO-2022-JP-2.

No browser supports JIS X 0213 (in ISO-2022 encoding).

Only Safari does not include IBM extensions, in both NEC and to the  
extent possible IBM (non-Shift-JIS) positions.

IE furthermore interprets 8-bit characters as Shift-JIS and allows  
shift-in/shift-out control characters to indicate Katakana, as defined  
in the earlier JIS standard. Other browsers might want to add this.   
(Some other IE extensions are completely insane and almost certainly  
not needed for compatibility.)

The escape sequence reserved for 7-bit Swedish (which is not included  
in any ISO-2022-JP variant) must instead select JIS Roman.

EUC encoding for Japanese
=========================

EUC-JP supports:
- ASCII
- JIS X 0208-1990/1997
- Katakana
- JIS X 0212-1990

IE and Safari does not support JIS X 0212-1990.

IBM extensions in NEC and to the extent possible IBM (non-Shift-JIS)  
positions are universally supported (except for Safari, which does not  
support NEC positions).

Shift-JIS encoding for Japanese
===============================

Shift-JIS supports:
- ASCII
- Katakana
- JIS X 0208-1990/1997

All browsers furthermore supports NEC symbols as well as IBM  
extensions in both NEC and IBM (Shift-JIS) positions.  This is  
actually Windows-932:

Shift-JIS < Windows-932

There are also other extensions, incompatible with Windows-932:

Shift-JIS < Shift-JIS X0213 < Shift-JIS-2004

Shift-JIS X0213 adds:
- Shift_JISX0213-2000 plane 1
- Shift_JISX0213-2000 plane 2

Shift-JIS-2004 adds instead:
- Shift_JISX0213-2004 plane 1
- Shift_JISX0213-2000 plane 2 (same as previous encoding)

Safari supports the latter, but I have not yet found a MIME charset  
string which triggers it. (Surprisingly and somewhat stupidly,  
Shift_JIS_X0213-2000 triggers Windows-932 in Safari, whereas no other  
browser even supports this string.)

**********
* Korean *
**********

Character sets for Korean characters
------------------------------------

KS X 1001:1992

Two characters were added in 1998, and another in 2002.  Only Safari  
does still not support the additions from 1998.

Hangul syllables which are not included in precomposed form can be  
encoded as 8-byte sequences, 2 bytes for for each of the following:  
specific `composition' code, initial consonant, medial vowel, final  
consonant.  This is not supported unless noted otherwise below. (Not  
actually tested for Johab, for which it is irrelevant.)

IE uses a ASCII/KS-Roman hybrid with won instead of backslash (when  
compared to ASCII) and furthermore displays won for &#x5C;.

ISO-2022 encoding for Korean
============================

ISO-2022-KR supports:
- ASCII
- KS X 1001:1992

Safari displays won instead of backslash (as IE does it for all  
encodings).

IE treats 8-bit characters as Windows-949.

EUC encoding for Korean
=======================

EUC-KR supports:
- ASCII
- KS X 1001:1992

Firefox supports 8-byte Hangul encoding.

Only Safari does not support the Microsoft UHC extension (which adds  
all missing precomposed hangul).  The combination is also known as  
Windows-949.

Only Safari supports the Mac-specific HangulTalk extensions.

EUC-KR < Windows-949
EUC-KR < HangulTalk

Johab encoding for Korean
=========================

EUC-KR supports:
- ASCII
- KS X 1001:1992 (non-hangul)
- All possible hangul (including those in KS X 1001:1992)

This encoding contains the same characters as Windows-949, but  
arranged more systematically.  Unfortunately, the encoding is not  
compatible with EUC-KR.

Opera does not support Johab.  Safari does not render my test page at  
all.

-- 
Øistein E. Andersen