[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
Øistein E. Andersen
liszt at coq.no
Mon Apr 13 17:14:25 PDT 2009
This e-mail is an attempt to give a relatively concise yet reasonably
complete overview of non-Unicode character sets and encodings for
`Chinese characters', excluding those which are not supported by at
least one of the four browsers IE, Safari, Firefox and Opera
(henceforth `all browsers'), and tentatively avoiding technical
details which are out of scope for HTML5 unless they are important to
gain a general understanding of the relevant issues.
To avoid unnecessary confusion, the following three concepts are kept
distinct:
1) Character set: A collection of characters, typically defined as a
matrix with 94 rows and 94 columns. (A character set with more than
one matrix is said to have multiple planes.) The ones officially
registered `for use with escape sequences' (typically in ISO-2022
encodings, see below) can be found at
<http://www.itscj.ipsj.or.jp/ISO-IR/overview.htm>.
2) Encoding: Defines how a given character (typically defined by its
row and column numbers) from a given character set can be encoded as a
sequence of bytes. All the encodings discussed below allow multiple
character sets to be encoded. [ISO-2022 encodings use only 7-bit
bytes and employ escape sequences to switch between different
character sets. EUC encodings use bytes < 128 for ASCII (or something
similar) and bytes >= 128 to encode other character sets.]
3) MIME charset string: This is the string used, e.g., in an HTTP
Content-Type header to indicate the *encoding*. Many of these can be
found at <http://www.iana.org/assignments/character-sets>.
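To make the distinction between 1) and 2) concrete, here is a minimal Python sketch that takes one position in a character set and encodes it in two ways. The function names are made up for illustration, and the position used (row 38, column 92 of JIS X 0208 for 日) is taken from the usual code tables rather than from anything above; Python's own codecs serve only as a cross-check.

```python
# Illustrative sketch: one character-set position, two encodings.
ESC_JISX0208_1983 = b"\x1b$B"  # ISO-2022-JP: switch to JIS X 0208-1983
ESC_ASCII = b"\x1b(B"          # ISO-2022-JP: switch back to ASCII


def euc_bytes(row: int, col: int) -> bytes:
    """EUC: both bytes have the high bit set (0xA0 + row, 0xA0 + column)."""
    return bytes([0xA0 + row, 0xA0 + col])


def iso2022_bytes(row: int, col: int) -> bytes:
    """ISO-2022: 7-bit bytes (0x20 + row, 0x20 + column) between escapes."""
    return ESC_JISX0208_1983 + bytes([0x20 + row, 0x20 + col]) + ESC_ASCII


# The character 日 is assumed to sit at row 38, column 92 of JIS X 0208.
print(euc_bytes(38, 92) == "日".encode("euc_jp"))
print(iso2022_bytes(38, 92) == "日".encode("iso2022_jp"))
```

The character set position is the same in both cases; only the byte-level representation differs.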
Some information about browser support for specific character sets,
encodings and MIME charset strings can be found at
<http://coq.no/character-tables/mime/iso-2022/en>,
<http://coq.no/character-tables/mime/euc/en> and
<http://coq.no/character-tables/mime/locale-specific/en>.
The notation a < b means that a is a proper subset of b; a and b can
be either character sets or encodings.
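For encodings, a < b can be tested mechanically, at least against one particular implementation. The following Python sketch checks Python's own codec tables (not any browser's), and only looks at double-byte sequences; the helper names are hypothetical.

```python
def decode_or_none(data: bytes, codec: str):
    """Decode strictly; return None when the bytes are invalid in `codec`."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


def is_sub_encoding(sub: str, sup: str) -> bool:
    """True if every double-byte sequence that is valid in `sub`
    decodes to the same text in `sup`."""
    for lead in range(0x81, 0xFF):
        for trail in range(0x41, 0xFF):
            data = bytes([lead, trail])
            expected = decode_or_none(data, sub)
            if expected is not None and decode_or_none(data, sup) != expected:
                return False
    return True


# In Python's tables, EUC-KR is contained in Windows-949 (cp949),
# but not the other way round, so the subset is proper.
print(is_sub_encoding("euc_kr", "cp949"), is_sub_encoding("cp949", "euc_kr"))
```

A result obtained this way says something about the tables shipped with Python only; the browser behaviour described in this e-mail was tested separately.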
******************************************
* What should HTML 5 say about all this? *
******************************************
This section gives a summary of superset encodings which are either
universally supported or potentially needed for compatibility.
(Anyone who is going to read the entire e-mail will probably prefer to
read the sections *Chinese*, *Japanese* and *Korean* at this point and
return to this section afterwards.)
Superset encodings (stricto sensu)
----------------------------------
HTML5 currently contains a table of encoding aliases, of which the
following involve Chinese characters:
1) EUC-KR -> Windows-949
2) GB2312 -> GBK
3) GB_2312-80 -> GBK
4) KS_C_5601-1987 -> Windows-949
5) x-x-big5 -> Big5
EUC-KR < Windows-949, and all browsers do 1), so this is reasonable
and probably needed.
GB2312 and GB_2312-80 technically refer to the *character set* GB
2312-80, which can be expressed not only in EUC-CN encoding, but also
in ISO-2022-CN encoding and HZ encoding. GBK, on the other hand, is
an encoding. EUC-CN < GBK. It would be more correct to remove 2) and
3) and instead add:
EUC-CN -> GBK
Admittedly, EUC-CN is sometimes called `8-bit GB encoding', and
registered MIME charset strings include GB2312 and GB_2312-80 as
distinct entries (but not EUC-CN), so a note to this effect might be
appropriate.
(Additionally, GBK is slightly ambiguous, so make sure not to
reference an incomplete or outdated version without pointing out
necessary amendments/additions.)
Similarly, EUC-KR is sometimes referred to as `eight-bit KS' or
`KS_C_5601-1987', which Ken Lunde characterises as `incorrect and
dangerous' in his book /CJKV Information Processing/. It would be
more correct to remove 4).
Unlike EUC-CN, EUC-KR is a registered MIME charset string, but
KS_C_5601-1987 has a distinct entry, so a note might again be
appropriate.
As for 5), the MIME charset string x-x-big5 does indeed correspond to
Big5 encoding (or rather an extension thereof) in all browsers but
Opera. There is a large number of unregistered charset strings,
however, and the other mappings in this table are between encodings.
Unless x-x-big5 is actually supposed to refer to an encoding distinct
from Big5, 5) should be removed.
Instead (depending on the reference ultimately given for Big5), it may
be necessary to note that at least certain ETen extensions should be
regarded as part of Big5.
In addition, Shift_JIS < Windows-31J, and all browsers implement this
mapping, so the following should be added:
Shift_JIS -> Windows-31J
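A decoder that applies the superset mappings proposed in this section could look roughly like the Python sketch below. The table and function names are hypothetical; the right-hand sides are Python's codec names for Windows-949, GBK and Windows-31J, whose tables may differ in detail from what browsers ship.

```python
# Hypothetical label -> codec table for the mappings proposed above.
SUPERSET_CODEC = {
    "euc-kr": "cp949",     # EUC-KR    -> Windows-949
    "euc-cn": "gbk",       # EUC-CN    -> GBK
    "shift_jis": "cp932",  # Shift_JIS -> Windows-31J
}


def decode_labelled(data: bytes, label: str) -> str:
    """Decode `data` using the superset encoding for `label`, if one is listed."""
    codec = SUPERSET_CODEC.get(label.strip().lower(), label)
    return data.decode(codec)


# A NEC extension character that is valid in Windows-31J only.
nec_circled_one = "①".encode("cp932")
print(decode_labelled(nec_circled_one, "Shift_JIS"))
```

With the strict `shift_jis` codec the same bytes are rejected, which is the kind of failure the mapping is meant to avoid.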
Further superset encodings (probably not needed)
------------------------------------------------
ISO-2022-CN < ISO-2022-CN-EXT
This is reasonable, but probably not necessary: Firefox does it,
Safari does not, Opera does not implement the superset, IE does not
even implement the subset. Distinguishing between them is pointless.
EUC-CN < GBK < GB18030
The first step is probably sufficient, and the second is potentially
problematic if an incompatible extension to GBK were to be invented.
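The chain can be illustrated with Python's codecs, taking `gb2312` as EUC-CN. This is only a sketch against Python's tables, with a hypothetical helper name and three sample characters of my own choosing.

```python
def smallest_gb_encoding(ch: str):
    """Return the first of EUC-CN, GBK, GB18030 (Python codec names)
    that can encode `ch`, or None if none of them can."""
    for codec in ("gb2312", "gbk", "gb18030"):
        try:
            ch.encode(codec)
            return codec
        except UnicodeEncodeError:
            continue
    return None


# A simplified character, a traditional character and a hangul syllable.
for ch in "中體한":
    print(ch, smallest_gb_encoding(ch))
```

Characters that exist at an earlier step keep the same bytes at the later steps, which is what makes treating the subset as the superset harmless.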
ISO-2022-JP < ISO-2022-JP-1 < ISO-2022-JP-2 < ISO-2022-JP-3 < ISO-2022-JP-2004
No browser attempts to distinguish between these, which would be
completely meaningless. On the other hand, IE only implements
ISO-2022-JP, and only Firefox implements ISO-2022-JP-2, so these may
not actually be necessary.
Shift_JIS_X0213-2000 < Shift_JIS-2004
Safari arguably does this, and there is no need to make a distinction
between them, but no browser seems to implement either in a meaningful
way at the moment.
Superset *character sets* (universally recognised)
--------------------------------------------------
JIS C 6226-1978 < JIS X 0208-1983 < JIS X 0208-1990/1997
Whenever one of the subsets is referred to in any variety of
ISO-2022-JP, the superset is used instead.
JIS X 0208-1990/1997 should be understood as including NEC and IBM
extensions. This character set is part of all varieties of
ISO-2022-JP, as well as EUC-JP and Shift-JIS.
KS X 1001:1992 < KS X 1001:1998 < KS X 1001:2002
Only three characters have been added in total. All but Safari have
implemented the two characters added in 1998. This character set is
part of ISO-2022-KR, EUC-KR and Johab.
Other additions to ISO-2022 encodings (potentially essential)
-------------------------------------------------------------
All varieties of ISO-2022-JP must include the Katakana character set,
which was not officially added to the standard until ISO-2022-JP-3.
The escape sequence for Swedish should be accepted as a synonym for
JIS-Roman.
(IE furthermore allows katakana to be selected using shift-in/out.)
All these extensions were originally defined in the older JIS
encoding, which predates ISO-2022-JP.
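The Swedish synonym can be handled by a small preprocessing step in front of an existing decoder. The Python sketch below assumes that the Swedish escape sequence in question is ESC ( H and that JIS-Roman is ESC ( J; neither byte sequence is spelled out above, and the function name is made up.

```python
ESC_SWEDISH = b"\x1b(H"    # assumed: the escape sequence registered for Swedish
ESC_JIS_ROMAN = b"\x1b(J"  # assumed: the escape sequence for JIS-Roman


def decode_iso2022_jp_lenient(data: bytes) -> str:
    """Treat the Swedish escape as JIS-Roman, then decode as ISO-2022-JP.

    ESC never occurs inside character data in ISO-2022-JP, so a plain
    byte-level replacement cannot hit the middle of a character.
    """
    return data.replace(ESC_SWEDISH, ESC_JIS_ROMAN).decode("iso2022_jp")


print(decode_iso2022_jp_lenient(b"\x1b$BF|\x1b(Habc\x1b(B"))
```

Katakana support would need more than this, since Python's plain `iso2022_jp` codec does not include that character set.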
8-bit bytes in 7-bit encodings
------------------------------
IE interprets 8-bit bytes (i.e., octets with the high bit set) in
7-bit encodings as if they had occurred in an 8-bit encoding of the
same language, viz:
HZ-GB-2312 -> GBK
ISO-2022-JP -> Shift-JIS
ISO-2022-KR -> Windows-949
Other browsers (at least Safari and Opera) sometimes ignore the
specified MIME charset string and try to detect/sniff the encoding
instead, which is prone to error and no less `wrong'.
I would suggest that other browsers support the mappings above, which
should hopefully enable them to trust the MIME charset string.
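A very coarse approximation of these mappings, in Python, might look as follows. Note the simplification: the sketch switches the whole input to the 8-bit encoding as soon as a single high-bit byte is present, whereas IE interprets such bytes where they occur. The table and function names are hypothetical, and the codecs are Python's.

```python
# Hypothetical table: declared label -> (7-bit codec, 8-bit fallback codec).
EIGHT_BIT_FALLBACK = {
    "hz-gb-2312": ("hz", "gbk"),
    "iso-2022-jp": ("iso2022_jp", "shift_jis"),
    "iso-2022-kr": ("iso2022_kr", "cp949"),
}


def decode_seven_bit(data: bytes, label: str) -> str:
    """Decode data declared as a 7-bit encoding, falling back to the
    8-bit encoding of the same language if any high-bit byte is found."""
    seven_bit, eight_bit = EIGHT_BIT_FALLBACK[label.lower()]
    if any(byte >= 0x80 for byte in data):
        return data.decode(eight_bit)
    return data.decode(seven_bit)


# Shift-JIS data wrongly labelled as ISO-2022-JP still decodes.
print(decode_seven_bit("日本語".encode("shift_jis"), "ISO-2022-JP"))
```

Properly labelled 7-bit data takes the first branch unchanged, so the fallback only affects input that would otherwise be an error.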
***
The remainder of this e-mail gives further details about character
sets (single underline) and encodings (double underline), divided into
three sections according to the language for which they are intended
(Chinese, Japanese and Korean).
***********
* Chinese *
***********
Character sets for simplified Chinese characters
------------------------------------------------
GB2312-80 < GB 6345.1-86 < ISO-IR-165:1992
GB2312-80 < GB 8565.2-88 < ISO-IR-165:1992
(It follows that GB 6345.1-86 and GB 8565.2-88 have no conflicting
assignments.)
Most browsers support only GB2312-80. Safari supports ISO-IR-165:1992
as well, but the two are kept distinct.
Character sets for traditional Chinese characters
-------------------------------------------------
CNS 11643-1992:
Plane 1 and plane 2 defined in 1986.
Plane 14 added in 1988.
Plane 15 added in 1988.
In 1992, plane 3 was defined as the first part of plane 14,
the remainder of plane 14 was put into plane 4, many of the
characters from plane 15 were added to planes 4--7, other
characters were added to planes 4--7, and planes 14 and 15 were
removed; the result was seven planes, 1--7.
HZ encoding for simplified Chinese
==================================
HZ-GB-2312 supports:
- ASCII
- GB2312-80
IE furthermore allows GB2312-80 encoded as in EUC-CN, as well as GBK
extensions (8-bit).
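The relationship between HZ and EUC-CN is simple enough to sketch in a few lines of Python: the GB2312-80 bytes are the same apart from the high bit, and HZ brackets them with ~{ and ~}. The function name is hypothetical, and the sketch handles pure double-byte text only (no ASCII, no line continuation).

```python
def euc_cn_to_hz(euc: bytes) -> bytes:
    """Re-encode pure double-byte EUC-CN text as HZ by clearing the high bit."""
    if len(euc) % 2 or any(not 0xA1 <= byte <= 0xFE for byte in euc):
        raise ValueError("expected double-byte GB2312-80 data only")
    return b"~{" + bytes(byte & 0x7F for byte in euc) + b"~}"


print(euc_cn_to_hz("中文".encode("gb2312")))
```

This also shows why accepting EUC-CN bytes inside HZ, as IE does, is unambiguous: genuine HZ data never contains bytes with the high bit set.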
ISO-2022 encoding for traditional and simplified Chinese
========================================================
ISO-2022-CN supports:
- ASCII
- GB2312-80
- CNS 11643-1992, planes 1 and 2
ISO-2022-CN-EXT supports in addition:
- ISO-IR-165
- CNS 11643-1992, planes 3--7
- (theoretically, further character sets, but which cannot be
selected because escape sequences have not been allocated)
IE does not support ISO-2022 for Chinese.
ISO-2022-CN-EXT is implemented in Safari (complete) and Firefox
(missing ISO-IR-165).
ISO-2022-CN < ISO-2022-CN-EXT
Firefox treats ISO-2022-CN as ISO-2022-CN-EXT, whereas Safari does
not. There does not seem to be any reason not to.
EUC encoding for simplified Chinese and extensions thereof
==========================================================
EUC-CN supports:
- ASCII
- GB2312-80
GBK adds in particular all Chinese characters in Unicode 1.1 not
included in GB2312-80.
GB18030 adds all remaining Unicode characters.
EUC-CN < GBK < GB18030
Windows-936 is very similar to GBK and probably the only variant
implemented in browsers. Windows-936 includes a few characters in
addition to GBK; conversely, GBK apparently includes some characters
not in Windows-936, at least not originally. GBK should probably
refer to Windows-936, possibly with later additions (I have yet to see
an official GBK specification).
All browsers (except Firefox) treat EUC-CN as GBK/Windows-936.
Firefox instead treats EUC-CN as GB18030, keeping GBK/Windows-936 apart.
Only Safari supports Mac-specific additions to EUC-CN called MacOS-S;
IE and Opera handle this as pure EUC-CN, which is a fairly good
fall-back mechanism.
EUC encoding for traditional Chinese
====================================
EUC-TW supports:
- ASCII
- CNS 11643-1992, planes 1--7
It may previously have included:
- CNS 11643-1992, planes 14 and 15
DEC Hanyu provides a different (8-bit) encoding for:
- CNS 11643-1992, planes 2--4
All browsers support ASCII and CNS 11643-1992, plane 1 (albeit IE,
Safari and Firefox each require a different MIME charset string!).
Safari, Firefox and Opera support CNS 11643-1992, plane 2 encoding
according to EUC-TW; IE instead supports it when encoded as DEC Hanyu.
Opera supports plane 14; Firefox supports planes 3--7.
EUC-TW and DEC Hanyu are not conflicting, so it would be possible to
support planes 2--4 (or at least plane 2) according to both standards.
Plane 1 can already be encoded in two different ways according to
EUC-TW (and Opera supports both), so this does not really add any
problems. Similarly, supporting planes 14 and 15 as well as planes
2--7 is completely unproblematic. However, the current degree of
incompatibility between browsers would seem to suggest that EUC-TW is
not a very popular encoding.
Big5 encoding for traditional Chinese
=====================================
Big5 is (roughly) an encoding that supports:
- ASCII
- CNS 11643-1992, planes 1 and 2
(Historically, Big5 predates CNS 11643-1992)
Extensions include:
- ETen
- MacOS-T
- Hong Kong extensions
- Big5+
- Big5E
- Big5-2003
- Unicode-At-On
All browsers support some ETen extensions; only IE does not support
them all.
ETen and MacOS-T extensions are compatible, and IE supports both
(given the MIME charset string referring to MacOS-T), but Safari does
not and this is almost certainly not needed.
Hong Kong extensions are incompatible with ETen extensions, so a
separate MIME charset string is needed to activate Hong Kong extensions.
Big5 < Big5+
Big5 < Big5E
Big5 < Big5-2003
However, these three extensions are all incompatible, and at least
some of them are incompatible with other extensions.
Big5+ and the later, smaller Big5E are not implemented in browsers, as
far as I can tell.
Firefox adds characters from Big5-2003 and (according to bug reports)
Unicode-At-On. I have not found an authoritative Big5-2003
specification, but handling Big5 as Big5-2003 (adding at least ETen
extensions if they are not part of Big5-2003 already) might be a good
idea.
ETen encoding for traditional Chinese
=====================================
ETen is an encoding that supports:
- ASCII
- CNS 11643-1992, planes 1 and 2
- ETen extensions
Only IE supports this particular encoding.
************
* Japanese *
************
Character sets for Japanese characters
--------------------------------------
JIS X 0201 (Katakana)
JIS C 6226-1978
JIS X 0208-1983
JIS X 0208-1990/1997
JIS X 0212-1990
JIS X 0213-2000 Plane 1
JIS X 0213-2000 Plane 2
JIS X 0213-2004 Plane 1
JIS C 6226-1978 < JIS X 0208-1983 < JIS X 0208-1990/1997 < JIS X
0213-2000 Plane 1 < JIS X 0213-2004 Plane 1
(There are a few incompatible changes, but those should officially be
regarded as `corrections'.)
Characters from JIS X 0212-1990 were included in JIS X 0213-2000 Plane
1.
There is also a Japanese ASCII variant (JIS Roman) with yen and macron
instead of backslash and tilde. However, IE makes no distinction
between ASCII and JIS Roman, but uses a hybrid if either is needed.
IE furthermore shows a yen symbol for \.
ISO-2022 encoding for Japanese
==============================
ISO-2022-JP < ISO-2022-JP-1 < ISO-2022-JP-2 < ISO-2022-JP-3 < ISO-2022-JP-2004
JIS is a precursor for ISO-2022-JP.
No browser distinguishes between any of these encodings.
The following lists the character sets that can be encoded in
different variants of ISO-2022 according to the specifications.
ISO-2022-JP:
- ASCII
- JIS Roman
- JIS C 6226-1978
- JIS X 0208-1983
ISO-2022-JP-1 adds:
- JIS X 0212-1990
ISO-2022-JP-2 adds:
- GB 2312-80 (Chinese)
- KS X 1001 (Korean)
- ISO 8859-1 (Western-European)
- ISO 8859-7 (Monotonic Greek)
ISO-2022-JP-3 adds:
- Katakana
- JIS X 0213-2000 Plane 1
- JIS X 0213-2000 Plane 2
ISO-2022-JP-2004 adds:
- JIS X 0213-2004 Plane 1
In practice, the situation is rather different:
The escape sequences reserved for JIS C 6226-1978 and JIS X 0208-1983
instead select the superset JIS X 0208-1990/1997, whose escape
sequence is not recognised.
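The practical rule for the two older escape sequences can be written down as a lookup table. The following Python sketch lists the character sets designated in a byte stream; the escape sequences themselves are not given above and are filled in here from the usual ISO-2022-JP descriptions, and all names in the code are made up.

```python
import re

# Escape sequence -> character set actually selected in practice.
PRACTICAL_DESIGNATION = {
    b"\x1b(B": "ASCII",
    b"\x1b(J": "JIS Roman",
    b"\x1b$@": "JIS X 0208-1990/1997",  # nominally JIS C 6226-1978
    b"\x1b$B": "JIS X 0208-1990/1997",  # nominally JIS X 0208-1983
    b"\x1b$(D": "JIS X 0212-1990",
}
ESCAPE = re.compile(rb"\x1b(?:\$\(D|\$[@B]|\([BJ])")


def designations(data: bytes) -> list:
    """Return the character sets selected by the escapes in `data`, in order."""
    return [PRACTICAL_DESIGNATION[m.group()] for m in ESCAPE.finditer(data)]


print(designations(b"\x1b$@F|\x1b(B"))
```

Python's `iso2022_jp` codec appears to follow the same practice, in that it decodes the 1978 and 1983 escapes with the same table.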
IE incorrectly selects JIS X 0208-1990/1997 also when the escape
sequence for JIS X 0212-1990 is used, but the two are completely
incompatible. I have no idea whether it is common to use the wrong
escape sequence in this particular case.
Only Firefox supports the non-Japanese character sets added in
ISO-2022-JP-2.
No browser supports JIS X 0213 (in ISO-2022 encoding).
All browsers except Safari include IBM extensions, both in NEC
positions and, to the extent possible, in IBM (non-Shift-JIS)
positions.
IE furthermore interprets 8-bit characters as Shift-JIS and allows
shift-in/shift-out control characters to indicate Katakana, as defined
in the earlier JIS standard. Other browsers might want to add this.
(Some other IE extensions are completely insane and almost certainly
not needed for compatibility.)
The escape sequence reserved for 7-bit Swedish (which is not included
in any ISO-2022-JP variant) must instead select JIS Roman.
EUC encoding for Japanese
=========================
EUC-JP supports:
- ASCII
- JIS X 0208-1990/1997
- Katakana
- JIS X 0212-1990
IE and Safari do not support JIS X 0212-1990.
IBM extensions in NEC and to the extent possible IBM (non-Shift-JIS)
positions are universally supported (except for Safari, which does not
support NEC positions).
Shift-JIS encoding for Japanese
===============================
Shift-JIS supports:
- ASCII
- Katakana
- JIS X 0208-1990/1997
All browsers furthermore support NEC symbols as well as IBM
extensions in both NEC and IBM (Shift-JIS) positions. This is
actually Windows-932:
Shift-JIS < Windows-932
There are also other extensions, incompatible with Windows-932:
Shift-JIS < Shift-JIS X0213 < Shift-JIS-2004
Shift-JIS X0213 adds:
- JIS X 0213-2000 plane 1
- JIS X 0213-2000 plane 2
Shift-JIS-2004 adds instead:
- JIS X 0213-2004 plane 1
- JIS X 0213-2000 plane 2 (same as previous encoding)
Safari supports the latter, but I have not yet found a MIME charset
string which triggers it. (Surprisingly and somewhat stupidly,
Shift_JIS_X0213-2000 triggers Windows-932 in Safari, whereas no other
browser even supports this string.)
**********
* Korean *
**********
Character sets for Korean characters
------------------------------------
KS X 1001:1992
Two characters were added in 1998, and another in 2002. Only Safari
still does not support the additions from 1998.
Hangul syllables which are not included in precomposed form can be
encoded as 8-byte sequences, 2 bytes for each of the following:
specific `composition' code, initial consonant, medial vowel, final
consonant. This is not supported unless noted otherwise below. (Not
actually tested for Johab, for which it is irrelevant.)
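The 8-byte form can be built from the Unicode code point of the syllable. The Python sketch below assumes that the `composition' code and the empty final slot are both the hangul filler (U+3164), and that the letters are the compatibility jamo from row 4 of KS X 1001; the names are hypothetical, and Python's `cp949` codec is used only to look up the two-byte code of each jamo.

```python
FILLER = "\u3164"  # assumed: composition code and placeholder for an empty slot
# Compatibility jamo in the order used by Unicode syllable arithmetic.
INITIALS = [chr(0x3131 + i) for i in
            (0, 1, 3, 6, 7, 8, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29)]
MEDIALS = [chr(0x314F + i) for i in range(21)]
FINALS = [FILLER] + [chr(cp) for cp in range(0x3131, 0x314F)
                     if cp not in (0x3138, 0x3143, 0x3149)]


def eight_byte_hangul(syllable: str) -> bytes:
    """Encode one precomposed syllable as composition code + three jamo."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 11172:
        raise ValueError("not a precomposed hangul syllable")
    initial, rest = divmod(index, 21 * 28)
    medial, final = divmod(rest, 28)
    parts = [FILLER, INITIALS[initial], MEDIALS[medial], FINALS[final]]
    return b"".join(part.encode("cp949") for part in parts)


# 똠 is one of the syllables missing from KS X 1001 in precomposed form.
print(eight_byte_hangul("똠").hex(" "))
```

Python's `euc_kr` decoder accepts such sequences, so the result can be checked by decoding it again.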
IE uses an ASCII/KS-Roman hybrid with won instead of backslash (when
compared to ASCII) and furthermore displays won for \.
ISO-2022 encoding for Korean
============================
ISO-2022-KR supports:
- ASCII
- KS X 1001:1992
Safari displays won instead of backslash (as IE does for all
encodings).
IE treats 8-bit characters as Windows-949.
EUC encoding for Korean
=======================
EUC-KR supports:
- ASCII
- KS X 1001:1992
Firefox supports 8-byte Hangul encoding.
Only Safari does not support the Microsoft UHC extension (which adds
all missing precomposed hangul). The combination is also known as
Windows-949.
Only Safari supports the Mac-specific HangulTalk extensions.
EUC-KR < Windows-949
EUC-KR < HangulTalk
Johab encoding for Korean
=========================
Johab supports:
- ASCII
- KS X 1001:1992 (non-hangul)
- All possible hangul (including those in KS X 1001:1992)
This encoding contains the same characters as Windows-949, but
arranged more systematically. Unfortunately, the encoding is not
compatible with EUC-KR.
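Both claims can be checked against Python's codecs, which is of course not the same as checking browsers. The sketch below counts how many of the 11172 precomposed hangul syllables each codec can express in two bytes; the helper name is made up.

```python
# All 11172 precomposed hangul syllables in Unicode.
HANGUL = [chr(cp) for cp in range(0xAC00, 0xD7A4)]


def two_byte_coverage(codec: str) -> int:
    """Count the syllables that `codec` encodes as exactly two bytes."""
    count = 0
    for ch in HANGUL:
        try:
            if len(ch.encode(codec)) == 2:
                count += 1
        except UnicodeEncodeError:
            pass
    return count


for codec in ("euc_kr", "cp949", "johab"):
    print(codec, two_byte_coverage(codec))

# Same repertoire, different bytes: Johab is not an extension of EUC-KR.
print("한".encode("johab") != "한".encode("euc_kr"))
```

The count for `euc_kr` reflects only the syllables present in KS X 1001 in precomposed form; the other two codecs cover the full range.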
Opera does not support Johab. Safari does not render my test page at
all.
--
Øistein E. Andersen