[whatwg] Web Encodings
NARUSE, Yui
naruse at airemix.jp
Sat Sep 26 08:03:55 PDT 2009
Anne van Kesteren wrote:
> On Sun, 30 Aug 2009 03:47:34 +0200, Ian Hickson <ian at hixie.ch> wrote:
>> I've backed off UTS22. I think we need the IANA list updated, though, to
>> include the aliases browsers support. I understand you are working on
>> this? I would like to remove the table in the HTML5 spec that defines
>> such mappings, once that is done.
>
> Part of the alias table is apparently incorrect. I will be working on
> registering the required aliases though, yes, once some more research is
> complete. This will however not solve at least the following two problems:
>
> * Some encodings need to be decoded (and encoded) using another
> encoding. (The other table HTML5 contains.)
> * The standards for encodings do not always match the required
> implementation of the encoding. Apparently just like with anything else
> encoding standards do not match reality.
>
> (Initially it also seemed to be a problem to register encodings with an
> "x-" prefix, but I think we're past that now, though of course we can't
> be sure until it actually succeeds.)
As far as I know, all major Japanese encodings have this problem,
and so do some other encodings.
As you know, IE's Shift_JIS implementation is actually Windows-31J,
and the other major Web browsers follow it.
http://www.microsoft.com/globaldev/reference/dbcs/932.htm
NOTE:
According to IANA Charsets, the 7-bit area is defined as JIS X0201:1997.
But the actual Windows-31J/CP932 maps its 0x5C to U+005C,
and Japanese Windows fonts use a Yen Sign glyph for U+005C.
The same problem applies to 0x7E (Tilde vs. Overline).
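The 0x5C/0x7E difference can be sketched in Python. This is only an illustration: the JISX0201_ROMAN table and both helper functions are hand-written for this example, and Python's "cp932" codec is assumed to be a reasonable stand-in for Windows-31J, not the table any particular browser ships.

```python
import unicodedata

# Hypothetical table for this sketch: the two positions where the
# JIS X 0201 Roman set differs from US-ASCII.
JISX0201_ROMAN = {0x5C: "\u00a5", 0x7E: "\u203e"}


def decode_jisx0201_roman(data: bytes) -> str:
    """Decode 7-bit bytes as JIS X 0201 Roman (illustration only)."""
    if any(b > 0x7F for b in data):
        raise ValueError("only 7-bit bytes are handled in this sketch")
    return "".join(JISX0201_ROMAN.get(b, chr(b)) for b in data)


def describe(text: str) -> list:
    """Return 'U+XXXX NAME' for each character of text."""
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch)) for ch in text]


if __name__ == "__main__":
    raw = b"\x5c\x7e"
    # What the registered definition implies for the 7-bit area.
    print("JIS X 0201:", describe(decode_jisx0201_roman(raw)))
    # What a CP932-style decoder produces for the same bytes.
    print("cp932     :", describe(raw.decode("cp932")))
```

The bytes are identical in both cases; only the Unicode code points chosen by the decoder differ, which is why the mismatch is invisible on a Japanese Windows font and visible everywhere else.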
You may know EUC-JP, another major Japanese encoding.
IANA Charsets defines it as follows:
code set 0: US-ASCII (a single 7-bit byte set)
code set 1: JIS X0208-1990 (a double 8-bit byte set)
            restricted to A0-FF in both bytes
code set 2: Half Width Katakana (a single 7-bit byte set)
            requiring SS2 as the character prefix
code set 3: JIS X0212-1990 (a double 7-bit byte set)
            restricted to A0-FF in both bytes
            requiring SS3 as the character prefix
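To make the four code sets concrete, here is a minimal Python sketch that splits an EUC-JP byte string by code set. It looks only at the framing (lead byte, SS2/SS3 prefix, byte ranges) and consults no character table. The function name is made up for this example, and the sketch assumes the usual values SS2 = 0x8E and SS3 = 0x8F and the narrower A1-FE byte range rather than the A0-FF wording of the registry.

```python
SS2 = 0x8E  # single shift 2: prefix for code set 2 (assumed value)
SS3 = 0x8F  # single shift 3: prefix for code set 3 (assumed value)


def split_code_sets(data: bytes) -> list:
    """Split EUC-JP bytes into (code_set, raw_bytes) pairs.

    Only the byte framing is checked; whether a position is actually
    assigned to a character is not.
    """
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            code_set, size = 0, 1
        elif lead == SS2:
            code_set, size = 2, 2
        elif lead == SS3:
            code_set, size = 3, 3
        elif 0xA1 <= lead <= 0xFE:
            code_set, size = 1, 2
        else:
            raise ValueError("unexpected byte 0x%02X at offset %d" % (lead, i))
        chunk = data[i:i + size]
        if len(chunk) < size:
            raise ValueError("truncated sequence at offset %d" % i)
        # Every byte after the lead byte (or after the SS2/SS3 prefix)
        # must be in the 8-bit range.
        if any(not 0xA1 <= b <= 0xFE for b in chunk[1:]):
            raise ValueError("bad trailing byte at offset %d" % i)
        out.append((code_set, chunk))
        i += size
    return out


if __name__ == "__main__":
    sample = b"A\xa4\xa2\x8e\xb1\x8f\xb0\xa1"
    for code_set, chunk in split_code_sets(sample):
        print(code_set, chunk.hex())
```

A decoder that drops code set 3 only has to reject the SS3 branch; the rest of the framing stays the same.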
But IE's EUC-JP implementation, called CP51932, is as follows:
http://reddog.s35.xrea.com/wiki/cp51932.enc.html
code set 0: US-ASCII (a single 7-bit byte set)
code set 1: JIS X0208-1990 (a double 8-bit byte set),
            NEC special characters (Row 13),
            NEC selection of IBM extensions (Rows 89 to 92),
            and IBM extensions (Rows 115 to 119)
            restricted to A0-FF in both bytes
code set 2: Half Width Katakana (a single 7-bit byte set)
            requiring SS2 as the character prefix
code set 3: not supported
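As a rough illustration of the list above, the following Python sketch names the CP51932 area that a lead byte falls into. Everything here is an assumption made for the example: the function name, the SS2/SS3 values (0x8E/0x8F), and the usual EUC convention that a code set 1 lead byte is 0xA0 plus the row number. Rows 115 to 119 are not modelled, because 0xA0 + 115 no longer fits in one byte under that convention.

```python
SS2 = 0x8E  # prefix for code set 2 (assumed value)
SS3 = 0x8F  # prefix for code set 3 (assumed value)

# Row numbers copied from the list above. Rows count from 1, so a
# code set 1 lead byte is assumed to be 0xA0 + row.
NEC_SPECIAL_ROWS = {13}
NEC_SELECTED_IBM_ROWS = {89, 90, 91, 92}


def classify_lead_byte(lead: int) -> str:
    """Name the CP51932 area a lead byte belongs to (illustration only)."""
    if not 0 <= lead <= 0xFF:
        raise ValueError("not a byte: %r" % (lead,))
    if lead < 0x80:
        return "code set 0"
    if lead == SS2:
        return "code set 2"
    if lead == SS3:
        # The IANA definition puts JIS X 0212 here; CP51932 does not.
        return "code set 3 (not supported)"
    if 0xA1 <= lead <= 0xFE:
        row = lead - 0xA0
        if row in NEC_SPECIAL_ROWS:
            return "code set 1, row %d (NEC special characters)" % row
        if row in NEC_SELECTED_IBM_ROWS:
            return "code set 1, row %d (NEC selection of IBM extensions)" % row
        return "code set 1, row %d" % row
    return "invalid lead byte"


if __name__ == "__main__":
    for lead in (0x41, 0x8E, 0x8F, 0xA4, 0xAD, 0xF9, 0xFC):
        print("0x%02X -> %s" % (lead, classify_lead_byte(lead)))
```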
Mozilla's current EUC-JP is a mix of CP51932 and JIS X 0212.
(According to bug 5184 in Bugzilla-jp, they are moving to CP51932.)
http://bugzilla.mozilla.gr.jp/show_bug.cgi?id=4873 (in Japanese)
http://bugzilla.mozilla.gr.jp/show_bug.cgi?id=5184 (in Japanese)
Chrome is the same as Mozilla.
http://code.google.com/p/chromium/issues/detail?id=3094
WebKit/Safari is of course almost the same as Chrome,
but it does a strange replacement.
https://bugs.webkit.org/show_bug.cgi?id=24906
http://code.google.com/p/chromium/issues/detail?id=9696 (Chrome doesn't do this)
I think HTML5's EUC-JP should be CP51932.
ISO-2022-JP (CP50220/CP50221/CP50222) has the same problem.
IANA Charsets defines Big5, but it doesn't say what "Big5" actually is.
IE's Big5 is CP950.
Mozilla uses its own table.
For decoding it is a mix of CP950, Big5-2003 and UAO; for encoding it is CP950.
https://bugzilla.mozilla.org/show_bug.cgi?id=310299
http://moztw.org/docs/big5/
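One way to see how far two "Big5" tables disagree is to decode the same bytes with each and compare. The sketch below does that with Python's own "big5" and "cp950" codecs. Those are Python's tables, assumed here only as convenient stand-ins; they are not IE's or Mozilla's tables, and the helper names and sample bytes are made up for this example. No expected output is given for the samples, since that depends on the tables in use.

```python
def decode_with(data: bytes, codecs: list) -> dict:
    """Decode the same bytes with several codecs; None marks a failure."""
    result = {}
    for name in codecs:
        try:
            result[name] = data.decode(name)
        except UnicodeDecodeError:
            result[name] = None
    return result


def differs(data: bytes, codecs: list) -> bool:
    """True if the codecs do not all agree on these bytes."""
    return len(set(decode_with(data, codecs).values())) > 1


if __name__ == "__main__":
    codecs = ["big5", "cp950"]
    # Sample byte pairs chosen for this sketch only.
    for sample in (b"\xa4\x40", b"\xa1\x45", b"\xf9\xd6"):
        decoded = decode_with(sample, codecs)
        shown = {
            name: None if text is None else "U+%04X" % ord(text)
            for name, text in decoded.items()
        }
        print(sample.hex(), shown, "DIFFERS" if differs(sample, codecs) else "same")
```

Running the same loop over every two-byte sequence would give a full diff between any two tables that are available as codecs.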
--
NARUSE, Yui <naruse at airemix.jp>