[whatwg] Web Encodings

NARUSE, Yui naruse at airemix.jp
Sat Sep 26 08:03:55 PDT 2009


Anne van Kesteren wrote:
> On Sun, 30 Aug 2009 03:47:34 +0200, Ian Hickson <ian at hixie.ch> wrote:
>> I've backed off UTS22. I think we need the IANA list updated, though, to
>> include the aliases browsers support. I understand you are working on
>> this? I would like to remove the table in the HTML5 spec that defines
>> such mappings, once that is done.
> 
> Part of the alias table is apparently incorrect. I will be working on
> registering the required aliases though, yes, once some more research is
> complete. This will however not solve at least the following two problems:
> 
>  * Some encodings need to be decoded (and encoded) using another
> encoding. (The other table HTML5 contains.)
>  * The standards for encodings do not always match the required
> implementation of the encoding. Apparently just like with anything else
> encoding standards do not match reality.
> 
> (Initially it also seemed to be a problem to register encodings with an
> "x-" prefix, but I think we're past that now, though of course we can't
> be sure until it actually succeeds.)

As far as I know, all major Japanese encodings have this problem,
and so do some other encodings.


As you know, IE's Shift_JIS implementation is actually Windows-31J,
and the other major Web browsers follow it.
http://www.microsoft.com/globaldev/reference/dbcs/932.htm

NOTE:
 According to IANA Charsets, the 7-bit area is defined as JIS X 0201:1997.
 But actual Windows-31J/CP932 maps 0x5C to U+005C (not U+00A5),
 and Japanese Windows fonts use the Yen Sign glyph for U+005C.
 The same problem applies to Tilde versus Overline at 0x7E.
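
To illustrate the note above, here is a minimal Python sketch. It assumes
Python's built-in "cp932" codec is an acceptable stand-in for Windows-31J;
JISX0201_ROMAN_OVERRIDES and both helper functions are hypothetical names
written for this illustration, not part of any library.

```python
# JIS X 0201 Roman differs from US-ASCII at exactly two code points.
JISX0201_ROMAN_OVERRIDES = {0x5C: "\u00a5", 0x7E: "\u203e"}  # YEN SIGN, OVERLINE


def decode_jisx0201_roman(byte: int) -> str:
    """Decode one 7-bit byte as JIS X 0201 Roman (hypothetical helper)."""
    if not 0 <= byte <= 0x7F:
        raise ValueError("not a 7-bit byte")
    return JISX0201_ROMAN_OVERRIDES.get(byte, chr(byte))


def decode_windows_31j(byte: int) -> str:
    """Decode one 7-bit byte with Python's cp932 codec (assumed stand-in)."""
    return bytes([byte]).decode("cp932")


for b in (0x5C, 0x7E):
    strict = ord(decode_jisx0201_roman(b))
    actual = ord(decode_windows_31j(b))
    print(f"0x{b:02X}: JIS X 0201 -> U+{strict:04X}, cp932 -> U+{actual:04X}")
```

The two columns disagree for both bytes, which is the mismatch between the
registered definition and the deployed one.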


You may know EUC-JP, another major Japanese encoding.
IANA Charsets defines it as follows:

  code set 0: US-ASCII (a single 7-bit byte set)
  code set 1: JIS X0208-1990 (a double 8-bit byte set)
             restricted to A0-FF in both bytes
  code set 2: Half Width Katakana (a single 7-bit byte set)
             requiring SS2 as the character prefix
  code set 3: JIS X0212-1990 (a double 7-bit byte set)
             restricted to A0-FF in both bytes
             requiring SS3 as the character prefix
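
The layout above can be sketched as a byte-level splitter. This is only a
structural illustration: split_euc_jp is a hypothetical helper, the A0-FF
range is taken as quoted from the registration, and SS2/SS3 are the usual
single-shift bytes 0x8E and 0x8F. It does not check whether a code point
is actually assigned.

```python
SS2 = 0x8E  # single shift 2: prefix for code set 2
SS3 = 0x8F  # single shift 3: prefix for code set 3


def split_euc_jp(data: bytes):
    """Split EUC-JP bytes into (code_set, raw_bytes) pairs (structure only)."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            code_set, length = 0, 1          # US-ASCII
        elif lead == SS2:
            code_set, length = 2, 2          # SS2 + one byte
        elif lead == SS3:
            code_set, length = 3, 3          # SS3 + two bytes
        elif lead >= 0xA0:
            code_set, length = 1, 2          # two bytes, both in A0-FF
        else:
            raise ValueError(f"unexpected lead byte 0x{lead:02X} at offset {i}")
        chunk = data[i:i + length]
        if len(chunk) < length:
            raise ValueError(f"truncated sequence at offset {i}")
        # Every byte after an ASCII byte or a shift prefix must be in A0-FF.
        tail = chunk[1:] if code_set in (2, 3) else chunk
        if code_set != 0 and any(b < 0xA0 for b in tail):
            raise ValueError(f"byte outside A0-FF at offset {i}")
        out.append((code_set, chunk))
        i += length
    return out


print(split_euc_jp(b"A\xa4\xa2\x8e\xb1\x8f\xb0\xa1"))
```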

But IE's EUC-JP implementation, called CP51932, is as follows:
http://reddog.s35.xrea.com/wiki/cp51932.enc.html

  code set 0: US-ASCII (a single 7-bit byte set)
  code set 1: JIS X0208-1990 (a double 8-bit byte set),
             NEC special characters (Row 13),
             NEC selection of IBM extensions (Rows 89 to 92),
             and IBM extensions (Rows 115 to 119)
             restricted to A0-FF in both bytes
  code set 2: Half Width Katakana (a single 7-bit byte set)
             requiring SS2 as the character prefix
  code set 3: not supported
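
As a rough illustration of the Row 13 extension, the sketch below takes
CIRCLED DIGIT ONE (U+2460), an NEC special character, and derives its EUC
form from its CP932 bytes. Python has no CP51932 codec, so this assumes
that the CP51932 bytes for Row 13 are the ordinary row/cell transform of
the CP932 bytes; sjis_to_euc is a hypothetical helper implementing the
standard Shift_JIS to EUC arithmetic. Python's "euc_jp" codec is used here
as an example of a decoder without the NEC rows.

```python
def sjis_to_euc(lead: int, trail: int) -> bytes:
    """Convert one double-byte Shift_JIS code to the EUC bytes of the same row/cell."""
    if not (0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xEF):
        raise ValueError("lead byte outside the 94-row Shift_JIS area")
    if not (0x40 <= trail <= 0xFC) or trail == 0x7F:
        raise ValueError("invalid trail byte")
    row_offset = 0x70 if lead < 0xA0 else 0xB0
    if trail < 0x9F:
        # First half of the trail range: odd-numbered row.
        j1 = ((lead - row_offset) << 1) - 1
        j2 = trail - (0x20 if trail > 0x7F else 0x1F)
    else:
        # Second half of the trail range: even-numbered row.
        j1 = (lead - row_offset) << 1
        j2 = trail - 0x7E
    return bytes([j1 | 0x80, j2 | 0x80])


ch = "\u2460"                      # CIRCLED DIGIT ONE
sjis = ch.encode("cp932")          # CP932 bytes for the character
euc = sjis_to_euc(sjis[0], sjis[1])
row = euc[0] - 0xA0                # JIS row number of the result
print(f"CP932 {sjis.hex()} -> EUC {euc.hex()} (row {row})")

# A codec limited to JIS X 0208 / JIS X 0212 cannot represent this character.
try:
    ch.encode("euc_jp")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print("encodable with Python's euc_jp codec:", strict_ok)
```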

Mozilla's current implementation is a mix of CP51932 and JIS X 0212.
(In bug 5184 of Bugzilla-jp, they are moving to CP51932.)
http://bugzilla.mozilla.gr.jp/show_bug.cgi?id=4873 (in Japanese)
http://bugzilla.mozilla.gr.jp/show_bug.cgi?id=5184 (in Japanese)

Chrome is the same as Mozilla.
http://code.google.com/p/chromium/issues/detail?id=3094

WebKit/Safari is, of course, almost the same as Chrome,
but it performs a strange replacement.
https://bugs.webkit.org/show_bug.cgi?id=24906
http://code.google.com/p/chromium/issues/detail?id=9696 (Chrome doesn't)

I think HTML5's EUC-JP should be CP51932.


ISO-2022-JP (CP50220/CP50221/CP50222) has the same problem.


IANA Charsets defines Big5, but it doesn't say what "Big5" actually is.
IE's Big5 is CP950.

Mozilla uses its own table.
For decoding it is a mix of CP950, Big5-2003 and UAO; for encoding it is CP950.
https://bugzilla.mozilla.org/show_bug.cgi?id=310299
http://moztw.org/docs/big5/
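
To show that "Big5" is not one table, the sketch below compares two
mapping tables that both go by that name. It uses Python's built-in "big5"
and "cp950" codecs purely as examples; they are not the tables used by IE
or Mozilla, and decode_or_none and diff_double_byte are hypothetical
helpers written for this illustration.

```python
def decode_or_none(data: bytes, codec: str):
    """Decode data with codec, returning None when the bytes are unmapped."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


def diff_double_byte(codec_a: str, codec_b: str):
    """List (raw, a_result, b_result) for double-byte codes where the codecs disagree."""
    trails = list(range(0x40, 0x7F)) + list(range(0xA1, 0xFF))
    diffs = []
    for lead in range(0x81, 0xFF):
        for trail in trails:
            raw = bytes([lead, trail])
            a = decode_or_none(raw, codec_a)
            b = decode_or_none(raw, codec_b)
            if a != b:
                diffs.append((raw, a, b))
    return diffs


diffs = diff_double_byte("big5", "cp950")
print(len(diffs), "double-byte codes decode differently")
for raw, a, b in diffs[:5]:
    print(raw.hex(), repr(a), repr(b))
```

The same comparison could be run against any other pair of tables to see
how far an implementation departs from a given definition.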

-- 
NARUSE, Yui  <naruse at airemix.jp>

