[whatwg] Encodings and the web

NARUSE, Yui naruse at airemix.jp
Sat Jan 7 16:37:14 PST 2012


(2012/01/07 0:38), Anne van Kesteren wrote:
> On Thu, 22 Dec 2011 15:33:35 +0100, L. David Baron <dbaron at dbaron.org> wrote:
>> This seems like one of those areas where it may be substantially
>> easier to figure out what implementations do by looking at their
>> code than by reverse-engineering, at least for the implementations
>> whose code is available publicly.
>>
>> Gecko's code lives in
>> http://mxr.mozilla.org/mozilla-central/source/intl/uconv/ .  There
>> are others who know it substantially better, but I or others could
>> probably answer questions you have about how it works and how to
>> understand it.
>>
>> I'm not the right person for pointers to other implementations,
>> though.
> 
> Thanks, I'm doing a combination of code inspection, reverse engineering (especially for edge cases), and applying some lessons we learned (e.g. non-greedy error handling).
> 
> So far I defined the to Unicode algorithms for hz-gb-2312, euc-jp, iso-2022-jp, and shift_jis.

= Legacy multi-octet Chinese (traditional) encodings

Mozilla supports another Big5 variant, Big5-UAO.
http://bugs.ruby-lang.org/issues/1784

= Legacy multi-octet Japanese encodings

> The jis code point for a given number is: ...
> The jis0208 index for a given octet is:

I have doubts about this description,
so let me explain the concepts behind JIS X 0208.

The most important thing is that JIS X 0208 is defined in the context of ISO/IEC 2022.
It is designed as an ISO/IEC 2022 double-byte 94-character set,
which means its code space is 94 x 94.
http://en.wikipedia.org/wiki/JIS_X_0208

At the top level, there are kuten numbers.
"ku" is the row, expressed by the first byte of the double-byte code.
"ten" is the cell, expressed by the second byte of the double-byte code.
So a kuten number identifies a code point.
Both ku and ten are integers from 1 to 94.
For example, the kuten number of HIRAGANA LETTER A (U+3042) is 04-02
(04-01 is HIRAGANA LETTER SMALL A).

ISO-2022-JP, EUC-JP, and Shift_JIS map a kuten number to bytes.
ISO-2022-JP's double bytes are:
 first:  ku  + 0x20
 second: ten + 0x20
EUC-JP's double bytes are:
 first:  ku  + 0xA0
 second: ten + 0xA0
Shift_JIS's double bytes are:
 first:  if    1 <= ku <= 62 then (ku-1) / 2 + 0x81
         elif 63 <= ku <= 94 then (ku-1) / 2 + 0xC1
         (integer division, rounding down)
 second: if ku is odd
           if    1 <= ten <= 63 then ten + 0x3F
           elif 64 <= ten <= 94 then ten + 0x40
         elif ku is even then ten + 0x9E
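
The three mappings can be sketched in Python as follows. This is only an
illustration, not text from any spec: the function names are made up for this
mail, "/" is taken to be integer division, and the Shift_JIS trail byte is
chosen by the parity of ku (odd rows use the 0x40-0x9E half and skip 0x7F,
even rows use 0x9F-0xFC).

```python
def _check(ku, ten):
    # Both ku (row) and ten (cell) must lie in the 94 x 94 code space.
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must both be in 1..94")


def kuten_to_iso2022jp(ku, ten):
    """Two bytes used inside an ISO-2022-JP double-byte state."""
    _check(ku, ten)
    return bytes([ku + 0x20, ten + 0x20])


def kuten_to_eucjp(ku, ten):
    """Two bytes used by EUC-JP for a JIS X 0208 character."""
    _check(ku, ten)
    return bytes([ku + 0xA0, ten + 0xA0])


def kuten_to_shiftjis(ku, ten):
    """Two bytes used by Shift_JIS; two rows share one lead byte."""
    _check(ku, ten)
    lead = (ku - 1) // 2 + (0x81 if ku <= 62 else 0xC1)
    if ku % 2 == 1:
        # Odd row: lower half of the trail range, skipping 0x7F.
        trail = ten + (0x3F if ten <= 63 else 0x40)
    else:
        # Even row: upper half of the trail range.
        trail = ten + 0x9E
    return bytes([lead, trail])


if __name__ == "__main__":
    # kuten 16-01 is the first kanji of JIS X 0208 (U+4E9C).
    for f in (kuten_to_iso2022jp, kuten_to_eucjp, kuten_to_shiftjis):
        print(f.__name__, f(16, 1).hex())
```

For kuten 16-01 this gives 0x3021, 0xB0A1, and 0x889F respectively.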


So theoretically, we should make a conversion table between
kuten numbers and Unicode scalar values.

But as you know, "JIS X 0208" in the web context should really be Windows Code Page 932,
which is JIS X 0208 as extended by Microsoft.
http://msdn.microsoft.com/en-us/goglobal/cc305152
It is defined in terms of Shift_JIS.
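
To illustrate why the choice of table matters, here is a small sketch using
Python's built-in "shift_jis" and "cp932" codecs. They are used here only as
convenient stand-ins for a plain JIS X 0208 table and the Microsoft table;
browsers may differ in detail, and the helper name is made up for this mail.

```python
def try_decode(data, encoding):
    """Decode bytes, returning None instead of raising on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Shift_JIS bytes for kuten 01-33.
WAVE = b"\x81\x60"
# Shift_JIS bytes for kuten 13-01, a row that Code Page 932 fills in.
CIRCLED_ONE = b"\x87\x40"

if __name__ == "__main__":
    for name, data in (("01-33", WAVE), ("13-01", CIRCLED_ONE)):
        for enc in ("shift_jis", "cp932"):
            ch = try_decode(data, enc)
            shown = "undecodable" if ch is None else "U+%04X" % ord(ch)
            print(name, enc, shown)
```

With these codecs, kuten 01-33 decodes to U+301C under "shift_jis" but to
U+FF5E under "cp932", and "cp932" maps kuten 13-01 to U+2460.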

> The jis0212 index for a given octet is:

As noted in Mozilla's Bugzilla, bug 600715, IE doesn't support JIS X 0212.
https://bugzilla.mozilla.org/show_bug.cgi?id=600715
How to treat JIS X 0212 in this Encoding spec will be a problem.

== iso-2022-jp
=== The to Unicode algorithm
==== Based on iso-2022-jp state
===== ASCII state
====== Based on octet:
======= Otherwise
> If the fatal flag is set, return failure.
> Otherwise, emit the fallback code point.

Just FYI, IE and Opera show these bytes as halfwidth Katakana:
if the octet is greater than 0xA0 and less than 0xE0, the value is octet + 0xFEC0.
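
As a minimal sketch of that rule (the function name is made up for this mail,
and this only mirrors the behaviour described above, not any spec text):

```python
def katakana_for_octet(octet):
    """Map a stray octet in 0xA1..0xDF to a code point, else None."""
    if 0xA0 < octet < 0xE0:
        # 0xA1 lands on U+FF61 and 0xDF on U+FF9F.
        return chr(octet + 0xFEC0)
    return None


if __name__ == "__main__":
    for octet in (0xA0, 0xA1, 0xB1, 0xDF, 0xE0):
        ch = katakana_for_octet(octet)
        print(hex(octet), None if ch is None else "U+%04X" % ord(ch))
```

This is the same range that shift_jis uses for single-byte Katakana.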

Moreover, IE decodes any shift_jis character here.
It seems that IE uses the same converter for both iso-2022-jp and shift_jis.

-- 
NARUSE, Yui  <naruse at airemix.jp>


