[whatwg] Encodings and the web
Anne van Kesteren
annevk at opera.com
Sun Jan 8 06:32:47 PST 2012
On Sun, 08 Jan 2012 01:37:14 +0100, NARUSE, Yui <naruse at airemix.jp> wrote:
> = Legacy multi-octet Chinese (traditional) encodings
> Mozilla supports another Big5 variant, Big5-UAO.
As part of the big5 encoding, right? It sounds like it's a good idea to
adopt that. I don't think there's much concern about table size these
days, though obviously the less complexity the better.
> = Legacy multi-octet Japanese encodings
>> The jis code point for a given number is: ...
>> The jis0208 index for a given octet is:
> I wonder about this description.
> I should explain the concept of JIS X 0208.
> The most important thing is that JIS X 0208 is on the context of ISO
> Its target is an ISO/IEC 2022 double-byte 94-character set.
> That means its code space is 94 x 94.
> At the top level there are kuten numbers.
> "ku" is the row, expressed by the first byte of the double-byte code.
> "ten" is the cell, expressed by the second byte of the double-byte code.
> So a kuten number expresses a code point.
> Both ku and ten are integers from 1 to 94.
> For example, HIRAGANA LETTER SMALL A (U+3041) has kuten number 04-01.
> ISO-2022-JP, EUC-JP, and Shift_JIS map a kuten number to bytes.
> ISO-2022-JP's double bytes are:
>   first:  ku + 0x20
>   second: ten + 0x20
> EUC-JP's double bytes are:
>   first:  ku + 0xA0
>   second: ten + 0xA0
> Shift_JIS's double bytes are (with integer division):
>   first:  if 1 <= ku <= 62 then (ku-1) / 2 + 0x81
>           elif 63 <= ku <= 94 then (ku-1) / 2 + 0xC1
>   second: if ku is odd
>             if 1 <= ten <= 63 then ten + 0x3F
>             elif 64 <= ten <= 94 then ten + 0x40
>           elif ku is even then ten + 0x9E
> So theoretically, we should make a conversion table between
> kuten numbers and Unicode scalar values.
> But as you know, "JIS X 0208" in the web context should be the Windows Code Page
> as extended by Microsoft.
> It is defined in terms of Shift_JIS.
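For reference, the kuten-to-bytes rules quoted above can be sketched in a few
lines of Python. This is only an illustration: the function names are made up
for this message, and the Shift_JIS trail byte follows the usual arrangement
in which an odd ku uses the lower trail range (0x40-0x7E, 0x80-0x9E, skipping
0x7F) and an even ku uses the upper one (0x9F-0xFC). As a sanity check, 04-01
comes out as 0x82 0x9F.

```python
def _check(ku, ten):
    # Both coordinates of a kuten number run from 1 to 94.
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must both be in 1..94")


def kuten_to_iso2022jp(ku, ten):
    """Double-byte form used inside an ISO-2022-JP JIS X 0208 run (no escapes)."""
    _check(ku, ten)
    return bytes([ku + 0x20, ten + 0x20])


def kuten_to_eucjp(ku, ten):
    """EUC-JP double bytes for a JIS X 0208 kuten number."""
    _check(ku, ten)
    return bytes([ku + 0xA0, ten + 0xA0])


def kuten_to_sjis(ku, ten):
    """Shift_JIS double bytes for a JIS X 0208 kuten number."""
    _check(ku, ten)
    # Two rows share one lead byte; the lead range has a gap after 0x9F.
    lead = (ku - 1) // 2 + (0x81 if ku <= 62 else 0xC1)
    if ku % 2 == 1:
        # Odd row: lower half of the trail range, skipping 0x7F.
        trail = ten + (0x3F if ten <= 63 else 0x40)
    else:
        # Even row: upper half of the trail range.
        trail = ten + 0x9E
    return bytes([lead, trail])


if __name__ == "__main__":
    for encode in (kuten_to_iso2022jp, kuten_to_eucjp, kuten_to_sjis):
        print(encode.__name__, encode(4, 1).hex())
```

Note that this covers only the pure JIS X 0208 arithmetic; it says nothing
about the Microsoft extensions mentioned above.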
>> The jis0212 index for a given octet is:
> As noted in Bugzilla at Mozilla bug 600715, IE doesn't support JIS X 0212.
> How to treat X 0212 in this Encoding spec will be a problem.
Yeah, so currently I use Gecko's approach (roughly) for the Japanese
encodings, including how they put both 0208 and 0212 in a single longish
array. But maybe I should instead write it down the way Unicode.org does,
with each double-octet sequence mapping to a Unicode character.
As for 0212, it's not that hard to support, and given how long it has been
deployed this way I think it's probably safer to keep it.
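To make the two table shapes a bit more concrete, here is a rough sketch. It
is not Gecko's actual array nor Unicode.org's actual file: the names
build_tables, flat and pairs are hypothetical, the data comes from Python's
bundled euc_jp codec purely as a stand-in source, and only JIS X 0208 is
covered (0212 is left out to keep it short).

```python
def build_tables():
    """Build the same JIS X 0208 mapping in two shapes.

    flat  : list of 94 * 94 entries, indexed by (ku - 1) * 94 + (ten - 1)
    pairs : dict keyed by the double-octet sequence (ku + 0x20, ten + 0x20)
    """
    flat = [None] * (94 * 94)
    pairs = {}
    for ku in range(1, 95):
        for ten in range(1, 95):
            # Assumption: use Python's euc_jp codec as the data source.
            octets = bytes([ku + 0xA0, ten + 0xA0])
            try:
                char = octets.decode("euc_jp")
            except UnicodeDecodeError:
                continue  # unassigned cell, leave the flat slot as None
            flat[(ku - 1) * 94 + (ten - 1)] = char
            pairs[(ku + 0x20, ten + 0x20)] = char
    return flat, pairs


if __name__ == "__main__":
    flat, pairs = build_tables()
    # Same character (kuten 04-01), looked up both ways.
    print(hex(ord(flat[(4 - 1) * 94 + (1 - 1)])))
    print(hex(ord(pairs[(0x24, 0x21)])))
```

The flat form needs the index arithmetic spelled out in the spec but keeps
the table itself trivial; the pair-keyed form reads more directly but ties
the table to one particular byte representation.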
> == iso-2022-jp
> === The to Unicode algorithm
> ==== Based on iso-2022-jp state
> ===== ASCII state
> ====== Based on octet:
> ======= Otherwise
>> If the fatal flag is set, return failure.
>> Otherwise, emit the fallback code point.
> Just FYI, IE and Opera show these bytes as Katakana.
> If octet is greater than 0xA0 and less than 0xE0, value is octet +
> Moreover IE shows any shift_jis characters here.
> It seems that IE uses the same converter for both iso-2022-jp and shift_jis.
I have filed a bug on Opera to make it stricter, like WebKit/Gecko. If
there is some evidence that approach is wrong though, we can turn it
Anne van Kesteren