[whatwg] Encoding: big5 and big5-hkscs
philipj at opera.com
Fri Apr 6 06:42:26 PDT 2012
On Fri, 06 Apr 2012 12:54:53 +0200, Philip Jägenstedt <philipj at opera.com>
> As a starting point for the spec, I suggest taking the intersection of
> opera-hk, firefox-hk and chrome-hk.
I've written a script in <https://gitorious.org/whatwg/big5> to generate
the mapping that I think makes sense. This is the logic used:
1. If all 3 *-hk mappings agree, use that.
2. If 2 of the *-hk mappings agree on something that is not in the PUA and
not U+FFFD, use that.
3. If HKSCS-2008  defines a mapping, verify that at least 1 *-hk
mapping agrees and use that.
Finally, check that the resulting spec does not use the PUA, U+FFFD or
contradicts a Big5 mapping that everybody agrees on.
This yields a mapping for 18583 of 19782 combinations, which I propose as
a starting point. To this I would add these 4 mappings from HKSCS-2008,
which uses multiple code points to represent what was previously a single
code point in the PUA in some browsers:
8862 => <U+00CA,U+0304> Ê̄
8864 => <U+00CA,U+030C> Ê̌
88A3 => <U+00EA,U+0304> ê̄
88A5 => <U+00EA,U+030C> ê̌
Also, a single mapping fails the Big5-contraction test:
opera-hk: U+FFED ￭
firefox: U+2593 ▓
chrome: U+2593 ▓
firefox-hk: U+2593 ▓
opera: U+2593 ▓
chrome-hk: U+FFED ￭
internetexplorer: U+2593 ▓
hkscs-2008: <U+FFED> ￭
I'd say that we should go with U+FFED here, since that's what the spec
says and it's visually close anyway.
These are the ranges that need more investigation.
8140-817F, 81A2-81FE, 8240-827F, 82A2-82FE, 8340-837F, 83A2-83FE,
8440-847F, 84A2-84FE, 8540-857F, 85A2-85FE, 8640-867F, 86A2-86FE, 8766,
87E0-87FE, 8862, 8864, 88A3, 88A5, 88AB-88FE, 8942, 8944-8945, 894A-894B,
89A7-89AA, 89AF, 89B3-89B4, 89C0, 89C4, 8A42, 8A63, 8A75, 8AAB, 8AB1,
8ABA, 8AC8, 8ACD, 8ADD-8ADE, 8AF5, 8B54, 8BDD, 8BFE, 8CA6, 8CC6-8CC8,
8CCD, 8CE5, 8D41, 9B61, 9EAC, 9EC4, 9EF4, 9F4E, 9FAD, 9FB1, 9FC0, 9FC8,
9FDA, 9FE6, 9FEA, 9FEF, A054, A057, A05A, A062, A072, A0A5, A0AD, A0AF,
A0D3, A0E1, A3E2-A3FE, C6CF, C6D3, C6D5, C6D7, C6DE-C6DF, C8A5-C8CC,
They all map to U+FFFD in opera-hk and mostly to PUA points in other
mappings. A lot of them should probably be U+FFFD, but not all of them. Is
someone (Simon?) able to do a search for existing content labeled as Big5
or Big5-HKSCS that uses any of these bytes?
More information about the whatwg