[whatwg] Encoding: big5 and big5-hkscs
Anne van Kesteren
annevk at opera.com
Thu Mar 29 02:16:42 PDT 2012
On Wed, 28 Mar 2012 17:40:58 +0200, Philip Jägenstedt <philipj at opera.com> wrote:
> Making big5 and big5-hkscs aliases sounds like a good idea, on the
> assumption that big5-hkscs is a pure extension of Big5.
I believe it is not a pure extension, but given that a) Windows treats them
identically and b) reportedly has no different default setup for Hong Kong
and Taiwan users (and no longer offers an HKSCS download), they can probably
be considered the same.
For more details on Windows and Internet Explorer, see:
> To make this more concrete, here are a few fairly common characters that
> I think are in big5-hkscs but not in big5, their unicode point and byte
> representation in big5-hkscs when converted using Python:
> 啫 U+556B '\x94\xdc'
> 嗰 U+55F0 '\x9d\xf5'
> 嘅 U+5605 '\x9d\xef'
> I'm not sure how to use big5.json, so perhaps you can tell me what these
> map to in various browsers? If they're all the same, examples of byte
> sequences that don't would be interesting.
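For anyone who wants to reproduce the byte values quoted above, here is a minimal sketch, assuming Python 3 and its built-in "big5hkscs" and "big5" codecs. The helper name try_encode is mine and purely illustrative. I would expect plain "big5" to fail for these characters, but the sketch just reports whatever the codec does rather than assuming that.

```python
# Characters quoted above, with the big5-hkscs bytes reported in this thread.
EXAMPLES = {
    "\u556b": b"\x94\xdc",  # 啫
    "\u55f0": b"\x9d\xf5",  # 嗰
    "\u5605": b"\x9d\xef",  # 嘅
}


def try_encode(char, codec):
    """Return the encoded bytes, or None if the codec cannot encode char."""
    try:
        return char.encode(codec)
    except UnicodeEncodeError:
        return None


for char in EXAMPLES:
    # Print the code point, then the result for each of the two codecs.
    print("U+%04X" % ord(char),
          try_encode(char, "big5hkscs"),
          try_encode(char, "big5"))
```

Note that this only tells you what Python's codecs do, which is not necessarily what any browser does.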
big5.json is the result of outputting all possible lead/trail byte
combinations and then running charCodeAt over the resulting string, while
accounting for surrogates and working around a minor problem in Opera.
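I ran the following (Python):
import json

# Each key in big5.json holds a list of code points, one per lead/trail
# byte combination.
with open("big5.json", "r") as f:
    data = json.load(f)
lead = 0x9D
trail = 0xF5
# Trail bytes per lead byte: 0x40-0x7E and 0xA1-0xFE (157 in total).
row = 0xFE-0xA1 + 0x7E-0x40 + 2
cell = (trail-0xA1 + 0x7E-0x40 + 1) if trail > 0x7E else trail - 0x40
index = (lead-0x81) * row + cell
for x in data:
    print(x, hex(data[x][index]))
The output indicates that browsers agree for big5-hkscs and not at all for
big5. I got similar results for your other examples.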
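The index arithmetic can also be written as a pair of small functions, which makes it easier to check other byte sequences. This is only a sketch: the names big5_index and big5_bytes are hypothetical, and the lead byte range 0x81-0xFE is my assumption based on the "lead-0x81" term above, not something big5.json itself documents.

```python
LOW_TRAILS = 0x7E - 0x40 + 1    # trail bytes 0x40-0x7E: 63 values
HIGH_TRAILS = 0xFE - 0xA1 + 1   # trail bytes 0xA1-0xFE: 94 values
TRAILS_PER_LEAD = LOW_TRAILS + HIGH_TRAILS  # 157


def big5_index(lead, trail):
    """Map a lead/trail byte pair to a position in the flat list."""
    if not 0x81 <= lead <= 0xFE:  # assumed lead range
        raise ValueError("lead byte out of range: 0x%02X" % lead)
    if 0x40 <= trail <= 0x7E:
        cell = trail - 0x40
    elif 0xA1 <= trail <= 0xFE:
        cell = trail - 0xA1 + LOW_TRAILS
    else:
        raise ValueError("trail byte out of range: 0x%02X" % trail)
    return (lead - 0x81) * TRAILS_PER_LEAD + cell


def big5_bytes(index):
    """Inverse of big5_index: recover the (lead, trail) byte pair."""
    lead, cell = divmod(index, TRAILS_PER_LEAD)
    if cell < LOW_TRAILS:
        trail = cell + 0x40
    else:
        trail = cell - LOW_TRAILS + 0xA1
    return lead + 0x81, trail


print(big5_index(0x9D, 0xF5))  # position for the bytes '\x9d\xf5'
```

The second function is handy when going the other way, for example to find which byte sequences sit at positions where the browsers disagree.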
> It seems fairly obvious that the most sane solution would be to just use
> a more correct mapping that doesn't involve the PUA, but:
> 1. What is the compatible subset of all browsers?
> 2. Does that subset include anything mapping to the PUA?
This depends on whether or not you include big5-hkscs results. Opera never
maps to PUA, but whether that is compatible enough is unclear.
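To put numbers on the PUA question, something like the following could be run over the data. It is a rough sketch: it assumes the data is a dict from some key (say, one per browser and label) to a list of integer code points, and the sample keys and values below are made up for illustration, not taken from big5.json. The PUA ranges are the three Unicode private use ranges.

```python
# Unicode private use ranges: the BMP block plus planes 15 and 16.
PUA_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_pua(code_point):
    """True if code_point falls in one of the private use ranges."""
    return any(lo <= code_point <= hi for lo, hi in PUA_RANGES)


def count_pua(data):
    """Count PUA code points per key; non-integer entries are skipped."""
    return {
        key: sum(1 for cp in code_points
                 if isinstance(cp, int) and is_pua(cp))
        for key, code_points in data.items()
    }


# Hypothetical sample, NOT real browser data.
sample = {
    "browser-a big5": [0x55F0, 0xE000, 0xF8FF],
    "browser-b big5": [0x55F0, 0x5605, 0x556B],
}
print(count_pua(sample))
```

A key with a count of zero never maps to the PUA in that data set, which is the property claimed for Opera above.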
> 3. Do Hong Kong or Taiwan sites depend on charCodeAt returning values in
> the PUA?
> 4. Would hacks be needed on the font-loading side if browsers started
> using a more correct mapping?
Mozilla has done a number of interesting things here nobody else does, but
that was all big in '05 or earlier.
How relevant that is today, given that they are not the market leader
there, is unclear.
Given the information from Microsoft mentioned at the start of this email,
I think following Internet Explorer here is probably the best way forward,
combined with strongly discouraging the use of big5.
Anne van Kesteren