[whatwg] Encoding: big5 and big5-hkscs
Anne van Kesteren
annevk at opera.com
Thu Mar 29 02:16:42 PDT 2012
On Wed, 28 Mar 2012 17:40:58 +0200, Philip Jägenstedt <philipj at opera.com>
wrote:
> Making big5 and big5-hkscs aliases sounds like a good idea, on the
> assumption that big5-hkscs is a pure extension of Big5.
I believe it is not, but given that a) Windows treats them identically
and b) reportedly has no different default setup for Hong Kong and Taiwan
users (and no longer offers an HKSCS download), they can probably be
considered the same.
For more details on Windows and Internet Explorer, see:
http://lists.w3.org/Archives/Public/www-archive/2012Mar/thread.html#msg46
> To make this more concrete, here are a few fairly common characters that
> I think are in big5-hkscs but not in big5, their unicode point and byte
> representation in big5-hkscs when converted using Python:
>
> 啫 U+556B '\x94\xdc'
> 嗰 U+55F0 '\x9d\xf5'
> 嘅 U+5605 '\x9d\xef'
>
> I'm not sure how to use big5.json, so perhaps you can tell me what these
> map to in various browsers? If they're all the same, examples of byte
> sequences that don't would be interesting.
big5.json is the result of outputting all possible lead/trail byte
combinations and then running charCodeAt over the resulting string, while
accounting for surrogates and working around a minor problem in Opera.
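As a rough analogue of that procedure (not how big5.json itself was produced, which was done in browsers), the sketch below builds the same kind of array from one of Python's own codecs. The helper name build_table, the lead byte range 0x81-0xFE, and the choice to record U+FFFD for pairs that fail to decode or that decode to more than one code point are all assumptions made for illustration.

```python
# Hypothetical helper; Python codecs stand in for the browser decoders.
TRAILS = list(range(0x40, 0x7F)) + list(range(0xA1, 0xFF))  # 157 trail bytes


def build_table(codec, first_lead=0x81, last_lead=0xFE):
    """Decode every lead/trail pair and record one code point per pair."""
    table = []
    for lead in range(first_lead, last_lead + 1):
        for trail in TRAILS:
            try:
                text = bytes([lead, trail]).decode(codec)
            except UnicodeDecodeError:
                text = ""
            # ord() already yields full code points, so no surrogate handling
            # is needed here; pairs that fail or yield several code points
            # are recorded as U+FFFD for simplicity.
            table.append(ord(text) if len(text) == 1 else 0xFFFD)
    return table


hkscs = build_table("big5hkscs")
print(len(hkscs))
```

Python's decoder returns whole code points, so the surrogate bookkeeping needed with charCodeAt does not arise in this version.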
Running the following (Python):
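import json
with open("big5.json") as f:
    data = json.load(f)
lead = 0x9D
trail = 0xF5
# Trail bytes per lead byte: 0x40-0x7E and 0xA1-0xFE, 157 in total.
row = 0xFE-0xA1 + 0x7E-0x40 + 2
# Position of the trail byte within its row.
cell = (trail-0xA1 + 0x7E-0x40 + 1) if trail > 0x7E else trail - 0x40
index = (lead-0x81) * row + cell
# One array per browser/encoding combination.
for x in data:
    print(x, hex(data[x][index]))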
I get
opera-hk 0x55f0
firefox 0x9c1f
chrome 0xecd7
firefox-hk 0x55f0
opera 0xfffd
chrome-hk 0x55f0
internetexplorer 0xecd7
indicating that browsers agree for big5-hkscs but not at all for big5. I
get similar results for your other examples.
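To look up the other examples, the index arithmetic can be wrapped in a small function. This is a minimal sketch; the name big5_index is hypothetical, and it assumes the same layout as above (lead bytes starting at 0x81, trail bytes 0x40-0x7E followed by 0xA1-0xFE).

```python
ROW = 157  # trail bytes per lead byte: 0x40-0x7E (63) plus 0xA1-0xFE (94)


def big5_index(lead, trail):
    """Hypothetical helper: position of a lead/trail pair in each array."""
    cell = trail - 0x40 if trail <= 0x7E else trail - 0xA1 + 63
    return (lead - 0x81) * ROW + cell


# The three byte pairs quoted earlier in this message.
for lead, trail in [(0x94, 0xDC), (0x9D, 0xF5), (0x9D, 0xEF)]:
    print(hex(lead), hex(trail), big5_index(lead, trail))
```

The three pairs land at indexes 3105, 4543 and 4537 respectively.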
> It seems fairly obvious that the most sane solution would be to just use
> a more correct mapping that doesn't involve the PUA, but:
>
> 1. What is the compatible subset of all browsers?
> 2. Does that subset include anything mapping to the PUA?
This depends on whether or not you include big5-hkscs results. Opera never
maps to PUA, but whether that is compatible enough is unclear.
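To make the PUA question concrete for the one byte pair looked up earlier (0x9D 0xF5), the values can be sorted by Unicode general category. This is only a sketch over the seven values reported in this message; the helper name kind is hypothetical, and "Co" is the general category for private-use code points.

```python
import unicodedata

# Values reported earlier in this message for lead 0x9D, trail 0xF5.
results = {
    "opera-hk": 0x55F0, "firefox": 0x9C1F, "chrome": 0xECD7,
    "firefox-hk": 0x55F0, "opera": 0xFFFD, "chrome-hk": 0x55F0,
    "internetexplorer": 0xECD7,
}


def kind(cp):
    """Hypothetical helper: coarse classification of a code point."""
    if cp == 0xFFFD:
        return "replacement character"
    if unicodedata.category(chr(cp)) == "Co":
        return "private use"
    return "assigned character"


for name, cp in sorted(results.items()):
    print(name, hex(cp), kind(cp))
```

For this pair, Chrome and Internet Explorer give a private-use code point under big5, Firefox gives an assigned character, and Opera gives U+FFFD.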
> 3. Do Hong Kong or Taiwan sites depend on charCodeAt returning values in
> the PUA?
>
> 4. Would hacks be needed on the font-loading side if browsers started
> using a more correct mapping?
Don't know.
Mozilla has done a number of interesting things here that nobody else does,
but that was all back in '05 or earlier.
https://bugzilla.mozilla.org/show_bug.cgi?id=9686
https://bugzilla.mozilla.org/show_bug.cgi?id=310299
How relevant that is today, given that they are not the market leader
there, is unclear.
Given the information from Microsoft referenced at the start of this email,
I think just following Internet Explorer here is probably the best way
forward, combined with strongly discouraging the use of big5.
--
Anne van Kesteren
http://annevankesteren.nl/