[whatwg] Encoding: big5 and big5-hkscs

Anne van Kesteren annevk at opera.com
Thu Mar 29 02:16:42 PDT 2012


On Wed, 28 Mar 2012 17:40:58 +0200, Philip Jägenstedt <philipj at opera.com>  
wrote:
> Making big5 and big5-hkscs aliases sounds like a good idea, on the  
> assumption that big5-hkscs is a pure extension of Big5.

I believe it is not, but given that a) Windows treats them identically
and b) reportedly has no different default setup for Hong Kong and Taiwan
users (and no longer offers an HKSCS download), they can probably be
considered the same.

For more details on Windows and Internet Explorer, see:  
http://lists.w3.org/Archives/Public/www-archive/2012Mar/thread.html#msg46


> To make this more concrete, here are a few fairly common characters that  
> I think are in big5-hkscs but not in big5, their unicode point and byte  
> representation in big5-hkscs when converted using Python:
>
> 啫 U+556B '\x94\xdc'
> 嗰 U+55F0 '\x9d\xf5'
> 嘅 U+5605 '\x9d\xef'
>
> I'm not sure how to use big5.json, so perhaps you can tell me what these  
> map to in various browsers? If they're all the same, examples of byte  
> sequences that don't would be interesting.
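
For reference, the conversion quoted above can be reproduced with Python's
built-in codecs. This is only a sketch, assuming Python 3 and its
"big5hkscs" and "big5" codecs; the helper name try_encode is made up for
illustration. It prints None where a codec has no mapping for a character.

```python
# The three characters from the quoted message.
CHARS = ["\u556b", "\u55f0", "\u5605"]


def try_encode(ch, codec):
    """Return the encoded bytes, or None if the codec cannot encode ch."""
    try:
        return ch.encode(codec)
    except UnicodeEncodeError:
        return None


for ch in CHARS:
    hkscs = try_encode(ch, "big5hkscs")
    plain = try_encode(ch, "big5")
    print("U+%04X" % ord(ch), "big5hkscs:", hkscs, "big5:", plain)
```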

big5.json is the result of outputting all possible lead/trail byte  
combinations and then running charCodeAt over the resulting string, while  
accounting for surrogates and working around a minor problem in Opera.  
Running the following (Python):

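import json

# big5.json maps a browser label to an array of code points, one per
# lead/trail byte combination.
with open("big5.json", "r") as f:
    data = json.load(f)

lead = 0x9D
trail = 0xF5

# Each lead byte has 157 trail bytes: 0x40-0x7E and 0xA1-0xFE.
row = 0xFE-0xA1 + 0x7E-0x40 + 2
cell = (trail-0xA1 + 0x7E-0x40 + 1) if trail > 0x7E else trail - 0x40
index = (lead-0x81) * row + cell

for x in data:
    print(x, hex(data[x][index]))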

I get

opera-hk 0x55f0
firefox 0x9c1f
chrome 0xecd7
firefox-hk 0x55f0
opera 0xfffd
chrome-hk 0x55f0
internetexplorer 0xecd7

indicating browsers agree for big5-hkscs but not for big5, where only
Chrome and Internet Explorer match (both mapping to the PUA). Similar
results for your other examples.
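
The index arithmetic from the script can be wrapped in a small function,
which makes it easier to check other byte pairs. A minimal sketch: the name
big5_index is made up, and the byte ranges (lead from 0x81, trail in
0x40-0x7E or 0xA1-0xFE) are taken from the constants in the script above,
not from any specification.

```python
# Number of trail bytes per lead byte: 0x40-0x7E plus 0xA1-0xFE.
ROW = (0xFE - 0xA1) + (0x7E - 0x40) + 2


def big5_index(lead, trail):
    """Index into a big5.json array for a lead/trail byte pair."""
    if 0x40 <= trail <= 0x7E:
        cell = trail - 0x40
    elif 0xA1 <= trail <= 0xFE:
        # Skip past the 0x40-0x7E block that comes first in each row.
        cell = (trail - 0xA1) + (0x7E - 0x40) + 1
    else:
        raise ValueError("trail byte out of range: 0x%02X" % trail)
    return (lead - 0x81) * ROW + cell


# The three byte pairs from the quoted examples.
for lead, trail in [(0x94, 0xDC), (0x9D, 0xF5), (0x9D, 0xEF)]:
    print("0x%02X%02X -> %d" % (lead, trail, big5_index(lead, trail)))
```

Unlike the one-line expression in the script, this rejects trail bytes
outside both ranges instead of silently computing an index for them.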


> It seems fairly obvious that the most sane solution would be to just use  
> a more correct mapping that doesn't involve the PUA, but:
>
> 1. What is the compatible subset of all browsers?
> 2. Does that subset include anything mapping to the PUA?

This depends on whether or not you include big5-hkscs results. Opera never  
maps to PUA, but whether that is compatible enough is unclear.
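
To make the PUA question concrete, the values reported earlier in this mail
for bytes 0x9D 0xF5 can be classified by code point range. A sketch only:
the dictionary copies the output listed above, the function name classify
is made up, and the range used is the BMP Private Use Area, U+E000 to
U+F8FF.

```python
# Results for lead 0x9D, trail 0xF5, copied from the output in this mail.
RESULTS = {
    "opera-hk": 0x55F0,
    "firefox": 0x9C1F,
    "chrome": 0xECD7,
    "firefox-hk": 0x55F0,
    "opera": 0xFFFD,
    "chrome-hk": 0x55F0,
    "internetexplorer": 0xECD7,
}


def classify(cp):
    """Label a code point as PUA, replacement character, or non-PUA."""
    if cp == 0xFFFD:
        return "replacement character"
    if 0xE000 <= cp <= 0xF8FF:
        return "PUA"
    return "non-PUA"


for name in sorted(RESULTS):
    print("%-16s U+%04X %s" % (name, RESULTS[name], classify(RESULTS[name])))
```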


> 3. Do Hong Kong or Taiwan sites depend on charCodeAt returning values in  
> the PUA?
>
> 4. Would hacks be needed on the font-loading side if browsers started  
> using a more correct mapping?

Don't know.


Mozilla has done a number of interesting things here that nobody else
does, but that was all big in '05 or earlier.

https://bugzilla.mozilla.org/show_bug.cgi?id=9686
https://bugzilla.mozilla.org/show_bug.cgi?id=310299

How relevant that is today, given that they are not the market leader  
there, is unclear.


Given the information from Microsoft referenced at the start of this email,
I think simply following Internet Explorer here is probably the best way
forward, combined with strongly discouraging the use of big5.


-- 
Anne van Kesteren
http://annevankesteren.nl/

