[whatwg] Encoding: big5 and big5-hkscs

Anne van Kesteren annevk at opera.com
Thu Mar 29 02:16:42 PDT 2012

On Wed, 28 Mar 2012 17:40:58 +0200, Philip Jägenstedt <philipj at opera.com> wrote:
> Making big5 and big5-hkscs aliases sounds like a good idea, on the  
> assumption that big5-hkscs is a pure extension of Big5.

I believe it is not, but given that Windows a) treats them identically  
and b) reportedly has no different default setup for Hong Kong and Taiwan  
users (and no longer offers an HKSCS download), they can probably be  
considered the same.

For more details on Windows and Internet Explorer, see:  

> To make this more concrete, here are a few fairly common characters that  
> I think are in big5-hkscs but not in big5, their unicode point and byte  
> representation in big5-hkscs when converted using Python:
> 啫 U+556B '\x94\xdc'
> 嗰 U+55F0 '\x9d\xf5'
> 嘅 U+5605 '\x9d\xef'
> I'm not sure how to use big5.json, so perhaps you can tell me what these  
> map to in various browsers? If they're all the same, examples of byte  
> sequences that don't would be interesting.
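
For reference, the conversion described in the quoted text can be reproduced roughly as follows. This is only a sketch: it uses the codec names "big5hkscs" and "big5" from Python's standard library, which say nothing about what any browser does, and the helper names are made up for illustration.

```python
# Sketch: encode the three example characters with Python's built-in codecs.
# "big5hkscs" and "big5" are Python codec names, not browser decoders.

def encode_or_none(char, codec):
    """Return the encoded bytes, or None if the codec cannot encode char."""
    try:
        return char.encode(codec)
    except UnicodeEncodeError:
        return None

# U+556B, U+55F0 and U+5605 are the three characters from the quoted mail.
samples = ["\u556b", "\u55f0", "\u5605"]

def describe(char):
    """Format one line: code point plus the bytes each codec produces."""
    hkscs = encode_or_none(char, "big5hkscs")
    plain = encode_or_none(char, "big5")
    return "U+%04X big5-hkscs=%r big5=%r" % (ord(char), hkscs, plain)

for ch in samples:
    print(describe(ch))
```

A None in the big5 column would mean that Python's plain big5 codec has no mapping for that character.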

big5.json is the result of outputting all possible lead/trail byte  
combinations and then running charCodeAt over the resulting string, while  
accounting for surrogates and working around a minor problem in Opera.  
Running the following (Python):

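import json

# big5.json maps a browser label to a flat list of code points,
# one entry per lead/trail byte combination.
with open("big5.json", "r") as f:
    data = json.load(f)

lead = 0x9D
trail = 0xF5

# Trail bytes run 0x40-0x7E and 0xA1-0xFE, so each lead byte has 157 cells.
row = 0xFE-0xA1 + 0x7E-0x40 + 2
cell = (trail-0xA1 + 0x7E-0x40 + 1) if trail > 0x7E else trail - 0x40
index = (lead-0x81) * row + cell

for x in data:
    print(x, hex(data[x][index]))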

I get

opera-hk 0x55f0
firefox 0x9c1f
chrome 0xecd7
firefox-hk 0x55f0
opera 0xfffd
chrome-hk 0x55f0
internetexplorer 0xecd7

indicating that browsers agree for big5-hkscs and not at all for big5. I  
get similar results for your other examples.
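
To look up other byte pairs, the lead/trail arithmetic from the script above can be wrapped in a pair of functions. This is a minimal sketch: the function names and the upper bound of 0xFE on the lead byte are my assumptions, the rest follows the script.

```python
# Sketch of the lead/trail <-> index arithmetic used with big5.json.
# Function names are illustrative; the lead byte upper bound is assumed.
LEAD_MIN = 0x81
LOW_CELLS = 0x7E - 0x40 + 1            # trail bytes 0x40-0x7E: 63 cells
ROW = LOW_CELLS + (0xFE - 0xA1 + 1)    # plus trail bytes 0xA1-0xFE: 157 cells

def big5_index(lead, trail):
    """Map a lead/trail byte pair to its position in the flat list."""
    if not LEAD_MIN <= lead <= 0xFE:
        raise ValueError("lead byte out of range")
    if 0x40 <= trail <= 0x7E:
        cell = trail - 0x40
    elif 0xA1 <= trail <= 0xFE:
        cell = trail - 0xA1 + LOW_CELLS
    else:
        raise ValueError("trail byte out of range")
    return (lead - LEAD_MIN) * ROW + cell

def big5_bytes(index):
    """Inverse of big5_index: recover the lead/trail byte pair."""
    lead, cell = divmod(index, ROW)
    trail = cell + 0x40 if cell < LOW_CELLS else cell - LOW_CELLS + 0xA1
    return lead + LEAD_MIN, trail

print(big5_index(0x9D, 0xF5))
print([hex(b) for b in big5_bytes(big5_index(0x9D, 0xF5))])
```

Unlike the one-off script, this rejects trail bytes that fall in the gap between 0x7E and 0xA1 instead of silently computing an index for them.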

> It seems fairly obvious that the most sane solution would be to just use  
> a more correct mapping that doesn't involve the PUA, but:
> 1. What is the compatible subset of all browsers?
> 2. Does that subset include anything mapping to the PUA?

This depends on whether or not you include big5-hkscs results. Opera never  
maps to PUA, but whether that is compatible enough is unclear.
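
As a rough illustration of the PUA question, the values printed earlier for lead 0x9D, trail 0xF5 can be sorted by the kind of code point they are. The numbers are copied from the output above; the helper name and category labels are just for this sketch.

```python
# Sketch: classify the code points reported above for bytes 0x9D 0xF5.
# The values are copied from the output earlier in this mail.
observed = {
    "opera-hk": 0x55F0,
    "firefox": 0x9C1F,
    "chrome": 0xECD7,
    "firefox-hk": 0x55F0,
    "opera": 0xFFFD,
    "chrome-hk": 0x55F0,
    "internetexplorer": 0xECD7,
}

def classify(cp):
    """Return 'replacement', 'pua' or 'other' for a code point."""
    if cp == 0xFFFD:
        return "replacement"
    # Private Use Areas: the BMP block plus planes 15 and 16.
    if (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD):
        return "pua"
    return "other"

groups = {}
for name, cp in observed.items():
    groups.setdefault(classify(cp), []).append(name)

for kind in sorted(groups):
    print(kind, sorted(groups[kind]))
```

For this one byte pair only the Chrome and Internet Explorer big5 results land in the PUA, and Opera's big5 result is U+FFFD. Running the same check over the full lists in big5.json would be needed to answer the question in general.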

> 3. Do Hong Kong or Taiwan sites depend on charCodeAt returning values in  
> the PUA?
> 4. Would hacks be needed on the font-loading side if browsers started  
> using a more correct mapping?

Don't know.

Mozilla has done a number of interesting things here that nobody else  
does, but that was all back in 2005 or earlier.


How relevant that is today, given that they are not the market leader  
there, is unclear.

Given the information from Microsoft indicated at the start of this email,  
I think just following Internet Explorer here is probably the best way  
forward, combined with strongly discouraging the use of big5.

Anne van Kesteren
