[whatwg] Encoding: big5 and big5-hkscs
philipj at opera.com
Thu Apr 12 00:26:51 PDT 2012
On Mon, 09 Apr 2012 03:08:20 +0200, Øistein E. Andersen <liszt at coq.no>
wrote:
> On 8 Apr 2012, at 18:03, Philip Jägenstedt wrote:
>> On Sat, 07 Apr 2012 16:04:55 +0200, Øistein E. Andersen <liszt at coq.no>
>> wrote:
>>> Suggested change: map C6CD to U+5E7A.
>> These are the existing mappings:
>> C6CD =>
>> opera-hk: U+2F33 ⼳
>> firefox: U+5E7A 幺
>> chrome: U+F6DD
>> firefox-hk: U+5E7A 幺
>> opera: U+2F33 ⼳
>> chrome-hk: U+2F33 ⼳
>> internetexplorer: U+F6DD
>> hkscs-2008: <U+2F33> ⼳
>> At least on the Web, this isn't a question of HK vs non-HK mappings.
>> Other than Firefox, which (de-facto) specs or implementations use
> I have now had a closer look at my notes
> (<http://coq.no/character-tables/chinese-traditional/en>). My argument
> for U+5E7A goes as follows:
> Of the 214 Kangxi radicals, 186 appear (as normal Han characters) in CNS
> 11643 Planes 1 or 2, whereas 25 appear in Plane 3 and 3 are missing
> altogether. Big5 only covers Planes 1 and 2, which means that 28 Kangxi
> radicals (which may be rare in running text, but are nevertheless
> important) are missing. The E-Ten 1 extension encodes 25 of the missing
> radicals in the range C6BF--C6D7. Unlike CNS 11643 and Unicode, Big5
> does not encode radicals twice (as radicals and normal characters).
> This means that Big5 with the E-Ten 1 extension contains 211 of the 214
> Kangxi radicals, all mapped to normal Han characters, and no codepoints
> mapped to Unicode Kangxi Radicals in the range U+2F00--U+2FD5.
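
The counts in the paragraph above can be checked with a few lines of Python. This is only an arithmetic sketch: the figures 186, 25 and 3 are taken from the text as given, and the variable names are made up for illustration.

```python
# Figures quoted in the message (not independently verified here).
in_planes_1_2 = 186   # radicals present in CNS 11643 Planes 1 or 2
in_plane_3 = 25       # radicals present only in Plane 3
missing = 3           # radicals absent from CNS 11643 altogether

total_radicals = in_planes_1_2 + in_plane_3 + missing
missing_from_big5 = in_plane_3 + missing       # Big5 covers Planes 1 and 2 only

# E-Ten 1 places the extra radicals at C6BF..C6D7 (inclusive range).
eten_slots = 0xC6D7 - 0xC6BF + 1
covered_with_eten = in_planes_1_2 + eten_slots

# Unicode Kangxi Radicals occupy U+2F00..U+2FD5 (inclusive range).
unicode_radicals = 0x2FD5 - 0x2F00 + 1

print(total_radicals, missing_from_big5, eten_slots,
      covered_with_eten, unicode_radicals)
```

The range C6BF..C6D7 holds exactly as many code points as there are Plane 3 radicals, which is consistent with the claim that E-Ten 1 fills that gap.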
> In summary: although E-Ten 1 was not defined in terms of Unicode, it is
> clear that the 25 radicals were all meant to map to normal Han
> characters, not to the special radical characters found in CNS 11643 and
> Unicode.
> Enter HKSCS. 20 of the E-Ten 1 Kangxi radical mappings (along with the
> rest of E-Ten 1 and E-Ten 2, or almost) are adopted, but the remaining 5
> are instead given new codepoints elsewhere. Whatever the reason be, 4
> of the 5 unused E-Ten positions are simply left undefined in the HKSCS
> standard, which is not much of a problem for a unified HK/non-HK Big5
> encoding. Unfortunately, the position C6CD was not left undefined, but
> instead mapped to U+2F33 (⼳), the Unicode Kangxi Radical version of
> U+5E7A (幺), thus introducing not only the only Unicode Kangxi Radical
> into the HKSCS standard, but also a Unicode mapping that is incompatible
> with previous Big5 versions. I wish I knew why.
>> Possibly, one could argue that U+2F33 normalizes (NFKC) to U+5E7A, but
>> it's not the only hanzi in HKSCS-2008 that normalizes into something
>> 8BC3 => <U+2F878> 屮 => <U+5C6E> 屮
>> 8BF8 => <U+F907> 龜 => <U+9F9C> 龜
>> 8EFD => <U+2F994> 芳 => <U+82B3> 芳
>> 8FA8 => <U+2F9B2> 䕫 => <U+456B> 䕫
>> 8FF0 => <U+2F9D4> 貫 => <U+8CAB> 貫
>> C6CD => <U+2F33> ⼳ => <U+5E7A> 幺
>> 957A => <U+2F9BC> 蜨 => <U+8728> 蜨
>> 9874 => <U+2F825> 勇 => <U+52C7> 勇
>> 9AC8 => <U+2F83B> 吆 => <U+5406> 吆
>> 9C52 => <U+2F8CD> 晉 => <U+6649> 晉
>> A047 => <U+2F840> 咢 => <U+54A2> 咢
>> FC48 => <U+2F894> 弢 => <U+5F22> 弢
>> FC77 => <U+2F8A6> 慈 => <U+6148> 慈
> The other pairs all contain characters that look slightly different,
> whereas U+5E7A and U+2F33 look the same (and, I believe, are supposed to
> look the same), the only difference being that the former is a normal
> Han character whereas the latter carries the additional semantics of
> being a Kangxi radical.
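
The difference between U+2F33 and the other entries in the list can be seen with Python's unicodedata module. A minimal sketch, using three code points from the list; the results depend on the Unicode data shipped with the Python version in use.

```python
import unicodedata

def forms(ch):
    """Return the NFC and NFKC normalizations of a single character."""
    return (unicodedata.normalize("NFC", ch),
            unicodedata.normalize("NFKC", ch))

radical = "\u2f33"        # Kangxi radical, mapped from C6CD by HKSCS-2008
compat_bmp = "\uf907"     # CJK compatibility ideograph, mapped from 8BF8
compat_sip = "\U0002f878" # CJK compatibility ideograph, mapped from 8BC3

for ch in (radical, compat_bmp, compat_sip):
    nfc, nfkc = forms(ch)
    print("U+%04X  NFC: U+%04X  NFKC: U+%04X" % (ord(ch), ord(nfc), ord(nfkc)))
```

The compatibility ideographs already change under NFC (a canonical mapping), whereas U+2F33 changes only under NFKC (a compatibility mapping).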
That the characters in the above list look slightly different is really a
font issue; they are canonically equivalent in Unicode and therefore the
>> I'm not sure what the conclusion is...
> I am not entirely sure either. It seems clear that the mapping from
> C6CD to U+2F33 makes no sense for non-HKSCS Big5 (which does not encode
> U+5E7A anywhere else), but it does not seem to make much sense for
> Big5-HKSCS either, which suggests that I might be missing something.
U+2F33 is indeed the only Kangxi Radical (2F00-2FDF) mapped by HKSCS-2008
and I agree that it's weird. However, unless U+2F33 causes problems on
real-world pages, I'm not really comfortable with fixing bugs in
HKSCS-2008, at least not based only on agreement by two Northern Europeans
like us... If users or implementors from Hong Kong or Taiwan also speak up
for U+5E7A, then I will not object. I posted
a few days ago seeking such feedback, but so far no one has commented on
this specific issue.
>>> On Fri Apr 6 06:42:26 PDT 2012, Philip Jägenstedt <philipj at
>>> opera.com> wrote:
>>>> Also, a single mapping fails the Big5-contra[di]ction test:
>>>> F9FE =>
>>>> opera-hk: U+FFED ￭
>>>> firefox: U+2593 ▓
>>>> chrome: U+2593 ▓
>>>> firefox-hk: U+2593 ▓
>>>> opera: U+2593 ▓
>>>> chrome-hk: U+FFED ￭
>>>> internetexplorer: U+2593 ▓
>>>> hkscs-2008: <U+FFED> ￭
>>>> I'd say that we should go with U+FFED here, since that's what the
>>>> [HKSCS-2008] spec
>>>> says and it's visually close anyway.
>>> Given that the goal is to define a unified Big5 (non-HK) and
>>> Big5-HKSCS encoding and that this seems to be a case of the HK
>>> standard going against everything and everyone else, perhaps more
>>> weight should be given to existing specifications and
>>> (non-HK-specific) implementations.
>>> Suggested change: map F9FE to U+2593
>> This is the only mapping where IE maps something other than PUA or "?"
>> that my mapping doesn't agree on, so I don't object to changing it.
>> Still, it would be very interesting to know why HKSCS-2008 changed it.
>> Do you know?
> No, I am afraid not. I have been wondering as well, but I have not been
> able to find an explanation.
> Lunde (if I remember correctly, 1st Edn) and Kano's 'Developing
> International Software' (1st Edn, 1995) both show something like U+2593,
> but it could of course be that popular non-Unicode (HK) Big5 fonts had
> glyphs more like U+FFED, which would make the HKSCS-2008 mapping less
> surprising. Do let me know if you discover any information on this.
On 8 Apr 2012, at 18:03, Philip Jägenstedt wrote:
> I was misremembering: Lunde actually shows a solid black square, so it
> looks like Microsoft may have changed this in its CP950 and HKSCS-2008
> restored the original meaning. [U+FFED does not seem quite right
> (half-width looks implausible), but let us not start discussing all the
> different black solid squares in Unicode.]
> Given the above, following HKSCS-2008 appears to be the best solution,
> which brings the number of problematic forward mappings down to one.
U+FFED decomposes to U+25A0 which could perhaps be more appropriate, but I
suggest sticking with U+FFED and recommending that people use UTF-8 if they
want some particular square shape.
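
For reference, the decomposition mentioned above can be looked up with Python's unicodedata module. A small sketch; the output depends on the Unicode data bundled with the interpreter.

```python
import unicodedata

candidates = ["\uffed", "\u25a0", "\u2593"]

for ch in candidates:
    # decomposition() returns an empty string when there is no mapping.
    decomp = unicodedata.decomposition(ch) or "(none)"
    print("U+%04X %-24s decomposition: %s"
          % (ord(ch), unicodedata.name(ch), decomp))

halfwidth_decomp = unicodedata.decomposition("\uffed")
halfwidth_nfkc = unicodedata.normalize("NFKC", "\uffed")
```

The "<narrow>" tag marks this as a compatibility decomposition, so U+FFED turns into U+25A0 only under NFKC/NFKD, not under NFC/NFD.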
>>> Duplicates and reverse mappings:
>> [...] it clearly needs to be defined what to do for these 100 code
>> points that have multiple mappings to Big5. I extended my Python script
>> to find these 100 duplicates and to check what Python did for 'big5',
>> falling back to 'big5-hkscs'. This is what it produced:
>> These are the ones where you (Øistein) disagree:
>>> C6CF <= U+5EF4
>>> C6D3 <= U+65E0
>>> C6D5 <= U+7676
>>> C6D7 <= U+96B6
>> AFAICT this has nothing to do with compatibility mappings, so what's
>> the reason for this?
> As I wrote, '[o]nly these mappings will work for non-HK Big5
> implementations.' My reasoning was that a random Big5 implementation
> would be more likely to include the E-Ten 1 extension than the HKSCS
> extension. On the other hand, these codepoints could be less than ideal
> if major Big5-HKSCS implementations follow the standard strictly and map
> to nothing.
>>> F9E9 <= U+255E
>>> F9EA <= U+256A
>>> F9EB <= U+2561
>>> F9F9 <= U+2550
>> Python's big5-hkscs agrees, but Python's big5 does this instead:
>> A2A5 <= U+255E
>> A2A6 <= U+256A
>> A2A7 <= U+2561
>> A2A4 <= U+2550
>> It seems safer to go with the big5 mappings, but checking what browsers
>> do would be helpful.
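
A search for duplicates of the kind described above might look roughly like this. It is not the script used in this thread, only a sketch under some assumptions: the function name find_duplicates is made up, lead bytes are taken as 81..FE and trail bytes as 40..7E and A1..FE, and the result reflects Python's own codec tables rather than any browser's.

```python
from collections import defaultdict

def find_duplicates(codec):
    """Map each decoded string to the byte pairs (as hex) that decode to it,
    keeping only strings that more than one byte pair decodes to."""
    sources = defaultdict(list)
    trail_bytes = list(range(0x40, 0x7F)) + list(range(0xA1, 0xFF))
    for lead in range(0x81, 0xFF):
        for trail in trail_bytes:
            raw = bytes([lead, trail])
            try:
                text = raw.decode(codec)
            except UnicodeDecodeError:
                continue  # unassigned in this codec
            sources[text].append(raw.hex().upper())
    return {text: codes for text, codes in sources.items() if len(codes) > 1}

duplicates = find_duplicates("big5hkscs")
for text, codes in sorted(duplicates.items()):
    points = " ".join("U+%04X" % ord(c) for c in text)
    print(points, "<=", ", ".join(codes))
```

Each printed line lists the candidate byte sequences for one Unicode string; choosing which of them the encoder should emit is the open question discussed here.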
> Does this imply that Python's big5 (non-HK) implementation does not
> include the corresponding E-Ten 2 (forward) mappings for decoding either?
So says python3:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'big5' codec can't decode bytes in position 0-1: illegal multibyte sequence
A2A4-A2A7 are fine in both big5 and big5-hkscs, however.
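
A check along these lines can be reproduced as follows. This is a sketch, not the exact command that produced the traceback above: the helper try_decode is made up for illustration, F9F9 is assumed as a representative E-Ten 2 code point, and results may vary with the Python version.

```python
def try_decode(hex_bytes, codec):
    """Decode a hex-encoded byte sequence, or return None if the codec rejects it."""
    try:
        return bytes.fromhex(hex_bytes).decode(codec)
    except UnicodeDecodeError:
        return None

for code in ("F9F9", "A2A4"):
    for codec in ("big5", "big5hkscs"):
        text = try_decode(code, codec)
        if text is None:
            shown = "undecodable"
        else:
            shown = " ".join("U+%04X" % ord(c) for c in text)
        print(code, codec, shown)
```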
On Tue, 10 Apr 2012 17:00:03 +0200, Øistein E. Andersen <liszt at coq.no>
wrote:
> Getting the double-stroked circle segments at F9FB..F9FD added to
> Unicode would make it possible to provide Unicode mappings in accordance
> with the original intent and remove four duplicate mappings. This might
> be worthwhile if the characters have not been proposed and rejected
Are there any sites that use these line drawing characters that would be
fixed by this? If not, I'm quite willing to accept the historical
accidents and move on :)