[whatwg] Encoding: big5 and big5-hkscs
Philip Jägenstedt
philipj at opera.com
Thu Apr 12 00:26:51 PDT 2012
On Mon, 09 Apr 2012 03:08:20 +0200, Øistein E. Andersen <liszt at coq.no>
wrote:
> On 8 Apr 2012, at 18:03, Philip Jägenstedt wrote:
>
>> On Sat, 07 Apr 2012 16:04:55 +0200, Øistein E. Andersen <liszt at coq.no>
>> wrote:
>>> Suggested change: map C6CD to U+5E7A.
>>
>> These are the existing mappings:
>>
>> C6CD =>
>> opera-hk: U+2F33 ⼳
>> firefox: U+5E7A 幺
>> chrome: U+F6DD
>> firefox-hk: U+5E7A 幺
>> opera: U+2F33 ⼳
>> chrome-hk: U+2F33 ⼳
>> internetexplorer: U+F6DD
>> hkscs-2008: <U+2F33> ⼳
>>
>> At least on the Web, this isn't a question of HK vs non-HK mappings.
>> Other than Firefox, which (de-facto) specs or implementations use
>> U+5E7A?
>
> I have now had a closer look at my notes
> (<http://coq.no/character-tables/chinese-traditional/en>). My argument
> for U+5E7A goes as follows:
>
> Of the 214 Kangxi radicals, 186 appear (as normal Han character) in CNS
> 11643 Planes 1 or 2, whereas 25 appear in Plane 3 and 3 are missing
> altogether. Big5 only covers Planes 1 and 2, which means that 28 Kangxi
> radicals (which may be rare in running text, but are nevertheless
> important) are missing. The E-Ten 1 extension encodes 25 of the missing
> radicals in the range C6BF--C6D7. Unlike CNS 11643 and Unicode, Big5
> does not encode radicals twice (as radicals and normal characters).
> This means that Big5 with the E-Ten 1 extension contains 211 of the 214
> Kangxi radicals, all mapped to normal Han characters, and no codepoints
> mapped to Unicode Kangxi Radicals in the range U+2F00--U+2FD5.
>
> In summary: although E-Ten 1 was not defined in terms of Unicode, it is
> clear that the 25 radicals were all meant to map to normal Han
> characters, not to the special radical characters found in CNS 11643 and
> Unicode.
>
> Enter HKSCS. 20 of the E-Ten 1 Kangxi radical mappings (along with the
> rest of E-Ten 1 and E-Ten 2, or almost) are adopted, but the remaining 5
> are instead given new codepoints elsewhere. Whatever the reason be, 4
> of the 5 unused E-Ten positions are simply left undefined in the HKSCS
> standard, which is not much of a problem for a unified HK/non-HK Big5
> encoding. Unfortunately, the position C6CD was not left undefined, but
> instead mapped to U+2F33 (⼳), the Unicode Kangxi Radical version of
> U+5E7A (幺), thus introducing not only the only Unicode Kangxi Radical
> into the HKSCS standard, but also a Unicode mapping that is incompatible
> with previous Big5 versions. I wish I knew why.
>
>> Possibly, one could argue that U+2F33 normalizes (NFKC) to U+5E7A, but
>> it's not the only hanzi in HKSCS-2008 that normalizes into something
>> else:
>>
>> 8BC3 => <U+2F878> 屮 => <U+5C6E> 屮
>> 8BF8 => <U+F907> 龜 => <U+9F9C> 龜
>> 8EFD => <U+2F994> 芳 => <U+82B3> 芳
>> 8FA8 => <U+2F9B2> 䕫 => <U+456B> 䕫
>> 8FF0 => <U+2F9D4> 貫 => <U+8CAB> 貫
>> C6CD => <U+2F33> ⼳ => <U+5E7A> 幺
>> 957A => <U+2F9BC> 蜨 => <U+8728> 蜨
>> 9874 => <U+2F825> 勇 => <U+52C7> 勇
>> 9AC8 => <U+2F83B> 吆 => <U+5406> 吆
>> 9C52 => <U+2F8CD> 晉 => <U+6649> 晉
>> A047 => <U+2F840> 咢 => <U+54A2> 咢
>> FC48 => <U+2F894> 弢 => <U+5F22> 弢
>> FC77 => <U+2F8A6> 慈 => <U+6148> 慈
>
> The other pairs all contain characters that look slightly different,
> whereas U+5E7A and U+2F33 look the same (and, I believe, are supposed to
> look the same), the only difference being that the former is a normal
> Han character whereas the latter carries the additional semantics of
> being a Kangxi radical.
That the characters in the above list look slightly different is really a
font issue; they are canonically equivalent in Unicode and therefore the
same, AFAICT.
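The equivalence claim can be checked with Python's standard unicodedata module. The sketch below is only an illustration, and describe() is a name made up for it. It takes the Unicode code points from the HKSCS-2008 list above and prints what NFC and NFKC do to each. Assuming a reasonably recent Unicode database, the twelve CJK compatibility ideographs should already change under NFC (canonical equivalence), whereas U+2F33 should change only under NFKC (compatibility equivalence).

```python
import unicodedata

# Unicode code points copied from the HKSCS-2008 list quoted above.
COMPAT_IDEOGRAPHS = [
    0x2F878, 0xF907, 0x2F994, 0x2F9B2, 0x2F9D4, 0x2F9BC,
    0x2F825, 0x2F83B, 0x2F8CD, 0x2F840, 0x2F894, 0x2F8A6,
]
KANGXI_RADICAL = 0x2F33


def describe(cp):
    """Format one code point together with its NFC and NFKC forms."""
    ch = chr(cp)
    nfc = unicodedata.normalize("NFC", ch)
    nfkc = unicodedata.normalize("NFKC", ch)
    return "U+%04X  NFC: U+%04X  NFKC: U+%04X" % (cp, ord(nfc), ord(nfkc))


for cp in COMPAT_IDEOGRAPHS + [KANGXI_RADICAL]:
    print(describe(cp))
```

If that holds, C6CD is the one entry in the list whose HKSCS-2008 mapping is not canonically equivalent to the ordinary Han character.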
>> I'm not sure what the conclusion is...
>
> I am not entirely sure either. It seems clear that the mapping from
> C6CD to U+2F33 makes no sense for non-HKSCS Big5 (which does not encode
> U+5E7A anywhere else), but it does not seem to make much sense for
> Big5-HKSCS either, which suggests that I might be missing something.
U+2F33 is indeed the only Kangxi Radical (2F00-2FDF) mapped by HKSCS-2008
and I agree that it's weird. However, unless U+2F33 causes problems on
real-world pages, I'm not really comfortable with fixing bugs in
HKSCS-2008, at least not based only on agreement by two Northern Europeans
like us... If users or implementors from Hong Kong or Taiwan also speak up
for U+5E7A, then I will not object. I posted
<http://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/0001.html>
a few days ago seeking such feedback, but so far no one has commented on
this specific issue.
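For anyone who wants to repeat the "only Kangxi Radical" check against an implementation rather than the HKSCS-2008 tables, here is a rough sketch. It scans every two-byte sequence with a Python codec and reports those that decode into the Kangxi Radicals block (U+2F00-U+2FDF). The function name kangxi_pointers is made up for this example, and the result depends on which HKSCS revision the installed Python's big5hkscs codec implements, so no particular output is claimed.

```python
def kangxi_pointers(codec="big5hkscs"):
    """List two-byte sequences that decode into the Kangxi Radicals block."""
    hits = []
    trail_bytes = list(range(0x40, 0x7F)) + list(range(0xA1, 0xFF))
    for lead in range(0x81, 0xFF):
        for trail in trail_bytes:
            raw = bytes([lead, trail])
            try:
                text = raw.decode(codec)
            except UnicodeDecodeError:
                continue  # unassigned in this codec
            # A few HKSCS sequences decode to two code points, so test each.
            if any(0x2F00 <= ord(ch) <= 0x2FDF for ch in text):
                hits.append((raw.hex().upper(),
                             ["U+%04X" % ord(ch) for ch in text]))
    return hits


for pointer, code_points in kangxi_pointers():
    print(pointer, "=>", " ".join(code_points))
```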
>>> On Fri Apr 6 06:42:26 PDT 2012, Philip Jägenstedt <philipj at
>>> opera.com> wrote:
>>>
>>>> Also, a single mapping fails the Big5-contra[di]ction test:
>>>>
>>>> F9FE =>
>>>> opera-hk: U+FFED ■
>>>> firefox: U+2593 ▓
>>>> chrome: U+2593 ▓
>>>> firefox-hk: U+2593 ▓
>>>> opera: U+2593 ▓
>>>> chrome-hk: U+FFED ■
>>>> internetexplorer: U+2593 ▓
>>>> hkscs-2008: <U+FFED> ■
>>>>
>>>> I'd say that we should go with U+FFED here, since that's what the
>>>> [HKSCS-2008] spec
>>>> says and it's visually close anyway.
>>>
>>> Given that the goal is to define a unified Big5 (non-HK) and
>>> Big5-HKSCS encoding and that this seems to be a case of the HK
>>> standard going against everything and everyone else, perhaps more
>>> weight should be given to existing specifications and
>>> (non-HK-specific) implementations.
>>>
>>> Suggested change: map F9FE to U+2593
>>
>> This is the only mapping where IE maps something other than PUA or "?"
>> that my mapping doesn't agree on, so I don't object to changing it.
>> Still, it would be very interesting to know why HKSCS-2008 changed it,
>> do you know?
>
> No, I am afraid not. I have been wondering as well, but I have not been
> able to find an explanation.
>
> Lunde (if I remember correctly, 1st Edn) and Kano's 'Developing
> International Software' (1st Edn, 1995) both show something like U+2593,
> but it could of course be that popular non-Unicode (HK) Big5 fonts had
> glyphs more like U+FFED, which would make the HKSCS-2008 mapping less
> surprising. Do let me know if you discover any information on this.
On 8 Apr 2012, at 18:03, Philip Jägenstedt wrote:
> I was misremembering: Lunde actually shows a solid black square, so it
> looks like Microsoft may have changed this in its CP950 and HKSCS-2008
> restored the original meaning. [U+FFED does not seem quite right
> (half-width looks implausible), but let us not start discussing all the
> different black solid squares in Unicode.]
>
> Given the above, following HKSCS-2008 appears to be the best solution,
> which brings the number of problematic forward mappings down to one.
U+FFED decomposes to U+25A0, which could perhaps be more appropriate, but I
suggest sticking with U+FFED and recommending that people use UTF-8 if they
want some particular square shape.
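The decomposition can be inspected with Python's standard unicodedata module. This is only a small illustration of the relationship between the two squares, not a proposal for the mapping itself.

```python
import unicodedata

half = "\uFFED"

# Character name and raw decomposition field from the Unicode database.
print(unicodedata.name(half))
print(unicodedata.decomposition(half))

# NFKC applies the compatibility decomposition; NFC leaves U+FFED alone.
print("NFKC: U+%04X" % ord(unicodedata.normalize("NFKC", half)))
print("NFC:  U+%04X" % ord(unicodedata.normalize("NFC", half)))
```

The decomposition is tagged <narrow>, i.e. it is a compatibility mapping, so only NFKC/NFKD would turn U+FFED into U+25A0.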
>>> Duplicates and reverse mappings:
>>>
>>> [...]
>>
>> [...] it clearly needs to be defined what to do for these 100 code
>> points that have multiple mappings to Big5. I extended my Python script
>> to find these 100 duplicates and to check what Python did for 'big5',
>> falling back to 'big5-hkscs'. This is what it produced:
>>
>> [...]
>>
>> These are the ones where you (Øistein) disagree:
>>
>>> C6CF <= U+5EF4
>>> C6D3 <= U+65E0
>>> C6D5 <= U+7676
>>> C6D7 <= U+96B6
>>
>> AFAICT this has nothing to do with compatibility mappings, so what's
>> the reason for this?
>
> As I wrote, '[o]nly these mappings will work for non-HK Big5
> implementations.' My reasoning was that a random Big5 implementation
> would be more likely to include the E-Ten 1 extension than the HKSCS
> extension. On the other hand, these codepoints could be less than ideal
> if major Big5-HKSCS implementations follow the standard strictly and map
> to nothing.
>>> F9E9 <= U+255E
>>> F9EA <= U+256A
>>> F9EB <= U+2561
>>> F9F9 <= U+2550
>>
>> Python's big5-hkscs agrees, but Python's big5 does this instead:
>>
>> A2A5 <= U+255E
>> A2A6 <= U+256A
>> A2A7 <= U+2561
>> A2A4 <= U+2550
>>
>> It seems safer to go with the big5 mappings, but checking what browsers
>> do would be helpful.
>
> Does this imply that Python's big5 (non-HK) implementation does not
> include the corresponding E-Ten 2 (forward) mappings for decoding either?
So says python3:
>>> b'\xf9\xe9'.decode('big5')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'big5' codec can't decode bytes in position 0-1: illegal multibyte sequence
>>> b'\xf9\xe9'.decode('big5-hkscs')
'╞'
A2A4-A2A7 are fine in both big5 and big5-hkscs, however.
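To run the same check over all four pairs discussed above, something like the following sketch can be used. try_decode is a helper name invented here. The byte pairs are the ones quoted in this thread, and the results depend on the codec tables shipped with the Python version at hand.

```python
def try_decode(raw, codec):
    """Return the decoded string, or None if the codec rejects the bytes."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


# (standard Big5 bytes, E-Ten 2 bytes) for the four duplicated characters.
PAIRS = [
    (b"\xa2\xa5", b"\xf9\xe9"),
    (b"\xa2\xa6", b"\xf9\xea"),
    (b"\xa2\xa7", b"\xf9\xeb"),
    (b"\xa2\xa4", b"\xf9\xf9"),
]

for pair in PAIRS:
    for raw in pair:
        for codec in ("big5", "big5hkscs"):
            print(raw.hex().upper(), codec, repr(try_decode(raw, codec)))
```

Returning None instead of raising keeps the loop going, so the gaps in each codec show up side by side.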
On Tue, 10 Apr 2012 17:00:03 +0200, Øistein E. Andersen <liszt at coq.no>
wrote:
> Getting the double-stroked circle segments at F9FB..F9FD added to
> Unicode would make it possible to provide Unicode mappings in accordance
> with the original intent and remove four duplicate mappings. This might
> be worthwhile if the characters have not been proposed and rejected
> already.
Are there any sites that use these line drawing characters that would be
fixed by this? If not, I'm quite willing to accept the historical
accidents and move on :)
--
Philip Jägenstedt
Core Developer
Opera Software