[whatwg] Encoding: big5 and big5-hkscs
Øistein E. Andersen
liszt at coq.no
Thu Apr 12 02:52:20 PDT 2012
On 12 Apr 2012, at 08:26, Philip Jägenstedt wrote:
>>> Possibly, one could argue that U+2F33 normalizes (NFKC) to U+5E7A, but it's not the only hanzi in HKSCS-2008 that normalizes into something else:
>
> That the characters in the above list look slightly different is really a font issue, they are canonically equivalent in Unicode and therefore the same, AFAICT.
Sorry, you are right about that, of course. U+2F33 and U+5E7A are not canonically equivalent (only compatibility equivalent), and I just assumed, without thinking, that the same was true of the others.
> U+2F33 is indeed the only Kangxi Radical (2F00-2FDF) mapped by HKSCS-2008 and I agree that it's weird. However [...], I'm not really comfortable with fixing bugs in HKSCS-2008, at least not based only on agreement by two Northern Europeans like us... If users or implementors from Hong Kong or Taiwan also speak up for U+5E7A, then I will not object.
I certainly agree with that sentiment.
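For reference, the difference between canonical and compatibility equivalence for the two code points discussed here can be checked with Python's standard unicodedata module. This is only a minimal sketch; the helper name describe is made up for illustration.

```python
import unicodedata

def describe(ch):
    """Return the decomposition field and the NFC/NFKC forms of one character."""
    return {
        # Raw decomposition field from the Unicode database; a leading
        # "<...>" tag marks a compatibility (not canonical) decomposition.
        "decomposition": unicodedata.decomposition(ch),
        # NFC applies canonical equivalences only.
        "nfc": unicodedata.normalize("NFC", ch),
        # NFKC additionally applies compatibility equivalences.
        "nfkc": unicodedata.normalize("NFKC", ch),
    }

if __name__ == "__main__":
    # U+2F33 (Kangxi radical) and U+FFED (halfwidth black square)
    for ch in ("\u2f33", "\uffed"):
        info = describe(ch)
        print("U+%04X" % ord(ch),
              "decomposition:", info["decomposition"],
              "NFC: U+%04X" % ord(info["nfc"]),
              "NFKC: U+%04X" % ord(info["nfkc"]))
```

NFC leaves both characters alone, whereas NFKC folds U+2F33 into U+5E7A and U+FFED into U+25A0, so the mappings only coincide for consumers that apply compatibility normalization.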
>>>>> F9FE =>
> [...]
> U+FFED decomposes to U+25A0 which could perhaps be more appropriate,
Yes, except that A1BD maps to U+25A0.
> but I suggest sticking with U+FFED and recommending people to use UTF-8 if they want some particular square shape.
That makes sense. Cf. Python again for a less web-centric point of view:
>>> b'\xf9\xfe'.decode('big5-hkscs')
u'\uffed'
>>> b'\xf9\xfe'.decode('cp950')
u'\u2593'
>>> b'\xf9\xfe'.decode('big5')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'big5' codec can't decode bytes in position 0-1: illegal multibyte sequence
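The same comparison can be scripted rather than typed at the prompt. The following is a minimal sketch, assuming a Python 3 interpreter whose bundled codecs behave as in the session above; try_decode and compare are hypothetical helper names, not part of any library.

```python
def try_decode(data, encoding):
    """Decode data with the given codec, or return None if it is rejected."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

def compare(data, encodings=("big5", "cp950", "big5-hkscs")):
    """Map each codec name to its decoding of data (None when undecodable)."""
    return {enc: try_decode(data, enc) for enc in encodings}

if __name__ == "__main__":
    # F9FE is the disputed byte pair; A1BD is the pair that already maps
    # to U+25A0, which is why F9FE cannot simply be remapped there.
    for raw in (b"\xf9\xfe", b"\xa1\xbd"):
        for enc, text in compare(raw).items():
            if text is None:
                shown = "undecodable"
            else:
                shown = " ".join("U+%04X" % ord(c) for c in text)
            print(raw.hex(), enc, shown)
```

Passing other byte pairs, such as b'\xf9\xe9', to compare shows in the same way where the three codecs diverge.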
>> Does this imply that Python's big5 (non-HK) implementation does not include the corresponding E-Ten 2 (forward) mappings for decoding either?
>
> So says python3:
>
> >>> b'\xf9\xe9'.decode('big5')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'big5' codec can't decode bytes in position 0-1: illegal multibyte sequence
> >>> b'\xf9\xe9'.decode('big5-hkscs')
> '╞'
Python also says:
>>> b'\xf9\xe9'.decode('cp950')
u'\u255e'
> Are there any sites that use these line drawing characters that would be fixed by this? If not, I'm quite willing to accept the historical accidents and move on :)
Probably not many. Still, it seems safe to fix these four mappings if the characters are ever added to Unicode.
Øistein E. Andersen