[whatwg] Encoding: big5 and big5-hkscs

Øistein E. Andersen liszt at coq.no
Tue Apr 10 08:00:03 PDT 2012

On 8 Apr 2012, at 18:03, Philip Jägenstedt wrote:

> On Sat, 07 Apr 2012 16:04:55 +0200, Øistein E. Andersen <liszt at coq.no> wrote:
>> 	[1] <http://coq.no/character-tables/eten1.pdf>  <http://coq.no/character-tables/eten1.js>
> What is the source for the mappings in eten1.pdf?

Unihan H was considered normative for the 35 characters it covers:

		<http://coq.no/character-tables/u-eten1.pdf>  <http://coq.no/character-tables/u-eten1.js>

The remaining Unicode mappings are mostly straightforward (given a printed table showing the glyphs).

On 9 Apr 2012, at 02:08, Øistein E. Andersen wrote:

> On 8 Apr 2012, at 18:03, Philip Jägenstedt wrote:
>> On Sat, 07 Apr 2012 16:04:55 +0200, Øistein E. Andersen <liszt at coq.no> wrote:
>>> On Fri Apr 6 06:42:26 PDT 2012, Philip Jägenstedt <philipj at opera.com> wrote:
>>>> Also, a single mapping fails the Big5-contra[di]ction test:
>>>> F9FE =>
>>>> opera-hk: U+FFED ■
>>>> firefox: U+2593 ▓
>>>> chrome: U+2593 ▓
>>>> firefox-hk: U+2593 ▓
>>>> opera: U+2593 ▓
>>>> chrome-hk: U+FFED ■
>>>> internetexplorer: U+2593 ▓
>>>> hkscs-2008: <U+FFED> ■
>>>> I'd say that we should go with U+FFED here, since that's what the [HKSCS-2008] spec
>>>> says and it's visually close anyway.
>> [...]
> Lunde (if I remember correctly, 1st Edn) and Kano's 'Developing International Software' (1st Edn, 1995) both show something like U+2593, but it could of course be that popular non-Unicode (HK) Big5 fonts had glyphs more like U+FFED, which would make the HKSCS-2008 mapping less surprising.  [...]

I was misremembering:  Lunde actually shows a solid black square, so it looks like Microsoft may have changed this in its CP950 and HKSCS-2008 restored the original meaning.  [U+FFED does not seem quite right (half-width looks implausible), but let us not start discussing all the different black solid squares in Unicode.]

Given the above, following HKSCS-2008 appears to be the best solution, which brings the number of problematic forward mappings down to one.
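
For anyone wanting to repeat the F9FE comparison against a local Python installation, a minimal sketch follows. It assumes the codec names `big5`, `big5hkscs` and `cp950` as shipped with CPython; the helper name `decode_pair` is made up for illustration. The results depend on the Python version, so no particular mapping is claimed here.

```python
# Probe how the locally installed Python codecs decode a Big5 byte pair.
# Nothing here asserts which mapping is "right"; it only reports what
# each codec does on this machine.
import unicodedata

CODECS = ("big5", "big5hkscs", "cp950")


def decode_pair(raw, codecs=CODECS):
    """Return {codec: 'U+XXXX NAME' or None} for one byte sequence."""
    results = {}
    for codec in codecs:
        try:
            text = raw.decode(codec)
        except UnicodeDecodeError:
            results[codec] = None  # this codec has no mapping for the bytes
            continue
        results[codec] = " ".join(
            "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
            for ch in text
        )
    return results


if __name__ == "__main__":
    # F9FE is the byte pair discussed above.
    for codec, result in decode_pair(b"\xf9\xfe").items():
        print(codec, "=>", result)
```

A value of `None` means the codec rejects the byte pair outright, which is itself useful information when comparing implementations.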

>>> Duplicates and reverse mappings:
>> [...]
>> These are the ones where you (Øistein) disagree:
>> [...]
>>> F9E9 <= U+255E
>>> F9EA <= U+256A
>>> F9EB <= U+2561
>>> F9F9 <= U+2550
>> Python's big5-hkscs agrees, but Python's big5 does this instead:
>> A2A5 <= U+255E
>> A2A6 <= U+256A
>> A2A7 <= U+2561
>> A2A4 <= U+2550
> [...]

These are line-drawing characters with two horizontal lines.

Four such characters are included in the original unextended Big5 (A2xx).  Lunde (and several Big5-based fonts on my machine) show glyphs with the two horizontal lines quite far apart.

In contrast, the full set of line-drawing characters with double lines added by E-Ten (F9xx) has glyphs where the two lines are quite close to each other (both in Lunde and in contemporary fonts with a full set of such line-drawing characters).

A potential problem of mapping U+255E to A2A5 etc. is that a non-Unicode system will show glyphs that do not align with other line-drawing characters.  A potential problem of mapping to F9E9 etc. is that systems without support for this E-Ten extension will show nothing at all.
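
Which of the two choices a given encoder makes can be checked directly. The sketch below is only a probe, assuming CPython's `big5`, `big5hkscs` and `cp950` codecs; the helper names `encode_hex`, `describe_block` and `probe` are hypothetical. The output may differ between Python versions and need not match the behaviour quoted above.

```python
# Ask each codec which Big5 byte pair it emits for the four line-drawing
# characters discussed above, and label the pair by its lead byte.
CODE_POINTS = (0x2550, 0x255E, 0x2561, 0x256A)
CODECS = ("big5", "big5hkscs", "cp950")


def encode_hex(code_point, codec):
    """Return the encoded bytes as upper-case hex, or None if unencodable."""
    try:
        return chr(code_point).encode(codec).hex().upper()
    except UnicodeEncodeError:
        return None


def describe_block(hex_pair):
    """Classify a byte pair by lead byte (A2 = original Big5, F9 = E-Ten)."""
    if hex_pair is None:
        return "unmapped"
    if hex_pair.startswith("A2"):
        return "original Big5 (A2xx)"
    if hex_pair.startswith("F9"):
        return "E-Ten extension (F9xx)"
    return "other"


def probe(code_points=CODE_POINTS, codecs=CODECS):
    """Return {code_point: {codec: hex or None}}."""
    return {cp: {c: encode_hex(cp, c) for c in codecs} for cp in code_points}


if __name__ == "__main__":
    for cp, row in probe().items():
        for codec, hex_pair in row.items():
            print("U+%04X %-10s %s  %s"
                  % (cp, codec, hex_pair, describe_block(hex_pair)))
```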

'Proper' handling would probably require the four characters at A2xx to be added to Unicode as compatibility characters or variation sequences based on U+255E etc., but the case does not seem particularly strong unless it can be shown that line-drawing characters with two horizontal lines relatively far apart are somehow important.


Getting the double-stroked circle segments at F9FB..F9FD added to Unicode would make it possible to provide Unicode mappings in accordance with the original intent and remove four duplicate mappings.  This might be worthwhile if the characters have not been proposed and rejected already.
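
To make the notion of a duplicate mapping concrete, here is a small sketch. The forward table is illustrative only: it is assembled from the two reverse-mapping lists quoted above, on the assumption that each listed byte pair decodes to the code point shown, plus one non-duplicate entry for contrast. It is not a complete Big5 index, and the function names are invented for this example.

```python
from collections import defaultdict

# Illustrative forward table: Big5 byte pair -> Unicode code point.
FORWARD = {
    0xA2A4: 0x2550, 0xF9F9: 0x2550,
    0xA2A5: 0x255E, 0xF9E9: 0x255E,
    0xA2A6: 0x256A, 0xF9EA: 0x256A,
    0xA2A7: 0x2561, 0xF9EB: 0x2561,
    0xA440: 0x4E00,  # not a duplicate
}


def find_duplicates(forward):
    """Return {code_point: sorted byte pairs} for code points hit more than once."""
    by_code_point = defaultdict(list)
    for big5, code_point in forward.items():
        by_code_point[code_point].append(big5)
    return {cp: sorted(pairs) for cp, pairs in by_code_point.items()
            if len(pairs) > 1}


def build_reverse(forward, prefer):
    """Build a reverse table, using `prefer` to pick among duplicates."""
    by_code_point = defaultdict(list)
    for big5, code_point in forward.items():
        by_code_point[code_point].append(big5)
    return {cp: prefer(pairs) for cp, pairs in by_code_point.items()}


if __name__ == "__main__":
    for cp, pairs in sorted(find_duplicates(FORWARD).items()):
        print("U+%04X <= %s" % (cp, ", ".join("%04X" % p for p in pairs)))
```

Passing `min` as `prefer` selects the A2xx byte pairs and `max` selects the F9xx ones, which are exactly the two reverse-mapping policies under discussion.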

Øistein E. Andersen
