[whatwg] Encoding: big5 and big5-hkscs

Philip Jägenstedt philipj at opera.com
Fri Apr 6 03:54:53 PDT 2012


On Wed, 04 Apr 2012 18:05:14 +0200, Anne van Kesteren <annevk at opera.com>  
wrote:

> On Fri, 30 Mar 2012 14:00:38 +0200, Anne van Kesteren <annevk at opera.com>  
> wrote:
>> Ideally someone does detailed content analysis to figure out what the  
>> best path forward is here, though I'm not entirely sure how.
>
> I still don't know how, but thanks to Simon Pieters I gathered some URLs
> from http://dotnetdotcom.org/ and found that 22 pages (of which at
> least two are big5-hkscs encoded) out of 609 have byte sequences in the
> ranges that are distinct between big5 and big5-hkscs in most
> implementations (in IE they are identical, in Opera big5-hkscs is a
> superset I believe). The byte sequences found per URL are published
> here: http://lists.w3.org/Archives/Public/www-archive/2012Apr/0020.html
>
>
> To go from (lead, trail) to an index usable in big5.json you can use a  
> function such as:
>
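> RANGE = 0x7E - 0x40 + 1  # number of trail bytes in 0x40-0x7E (63)
>
> def get_index(lead, trail):
>      row = 0xFE-0xA1 + RANGE + 1  # 157 trail bytes per lead byte
>      cell = (trail-0xA1 + RANGE) if trail > (0x7E+1) else trail - 0x40
>      return (lead-0x81) * row + cell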
>
> I can do that for the dataset, but I need someone who is able to  
> interpret the results to see which decoding makes more sense.
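
As a worked example of the index arithmetic quoted above, here is a minimal sketch that takes RANGE to be 63, the number of trail bytes in 0x40-0x7E. That reading, the range check, and the names LOW_TRAILS, HIGH_TRAILS and ROW are assumptions for illustration, not something taken from big5.json:

```python
# Big5 trail bytes come in two runs: 0x40-0x7E and 0xA1-0xFE.
LOW_TRAILS = 0x7E - 0x40 + 1     # 63 (assumed meaning of RANGE)
HIGH_TRAILS = 0xFE - 0xA1 + 1    # 94
ROW = LOW_TRAILS + HIGH_TRAILS   # 157 cells per lead byte


def get_index(lead, trail):
    """Map a (lead, trail) byte pair to a flat table index."""
    if 0x40 <= trail <= 0x7E:
        cell = trail - 0x40
    elif 0xA1 <= trail <= 0xFE:
        cell = trail - 0xA1 + LOW_TRAILS
    else:
        raise ValueError("not a Big5 trail byte: 0x%02x" % trail)
    return (lead - 0x81) * ROW + cell


# Two of the byte pairs analysed later in this message.
for lead, trail in [(0x8B, 0xF8), (0x90, 0x5B)]:
    print(hex(lead), hex(trail), get_index(lead, trail))
```

Under that reading each lead byte covers 157 cells, and the pair (0x8b, 0xf8) lands at index 1720.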

I've gone through the whole list of URLs and analyzed the pages. Using the  
*-hk mappings for data labeled as big5 would fix pretty much all of these  
pages. Not treating big5 and big5-hkscs as aliases is clearly breaking  
pages, so I would recommend a single mapping for both.

Of the existing mappings, opera-hk seems like the overall winner. As a  
starting point for the spec, I suggest taking the intersection of  
opera-hk, firefox-hk and chrome-hk.
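
To make "intersection" concrete, here is a rough sketch that keeps only the byte sequences on which every table agrees. The dictionary layout (byte sequence to code point) and the name intersect_mappings are assumptions; the three tables are tiny excerpts of the results listed further down, not the full mappings:

```python
def intersect_mappings(*tables):
    """Keep only byte sequences that every table maps to the same code point."""
    first, *rest = tables
    return {
        seq: cp
        for seq, cp in first.items()
        if all(table.get(seq) == cp for table in rest)
    }


# Excerpts from the per-sequence results in this message.
opera_hk = {b"\x8b\xf8": 0xF907, b"\x8e\x4e": 0x259AC, b"\xc7\x55": 0x306E}
firefox_hk = {b"\x8b\xf8": 0xF907, b"\x8e\x4e": 0xE31F, b"\xc7\x55": 0x306E}
chrome_hk = {b"\x8b\xf8": 0xF907, b"\x8e\x4e": 0x259AC, b"\xc7\x55": 0x306E}

common = intersect_mappings(opera_hk, firefox_hk, chrome_hk)
for seq, cp in sorted(common.items()):
    print(seq.hex(), "U+%04X" % cp)
```

In this excerpt \x8e\x4e drops out because firefox-hk disagrees with the other two, which is the kind of entry that would then need a separate decision.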

The tedious but fun (if you like Chinese) analysis follows. In case the  
encoding is messed up in transit, it's also available at  
<https://gitorious.org/whatwg/big5/blobs/master/big5.txt>.

== The useful sources ==

These are byte sequences that appear to be deliberate and that make some
sort of sense in context. I've written the context in Chinese, with the
byte sequences under investigation left escaped in the form \x00\x00.

> leetm.mingpao.com/cfm/Forum3.cfm?CategoryID=2&TopicID=2720&TopicOrder=Desc&TopicPage=64
> <0xA1: [('0x8b', '0xf8'), ('0x90', '0x5b')]
> 0xA3: []
> 0xC6-0xC8: []

重唔變晒烏\x8b\xf8縮得就縮

有本地產婦出現作動\x90\x5b象亦要輪候五天才可入住私家醫院

\x8b\xf8 =>

opera-hk: U+F907 龜
firefox: U+80E7 胧
chrome: U+F570 
firefox-hk: U+F907 龜
opera: U+FFFD �
chrome-hk: U+F907 龜
internetexplorer: U+F570 

\x90\x5b =>

opera-hk: U+8FF9 迹
firefox: U+823B 舻
chrome: U+E466 
firefox-hk: U+8FF9 迹
opera: U+FFFD �
chrome-hk: U+8FF9 迹
internetexplorer: U+E466 

The *-hk mappings seem correct, since 烏龜 means turtle and 迹象 means sign.

Winners: opera-hk, firefox-hk, chrome-hk
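
The escaped contexts can be expanded mechanically once a candidate mapping is chosen. This is only a sketch: fill_escapes and the table layout are made-up names for illustration, and the table holds just the two *-hk results listed above:

```python
import re

# Matches one escaped byte pair written as \xNN\xNN in the context text.
ESCAPED_PAIR = re.compile(r"\\x([0-9a-fA-F]{2})\\x([0-9a-fA-F]{2})")


def fill_escapes(text, table):
    """Replace each escaped pair that the table knows; leave the rest as-is."""
    def substitute(match):
        pair = (int(match.group(1), 16), int(match.group(2), 16))
        code_point = table.get(pair)
        return chr(code_point) if code_point is not None else match.group(0)
    return ESCAPED_PAIR.sub(substitute, text)


# The *-hk results for the two sequences on this page.
hk_excerpt = {(0x8B, 0xF8): 0xF907, (0x90, 0x5B): 0x8FF9}

context = r"重唔變晒烏\x8b\xf8縮得就縮"
print(fill_escapes(context, hk_excerpt))
```

Swapping in another implementation's table for hk_excerpt makes it easy to compare readings of the same context side by side.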


> board.phonehk.com/archiver/?tid-156148.html
> <0xA1: [('0x9d', '0xeb')]
> 0xA3: []
> 0xC6-0xC8: []

我唔識點係itune度轉mp4呀,解壓之後係.m4a\x9d\xeb,係咪即係呢個??

(Cantonese, about "uncompressing" mp4 to m4a...)

This was quoted from the previous comment, where the character in question
was encoded as the numeric character reference &#22083;. That's 噃 (a modal
particle), which seems to make sense here.

\x9d\xeb =>

opera-hk: U+5643 噃
firefox: U+ECCD 
chrome: U+ECCD 
firefox-hk: U+5643 噃
opera: U+FFFD �
chrome-hk: U+5643 噃
internetexplorer: U+ECCD 

Winners: opera-hk, firefox-hk, chrome-hk


> www.millionbook.net/gd/h/huishuianyangjiumin/qmt/006.htm
> <0xA1: [('0x8f', '0x73'), ('0x8e', '0x4e'), ('0x8e', '0x4e')]
> 0xA3: []
> 0xC6-0xC8: []

那朱媽媽正在廚下催臉水,剛進角門,听得里邊打罵,立住腳,向\x8f\x73子眼里一瞧,探知緣故。

‘槐蔭未擎\x8e\x4e鷺足’,是宮槐之下,未列著鷺序\x8e\x4e班,喻未仕也。

This looks like classical Chinese, which I don't understand. However, it's  
interesting to look at alternative mappings:

\x8e\x4e =>

opera-hk: U+259AC 𥦬
firefox: U+86F1 蛱
chrome: U+E31F 
firefox-hk: U+E31F 
opera: U+FFFD �
chrome-hk: U+259AC 𥦬
internetexplorer: U+E31F 

At least on my computer, U+E31F and U+259AC are rendered the same, and  
that rendering matches <http://www.unicode.org/charts/PDF/U20000.pdf>.  
U+E31F is in the PUA, so U+259AC is correct.

\x8f\x73 =>

opera-hk: U+25C91 𥲑
firefox: U+9F80 龀
chrome: U+E3E1 
firefox-hk: U+E3E1 
opera: U+FFFD �
chrome-hk: U+25C91 𥲑
internetexplorer: U+E3E1 

U+25C91 is correct for the same reasons.

Winners: opera-hk, chrome-hk

(Needs verification by someone who can read classical Chinese.)
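
The PUA argument used for this page can be checked mechanically. A minimal sketch, assuming Python's unicodedata module as the source of truth for the "private use" category; the helper names are made up for illustration, and the results are copied from the \x8e\x4e list above:

```python
import unicodedata


def is_private_use(code_point):
    """True if the code point is in a Private Use Area (category Co)."""
    return unicodedata.category(chr(code_point)) == "Co"


def plausible_results(results):
    """Drop PUA code points and U+FFFD, keeping everything else."""
    return {
        name: cp
        for name, cp in results.items()
        if cp != 0xFFFD and not is_private_use(cp)
    }


results_8e4e = {
    "opera-hk": 0x259AC, "firefox": 0x86F1, "chrome": 0xE31F,
    "firefox-hk": 0xE31F, "opera": 0xFFFD, "chrome-hk": 0x259AC,
    "internetexplorer": 0xE31F,
}

for name, cp in plausible_results(results_8e4e).items():
    print(name, "U+%04X" % cp)
```

This only rules out the PUA and U+FFFD results. The plain firefox mapping survives the filter, so it still has to be rejected by reading the context.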


> www.toysdaily.com/discuz/forum-24-2.html
> <0xA1: []
> 0xA3: []
> 0xC6-0xC8: [('0xc7', '0x55')]

This is "[個人收藏]一抽即中 (One Piece Q版盒蛋 ~ 海底\xc7\x55樂園)" which  
links to this item:

http://www.toysdaily.com/discuz/thread-180080-1-2.html

\xc7\x55 is the Japanese hiragana の, which is occasionally used instead  
of 的 or 之, see <http://en.wiktionary.org/wiki/の#Mandarin>.

\xc7\x55 =>

opera-hk: U+306E の
firefox: U+306E の
chrome: U+F724 
firefox-hk: U+306E の
opera: U+306E の
chrome-hk: U+306E の
internetexplorer: U+F724 

U+F724 is in the PUA, so U+306E is correct.

Winners: opera-hk, firefox, firefox-hk, opera, chrome-hk


> forum.mingpao.com/cfm/Forum3.cfm?OwnerID=1&CategoryID=3&TopicID=524&Page=5
> <0xA1: [('0x8e', '0xe0'), ('0x9d', '0xf8'), ('0x9d', '0xf8'), ('0x9d',
> '0xf8')]
> 0xA3: []
> 0xC6-0xC8: []

The source is a post by "又一痛\x8e\xe0":

西方傳媒每逢見到這種新聞,都雀躍萬分,跟住\x9d\xf8反中亂港人仕就隨之而起舞,抺黑中國為首任.中國的發展是剛起步,一些黑暗的事一定會發生,我們不要以為西方普通的事在中國就一定會有,唔該俾\x9d\xf8耐性對中國,唔好一有事就跳出來協助西方人抺黑中國啦.唔通類似這些事情在一些民主國家無發生咩,例如印度,菲律賓等國家,為甚麼那班抺黑中國的人不提一\xfa\xef呢.公道\x9d\xf8好唔好,你都是中國人來的.

(Cantonese, criticizing the western media's anti-Chinese bias.)

\x8e\xe0 =>

opera-hk: U+811A 脚
firefox: U+9C82 鲂
chrome: U+E38F 
firefox-hk: U+811A 脚
opera: U+FFFD �
chrome-hk: U+811A 脚
internetexplorer: U+E38F 

\x9d\xf8 =>

opera-hk: U+5572 啲
firefox: U+9C53 鱓
chrome: U+ECDA 
firefox-hk: U+5572 啲
opera: U+FFFD �
chrome-hk: U+5572 啲
internetexplorer: U+ECDA 

\xfa\xef =>

opera-hk: U+5413 吓
firefox: U+7E92 纒
chrome: U+E08D 
firefox-hk: U+5413 吓
opera: U+FFFD �
chrome-hk: U+5413 吓
internetexplorer: U+E08D 

The *-hk mappings look very plausible, especially given 啲好唔好. The rest  
are pretty obviously wrong.

Winners: opera-hk, firefox-hk, chrome-hk

(Needs verification by someone who knows Cantonese.)


> www30.discuss.com.hk/archiver/?tid-9026420.html
> <0xA1: [('0x9d', '0xef')]
> 0xA3: []
> 0xC6-0xC8: []

師兄你\x9d\xef表達能力仲驚人, 一語道破成件事.

\x9d\xef =>

opera-hk: U+5605 嘅
firefox: U+9B8B 鮋
chrome: U+ECD1 
firefox-hk: U+5605 嘅
opera: U+FFFD �
chrome-hk: U+5605 嘅
internetexplorer: U+ECD1 

嘅 seems correct in context.

Winners: opera-hk, firefox-hk, chrome-hk


> www28.discuss.com.hk/viewthread.php?tid=7539844&extra=page%3D1&page=10
> <0xA1: [('0x9d', '0xf7')]
> 0xA3: []
> 0xC6-0xC8: []

一開始用斯路在悟空下方出龜波,斯路死\x9d\xf7悟飯就爆氣,狂出龜波,如果死埋有神龍。

\x9d\xf7 also appeared in another source:

> www.hacken.cc/bbs/thread-318592-6-1.html
> <0xA1: [('0x9d', '0xf7'), ('0x89', '0x59'), ('0x89', '0x72')]
> 0xA3: []
> 0xC6-0xC8: []

This is from a comment in mixed simplified and traditional Chinese. First  
the traditional bit:

我發言後就彈\x9d\xf7依句:

\x9d\xf7 =>

opera-hk: U+5497 咗
firefox: U+9C26 鰦
chrome: U+ECD9 
firefox-hk: U+5497 咗
opera: U+FFFD �
chrome-hk: U+5497 咗
internetexplorer: U+ECD9 

U+5497 咗 seems correct, the rest are obviously bogus.

This is the simplified bit:

对不起,您暂时\xfc\xd3法\x89\x59言,可能是以下原因
1,您申请加入该群,正在等待验证通过。
2,您已\x89\x72退出该群。

\xfc\xd3 =>

opera-hk: U+65E0 无
firefox: U+75C3 痃
chrome: U+E1AB 
firefox-hk: U+65E0 无
opera: U+FFFD �
chrome-hk: U+65E0 无
internetexplorer: U+E1AB 

\x89\x59 =>

opera-hk: U+53D1 发
firefox: U+829C 芜
chrome: U+F3B9 
firefox-hk: U+53D1 发
opera: U+FFFD �
chrome-hk: U+53D1 发
internetexplorer: U+F3B9 

\x89\x72 =>

opera-hk: U+7ECF 经
firefox: U+8F93 输
chrome: U+F3D2 
firefox-hk: U+7ECF 经
opera: U+FFFD �
chrome-hk: U+7ECF 经
internetexplorer: U+F3D2 

It's complete news to me that Big5-HKSCS can encode some simplified
Chinese characters, but the *-hk mappings are correct.

Winners: opera-hk, firefox-hk, chrome-hk


> www28.discuss.com.hk/viewthread.php?tid=7319244&extra=page%3D1&page=10
> <0xA1: [('0xa0', '0x4f')]
> 0xA3: []
> 0xC6-0xC8: []

『飢餓穴』是臨食\xa0\x4f之前十五分鐘去按呢!

\xa0\x4f =>

opera-hk: U+24ABB 𤪻
firefox: U+5622 嘢
chrome: U+EE2A 
firefox-hk: U+EE2A 
opera: U+FFFD �
chrome-hk: U+24ABB 𤪻
internetexplorer: U+EE2A 

This is Cantonese, which I don't really know, but from some searching the
firefox mapping looks plausible. However, U+EE2A (PUA) and U+24ABB look
the same in some fonts, so probably U+24ABB is correct.

Winners: ?


> www.fhs.gov.hk/tc_chi/health_info/class_life/child/child.html
> <0xA1: [('0x8f', '0xc0')]
> 0xA3: []
> 0xC6-0xC8: []

<a href="http://www.dh.gov.hk/" target="_blank"><img  
src="../../../images/health_info/health_info_02.jpg" alt="\x8f\xc0生署"  
border="0"></a>

\x8f\xc0 =>

opera-hk: U+885E 衞
firefox: U+7F33 缳
chrome: U+E40C 
firefox-hk: U+885E 衞
opera: U+FFFD �
chrome-hk: U+885E 衞
internetexplorer: U+E40C 

Follow the link to http://www.dh.gov.hk/ and there can be no doubt that  
衞生署 is correct.

Winners: opera-hk, firefox-hk, chrome-hk


> www.books.com.tw/exep/prod/books/editorial/publisher_booklist.php?pubid=sharppnt&qseries=sharppnt9B05
> <0xA1: []
> 0xA3: []
> 0xC6-0xC8: [('0xc7', '0x5c'), ('0xc7', '0x66'), ('0xc7', '0x5c'),  
> ('0xc7',
> '0x66')]

These are hiragana in 柴門ふみ which is simply the name of a Japanese  
author: http://en.wikipedia.org/wiki/Fumi_Saimon

\xc7\x5c =>

opera-hk: U+3075 ふ
firefox: U+3075 ふ
chrome: U+F72B 
firefox-hk: U+3075 ふ
opera: U+3075 ふ
chrome-hk: U+3075 ふ
internetexplorer: U+F72B 

\xc7\x66 =>

opera-hk: U+307F み
firefox: U+307F み
chrome: U+F735 
firefox-hk: U+307F み
opera: U+307F み
chrome-hk: U+307F み
internetexplorer: U+F735 

U+F72B and U+F735 are in the PUA, so U+3075 and U+307F are correct.

Winners: opera-hk, firefox, firefox-hk, opera, chrome-hk


== Mixed encodings and other nonsense ==

> hkhk.org/viewthread.php?tid=22286&extra=page%3D1
> <0xA1: []
> 0xA3: []
> 0xC6-0xC8: [('0xc8', '0xa1')]

GBK-encoded comments in <style> and <script>, e.g.:

//取得classname为t_msgfontfix 的层


> www.epochtimes.com/b5/7/1/12/n1588315.htm
> <0xA1: [('0x8b', '0x20')]
> 0xA3: []
> 0xC6-0xC8: []
>
> epochtimes.com/b5/7/12/23/n1951744.htm
> <0xA1: [('0x8b', '0x20')]
> 0xA3: []
> 0xC6-0xC8: []

Both of these are UTF-8 in a JavaScript comment:

/* DJY left 250x250, 已建立 2010/11/18 */


> photo.pchome.com.tw/wen657476/045/
> <0xA1: []
> 0xA3: []
> 0xC6-0xC8: [('0xc6', '0xe4'), ('0xc6', '0xe4'), ('0xc7', '0xae')]

\xc6\xe4 is in a script encoded as GBK:

{Icon:'/s12/w/e/wen657476/book45/p121059706328s.jpg', PK:121059706328,  
Title:'DataSet[ 20079-卡其.jpg ]', Desc:'DataSet[ 20079-卡其.jpg ]'}

\xc7\xae is a link to http://photo.pchome.com.tw/wen657476/119307520020  
encoded as GBK:

<a href="/wen657476/119307520020">k005-罗马钱夹.jpg(1)</a>

罗马钱夹 means "roman wallet", which is exactly what is being sold.
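
For pages like these, one way to confirm the diagnosis is to encode the suspected text in a few candidate encodings and see which one reproduces the flagged byte pair. A rough sketch, assuming Python's built-in "utf-8", "gbk" and "big5hkscs" codecs as stand-ins for what the pages actually use; encodings_producing is a made-up helper name:

```python
def encodings_producing(text, pair, candidates=("utf-8", "gbk", "big5hkscs")):
    """Return the candidate encodings whose bytes for text contain pair."""
    hits = []
    for name in candidates:
        try:
            data = text.encode(name)
        except UnicodeEncodeError:
            continue  # text is not representable in this encoding
        if pair in data:
            hits.append(name)
    return hits


# Suspected text and flagged byte pairs from the pages in this section.
print(encodings_producing("卡其", b"\xc6\xe4"))
print(encodings_producing("罗马钱夹", b"\xc7\xae"))
print(encodings_producing("已建立 2010", b"\x8b\x20"))
```

The first two pairs come out of the GBK bytes for 其 and 钱. The third is the last UTF-8 byte of 立 followed by a space, which is why (0x8b, 0x20) shows up on the epochtimes.com pages.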


> www.eye.hk/bbs/zboard.php?category=2&id=eyeglasses_collestables&page=1&page_num=999&sn=off&ss=on&sc=on&keyword=&select_arrange=headnum&desc=asc
> <0xA1: []
> 0xA3: []
> 0xC6-0xC8: [('0xc7', '0xd7'), ('0xc8', '0xd5'), ('0xc7', '0xd7'),  
> ('0xc8',
> '0xd5'), ('0xc7', '0xd7'), ('0xc8', '0xd5'), ('0xc8', '0xd5'), ('0xc8',
> '0xd5'), ('0xc8', '0xd5'), ('0xc8', '0xd5'), ('0xc8', '0xd5'), ('0xc8',
> '0xd5'), ('0xc8', '0xd5'), ('0xc8', '0xd5'), ('0xc8', '0xd5'), ('0xc8',
> '0xd5'), ('0xc8', '0xd5'), ('0xc8', '0xd5'), ('0xc8', '0xd5'), ('0xc6',
> '0xb0'), ('0xc6', '0xe4')]

This page mixes Big5 and GBK; nothing could save it.


> www.izincan.com/board/novelsys.php?arid=65987
> <0xA1: []
> 0xA3: []
> 0xC6-0xC8: [('0xc8', '0xeb')]

Comments in the JavaScript code at the end of the document are in GBK.


> tvcity.tvb.com/drama/wasabi_mon_amour/story/002.html
> <0xA1: [('0x8f', '0x58'), ('0x92', '0xe5'), ('0x8b', '0x95'), ('0x88',
> '0xe5'), ('0x8d', '0x80'), ('0x83', '0x3c'), ('0x8b', '0xe8'), ('0x88',
> '0x8a'), ('0x8b', '0xe8'), ('0x98', '0xe6'), ('0x92', '0xe5'), ('0x8b',
> '0x95'), ('0x81', '0x9e'), ('0x9f', '0xe6'), ('0x81', '0x93'), ('0x8f',
> '0xb8'), ('0x87', '0xe6'), ('0x96', '0x99'), ('0x8d', '0xe5'), ('0x8b',
> '0x99'), ('0x9d', '0xe6'), ('0x8a', '0xe6'), ('0x88', '0xb2'), ('0x88',
> '0xe7'), ('0x9f', '0xa5'), ('0x89', '0x8d'), ('0x9b', '0xe8'), ('0x81',
> '0x98'), ('0x91', '0x8a'), ('0x91', '0xe5'), ('0x91', '0x3c'), ('0x8f',
> '0xe9')]
> 0xA3: [('0xa3', '0xe5')]
> 0xC6-0xC8: [('0xc7', '0x55')]

The top part of the page is in Big5-HKSCS while the site navigation at the  
bottom is in UTF-8.


> www.china-holiday.com/big5/big5train/skbzhwsy3.asp?zrxx=ccxs&sfcc=北京南&cx=全部
> <0xA1: [('0x97', '0xe4'), ('0x83', '0xa8'), ('0x97', '0xe4'), ('0x83',
> '0xa8')]
> 0xA3: []
> 0xC6-0xC8: []

北京南 and 全部 are encoded as UTF-8 and end up in the <title>...


> www.iis.sinica.edu.tw/page/library/TechReport/tr2002/threebone02.html
> <0xA1: [('0x87', '0xe8'), ('0x93', '0xe5'), ('0xa0', '0xb1'), ('0x8a',
> '0x3c')]
> 0xA3: []
> 0xC6-0xC8: []

This page is mislabeled; it's actually encoded in UTF-8.


> bbs.rc-evo.com/viewthread.php?tid=73138&page=1&authorid=2487
> <0xA1: []
> 0xA3: []
> 0xC6-0xC8: [('0xc6', '0xbc'), ('0xc6', '0xbc'), ('0xc6', '0xbc')]

The page must have changed; I can't find \xc6\xbc.


> rumotan.com/guan/modules/tinyd3/
> <0xA1: [('0x90', '0xe8'), ('0x83', '0xe6'), ('0x90', '0xe8'), ('0x8f',
> '0xb4'), ('0x9d', '0xe8'), ('0x8f', '0xb4'), ('0x8f', '0xb4'), ('0x9c',
> '0x8b'), ('0x95', '0xab'), ('0x90', '0xe8'), ('0x95', '0xab'), ('0x8f',
> '0xb4'), ('0x9d', '0xe8'), ('0x8f', '0xb4'), ('0x8f', '0xb4'), ('0x81',
> '0xa3'), ('0x9d', '0xe8'), ('0x8f', '0xb4'), ('0x8f', '0xaf'), ('0x9d',
> '0xe8'), ('0x8f', '0xb4'), ('0x90', '0xe5'), ('0x9c', '0x8b'), ('0x90',
> '0xe8'), ('0x99', '0x2c'), ('0x82', '0xe6'), ('0x8f', '0x90'), ('0x9b',
> '0xe8'), ('0x97', '0x9d'), ('0x93', '0xe5'), ('0x90', '0xe8'), ('0x94',
> '0xb6'), ('0x8f', '0xe8'), ('0x88', '0x87'), ('0x95', '0xe8'), ('0x82',
> '0xe5'), ('0x8f', '0xb4'), ('0x8c', '0x31'), ('0x94', '0x9f'), ('0x97',
> '0xe6'), ('0x90', '0xe4'), ('0x8c', '0xe5'), ('0x9b', '0xe5'), ('0x8c',
> '0xe8'), ('0x99', '0x9f'), ('0x83', '0xe7'), ('0x95', '0xab'), ('0x8b',
> '0xe4'), ('0x82', '0xe6'), ('0x9b', '0xbe'), ('0x9a', '0xe6'), ('0x9c',
> '0x8b'), ('0x8e', '0xe8'), ('0x94', '0xe6'), ('0x9c', '0x83'), ('0x86',
> '0xe4'), ('0x81', '0xe8'), ('0x81', '0xb7'), ('0x82', '0x22'), ('0x92',
> '0xe5'), ('0x82', '0xe8'), ('0x97', '0x9d'), ('0x93', '0xe7'), ('0x90',
> '0xe8'), ('0x99', '0x2c')]
> 0xA3: [('0xa3', '0xe4')]
> 0xC6-0xC8: []

There's a chunk of UTF-8 in <meta>, so I looked no further.

-- 
Philip Jägenstedt
Core Developer
Opera Software


