[whatwg] A comment to character encoding declaration

Wed Mar 5 06:36:47 PST 2008

On 03/03/2008, Jjgod Jiang <gzjjgod at gmail.com> wrote:
>  During the development of CJK information processing, many
>  text encodings are strict subsets of others; for example,
>  GB2312 is a subset of GBK, and GBK is a subset of GB18030.
>  For compatibility purposes, a lot of web pages use a
>  character encoding declaration like this:
>
>  <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
>
>  in their header, yet they might use characters that are in
>  GBK but not in GB2312. So I think we can suggest that clients
>  simply treat encodings like these as their largest superset;
>  for instance, treat GB2312 as GB18030.

Out of 130K pages from dmoz.org, I see 760 which are declared as
gb2312 (by HTTP Content-Type, <meta content>, etc.).

Of those 760, 120 cause decoding errors in ICU4J when treated as
gb2312. Only 8 cause errors when treated as gbk, and the same 8
cause errors when treated as gb18030.

Those 8 are:
http://www.bigm.com.cn/dinosaur/anecdote/
http://www.ccpc.edu.cn
http://www.gdoverseaschn.com.cn/
http://www.jgbr.com.cn
http://www.liechebuluo.com
http://www.netbro.com.cn
http://www.tkdts.com
http://www.wuxi-accp.com/
I haven't tried working out why they cause errors.

The 120 are listed at
<http://philip.html5.org/data/gb2312-errors.txt>. I don't know how
many really use gb18030, or how many are not gb* at all and only
decode without errors because their byte sequences happen to be
compatible. Still, it looks like gb2312 is a fairly significant
problem unless it is treated as gbk/gb18030, so it would be helpful
to suggest or require that it be processed specially.
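
For anyone who wants to repeat this kind of comparison, here is a
minimal sketch in Python. It is not the ICU4J setup used for the
numbers above: it relies on Python's built-in gb2312, gbk and gb18030
codecs, which may not agree with ICU4J on every byte sequence. The
names failing_codecs, decode_declared and SUPERSET are hypothetical,
and the override table is only an assumption about what "treat gb2312
as gb18030" could look like in a client.

```python
# Hypothetical helpers; Python's built-in codecs stand in for ICU4J here.
CANDIDATES = ("gb2312", "gbk", "gb18030")

# Assumed override table: decode a declared label as its largest superset.
SUPERSET = {"gb2312": "gb18030", "gbk": "gb18030"}


def failing_codecs(data: bytes, codecs=CANDIDATES) -> list:
    """Return the codecs under which strict decoding of data raises an error."""
    failures = []
    for name in codecs:
        try:
            data.decode(name, errors="strict")
        except UnicodeDecodeError:
            failures.append(name)
    return failures


def decode_declared(data: bytes, declared: str) -> str:
    """Decode data, swapping the declared label for its superset if listed."""
    label = declared.strip().lower()
    return data.decode(SUPERSET.get(label, label))


# U+9F8D is a traditional character that GBK encodes but GB2312 does not,
# so this simulates a page declared as gb2312 that really contains GBK text.
page = "\u4e2d\u6587 \u9f8d".encode("gbk")
print(failing_codecs(page))
print(decode_declared(page, "GB2312"))
```

Running failing_codecs over each fetched page body and tallying the
results per codec gives the same kind of counts as above, with the
caveat that a page decoding without errors does not prove the page
really is in that encoding.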

-- 
Philip Taylor
excors at gmail.com