[whatwg] A comment to character encoding declaration

Philip Taylor excors+whatwg at gmail.com
Wed Mar 5 06:36:47 PST 2008

On 03/03/2008, Jjgod Jiang <gzjjgod at gmail.com> wrote:
>  During the development of CJK information processing, many
>  text encodings were defined as strict subsets of others; for
>  example, GB2312 is a subset of GBK, and GBK is a subset of
>  GB18030. For compatibility, a lot of web pages use a
>  character encoding declaration like this:
>  <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
>  in their header, yet they may use characters that are in GBK but
>  not in GB2312. So I think we could suggest that clients simply
>  treat encodings like these as their largest superset, for
>  instance, treat GB2312 as GB18030.

Out of 130K pages from dmoz.org, I see 760 that are declared as
gb2312 (by HTTP Content-Type, <meta content>, etc.).

Of those 760, 120 cause decoding errors in ICU4J when treated as
gb2312. Only 8 cause errors when treated as gbk, and the same 8 cause
errors when treated as gb18030.
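
For illustration only, here is a minimal sketch of this kind of
comparison. It uses Python's built-in codecs rather than ICU4J, so
counts on real pages could differ from the numbers above. The function
names and the three sample byte strings are hypothetical and are not
taken from the dmoz.org data.

```python
# Count how many byte strings fail strict decoding under each encoding.
# Assumption: Python's "gb2312", "gbk" and "gb18030" codecs stand in for
# the ICU4J decoders used in the original measurement.
CANDIDATE_ENCODINGS = ("gb2312", "gbk", "gb18030")


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if data decodes under encoding without any error."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def failure_counts(pages, encodings=CANDIDATE_ENCODINGS):
    """Map each encoding to the number of pages it fails to decode."""
    counts = {enc: 0 for enc in encodings}
    for data in pages:
        for enc in encodings:
            if not decodes_cleanly(data, enc):
                counts[enc] += 1
    return counts


# Hypothetical sample "pages":
sample_pages = [
    "中文网页".encode("gb2312"),  # uses only characters present in GB2312
    "朱镕基".encode("gbk"),       # "镕" is in GBK but not in GB2312
    b"\xff\xff",                  # not valid in any of the three encodings
]

print(failure_counts(sample_pages))
```

The second sample is the interesting case: it is declared-as-gb2312
style content that only decodes once it is treated as gbk or gb18030.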

Those 8 are:
and I haven't tried working out why they are causing errors.

The 120 are listed at
<http://philip.html5.org/data/gb2312-errors.txt>. I don't know how
many are really using gb18030, and how many are not actually gb* at
all but happen to decode without errors because they use compatible
byte sequences. Still, it does look like gb2312 is a fairly
significant problem if it's not treated as gbk/gb18030, so it would
be helpful to suggest or require that it be processed specially.
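
One way a client might implement such special processing is sketched
below. This is an assumption about how it could look, not something
specified anywhere in this thread: the override table, the choice of
gb18030 as the target, and both function names are hypothetical.

```python
# Decode content using the largest superset of the declared encoding.
# Assumption: declared labels are simple names such as "gb2312"; real
# clients would also need to handle aliases, which this sketch ignores.
SUPERSET_OVERRIDES = {
    "gb2312": "gb18030",
    "gbk": "gb18030",
}


def effective_encoding(declared: str) -> str:
    """Return the encoding to decode with for a declared label."""
    label = declared.strip().lower()
    return SUPERSET_OVERRIDES.get(label, label)


def decode_declared(data: bytes, declared: str) -> str:
    """Decode data strictly with the superset of its declared encoding."""
    return data.decode(effective_encoding(declared), errors="strict")


# A page declared as gb2312 that actually contains a GBK-only character.
page = "朱镕基".encode("gbk")
print(decode_declared(page, "GB2312"))
```

Labels without an override (for example utf-8) pass through unchanged,
so the mapping only affects the gb* family.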

Philip Taylor
excors at gmail.com
