[whatwg] A comment to character encoding declaration
excors+whatwg at gmail.com
Wed Mar 5 06:36:47 PST 2008
On 03/03/2008, Jjgod Jiang <gzjjgod at gmail.com> wrote:
> During the development of CJK information processing, many
> text encodings became strict subsets of others; for
> example, GB2312 is a subset of GBK, and GBK is a subset of
> GB18030. For compatibility purposes, a lot of web pages use a
> character encoding declaration like this:
> <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
> in their header, yet they may use characters that are in GBK
> but not in GB2312. So I think we can suggest that clients simply
> treat encodings like these as their largest superset, for
> instance, treat GB2312 as GB18030.
Out of 130K pages from dmoz.org, I see 760 which are declared as
gb2312 (by HTTP Content-Type, <meta content>, etc).
Of those 760, 120 cause decoding errors in ICU4J when treated as
gb2312. 8 cause errors when treated as gbk, and the same 8 cause
errors as gb18030.
Those 8 are:
and I haven't tried working out why they are causing errors.
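For anyone wanting to repeat this kind of check, here is a rough sketch in Python. It is not the ICU4J setup behind the numbers above: it assumes Python's built-in gb2312, gbk and gb18030 codecs, which may not accept or reject exactly the same byte sequences as ICU4J, and the function name is made up for illustration.

```python
def failing_codecs(data: bytes, codecs=("gb2312", "gbk", "gb18030")):
    """Return the names of the codecs that reject `data` under strict decoding.

    Hypothetical helper; uses Python's bundled codecs, not ICU4J.
    """
    failed = []
    for name in codecs:
        try:
            # errors="strict" raises on any byte sequence the codec cannot map
            data.decode(name, errors="strict")
        except UnicodeDecodeError:
            failed.append(name)
    return failed


if __name__ == "__main__":
    # A character that GBK encodes but GB2312 has no code point for.
    sample = "龍".encode("gbk")
    print(failing_codecs(sample))
```

A page that shows up in the gb2312 list but not in the gbk or gb18030 lists is a candidate for "declared as gb2312, actually using the larger set".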
The 120 are listed at
<http://philip.html5.org/data/gb2312-errors.txt>. I don't know how
many are really using gb18030, or how many are not actually gb* at
all but happen to decode without errors because they use compatible
byte sequences. Still, it does look like gb2312 is a fairly
significant problem if it's not treated as gbk/gb18030, so it would
be helpful to suggest/require that it be processed specially.
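As a sketch of what "processed specially" could mean in a client, the Python below maps a declared label to a larger superset before decoding. The override table and both function names are assumptions for illustration, not anything taken from a spec or an existing implementation.

```python
import codecs

# Assumed override table: decode these declared encodings with their superset.
SUPERSET = {"gb2312": "gb18030", "gbk": "gb18030"}


def effective_encoding(declared: str) -> str:
    """Return the codec name to actually use for a declared charset label."""
    try:
        # Normalise aliases and case, e.g. "GB2312" or "euc-cn" -> "gb2312"
        name = codecs.lookup(declared.strip()).name
    except LookupError:
        return declared  # unknown label: leave it for the caller to handle
    return SUPERSET.get(name, name)


def decode_declared(data: bytes, declared: str) -> str:
    """Decode page bytes using the superset of the declared encoding."""
    return data.decode(effective_encoding(declared))


if __name__ == "__main__":
    page = "龍".encode("gbk")  # bytes outside GB2312 proper
    print(decode_declared(page, "gb2312"))
```

Whether the override should target gbk or gb18030 is left open here; the table just makes the choice explicit in one place.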
excors at gmail.com