[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
Øistein E. Andersen
liszt at coq.no
Sun Apr 12 03:08:34 PDT 2009
On 2 Sep 2008, at 06:06, Ian Hickson wrote:
> On Wed, 30 Jul 2008, Øistein E. Andersen wrote:
>> 1. Opera, Firefox and Safari all handle US-ASCII as Windows-1252.
>> IE7, on the other hand, simply ignores the high bit (as it does
>> a few other 7-bit encodings, by the way). Perhaps this
>> alias could be dropped from the other browsers.
> Ignoring the high bit seems like a dangerous security bug; dropping
> characters with a high bit as U+FFFD seems unnecessarily drastic.
According to a test I did using browsershots.org, IE8 actually seems
to do the latter (8-bit characters are rendered as squares), which
looks like an argument in favour of the more `drastic' option.
> I've made the spec go with the O/F/S behaviour here.
This has the advantage of not adding ASCII as a separate encoding, and
Windows-1252 is presumably one of the encodings most often mislabelled
as ASCII. However, IE has ignored the high bit at least since 5.01
(IE4 via browsershots.org treats it as CP1252, but this could well be
locale-dependent), so there may not be that many mislabelled pages.
Has anyone got a list of pages which are labelled as ASCII and contain
8-bit characters?
This is probably not very important: U+FFFD is `purer', whereas
Windows-1252 has the potential to rescue a few pages. It is, however,
essential
that 8-bit characters be considered not conforming since they do not
in fact work (as Windows-1252 bytes) in IE5-IE8. This is currently
the case, but I think Henri Sivonen has argued that `misinterpretation
for compatibility' should not be considered a conformance error (which
would probably be fairly harmless for other mappings).
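To make the three behaviours discussed above concrete, here is a minimal Python sketch. The function names are hypothetical, and Python's `cp1252` codec is only an approximation of what browsers do (it leaves a few bytes such as 0x81 undefined, which are replaced here), so this illustrates the idea rather than any browser's actual implementation.

```python
# Three ways of decoding bytes from a page labelled US-ASCII.
# All function names are illustrative, not taken from any spec or browser.

def decode_as_windows_1252(data: bytes) -> str:
    """Treat the US-ASCII label as an alias for Windows-1252."""
    # Python's cp1252 leaves five bytes undefined; map those to U+FFFD.
    return data.decode("cp1252", errors="replace")

def decode_ignoring_high_bit(data: bytes) -> str:
    """Clear bit 7 of every byte, then decode the result as ASCII."""
    return bytes(b & 0x7F for b in data).decode("ascii")

def decode_with_replacement(data: bytes) -> str:
    """Replace every byte outside the 7-bit range with U+FFFD."""
    return data.decode("ascii", errors="replace")

if __name__ == "__main__":
    for sample in (b"caf\xe9", b"\xbcscript\xbe"):
        print(repr(decode_as_windows_1252(sample)),
              repr(decode_ignoring_high_bit(sample)),
              repr(decode_with_replacement(sample)))
```

The second sample shows why ignoring the high bit is risky: 0xBC and 0xBE become `<` and `>` once bit 7 is cleared, so bytes that look harmless as Windows-1252 turn into markup.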
>> 4. Delete (0x7F) and the C1 range (0x80--0x9F) are handled quite
>> inconsistently; [...]
> I think the HTML5 spec does what is necessary here, but it may be that
> the encodings specs are vague still.
[For the record, HTML5 currently requires delete and C1 characters (as
well as C0 save white space) to be replaced by U+FFFD during
`preprocessing of the input stream', which effectively circumvents the
problem that character encoding specifications treat this range in a
vague and inconsistent manner.]
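The replacement step described in the note above could be sketched roughly as follows. This is an illustration under stated assumptions, not the spec's algorithm: the function name is hypothetical, and the set of C0 characters counted as white space (tab, line feed, form feed, carriage return) is an assumption made for the example.

```python
REPLACEMENT = "\ufffd"

# Assumption: the C0 characters treated as white space and left untouched.
ALLOWED_C0 = {"\t", "\n", "\x0c", "\r"}

def replace_controls(text: str) -> str:
    """Replace DEL, C1 and non-white-space C0 characters with U+FFFD."""
    out = []
    for ch in text:
        cp = ord(ch)
        is_bad_c0 = cp < 0x20 and ch not in ALLOWED_C0
        is_del_or_c1 = 0x7F <= cp <= 0x9F  # DEL (0x7F) plus C1 (0x80-0x9F)
        out.append(REPLACEMENT if is_bad_c0 or is_del_or_c1 else ch)
    return "".join(out)
```

Because the check runs on decoded code points, it applies in the same way whatever the individual encoding specifications say about this range.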
Øistein E. Andersen