[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

Sun Apr 12 03:08:34 PDT 2009

On 2 Sep 2008, at 06:06, Ian Hickson wrote:

> On Wed, 30 Jul 2008, Øistein E. Andersen wrote:
>>
>> 1. Opera, Firefox and Safari all handle US-ASCII as Windows-1252.
>>    IE7, on the other hand, simply ignores the high bit (as it does
>>    for a few other 7-bit encodings, by the way).  Perhaps this
>>    alias could be dropped from the other browsers.
>
> Ignoring the high bit seems like a dangerous security bug; dropping
> any character with a high bit as U+FFFD seems unnecessarily drastic.

According to a test I ran on browsershots.org, IE8 actually seems to
do the latter (8-bit characters are rendered as squares), which looks
like an argument in favour of the more `drastic' option.

> I've made the spec go with the O/F/S behaviour here.

This has the advantage of not adding ASCII as a separate encoding, and  
Windows-1252 is presumably one of the encodings most often mislabelled  
as ASCII.  However, IE has ignored the high bit at least since 5.01  
(IE4 via browsershots.org treats it as CP1252, but this could well be  
locale-dependent), so there may not be that many mislabelled pages.   
Has anyone got a list of pages which are labelled as ASCII and contain  
8-bit characters?

This is probably not very important.  U+FFFD is `purer', whereas
Windows-1252 has the potential to rescue a few pages.  It is, however,
essential that 8-bit characters be considered non-conforming, since
they do not in fact work (as Windows-1252 bytes) in IE5-IE8.  This is
currently the case, but I think Henri Sivonen has argued that
`misinterpretation for compatibility' should not be considered a
conformance error (which would probably be fairly harmless for other
mappings).
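
To make the three behaviours concrete, here is a rough Python sketch
of what each one does to the same bytes labelled as US-ASCII.  The
function names and the sample bytes are my own illustration, not
anything taken from a browser or from the spec; Python's "cp1252"
codec stands in for the Windows-1252 mapping.

```python
def decode_as_windows_1252(data: bytes) -> str:
    """O/F/S behaviour: treat the US-ASCII label as Windows-1252."""
    # Bytes that Python's cp1252 codec leaves undefined become U+FFFD.
    return data.decode("cp1252", errors="replace")


def decode_ignoring_high_bit(data: bytes) -> str:
    """IE5-IE7 behaviour as described above: clear bit 7 of every byte."""
    return bytes(b & 0x7F for b in data).decode("ascii")


def decode_high_bytes_as_fffd(data: bytes) -> str:
    """The `drastic' option: every byte >= 0x80 becomes U+FFFD."""
    return data.decode("ascii", errors="replace")


# Hypothetical sample: 0xE9 is e-acute in Windows-1252; 0xBC and 0xBE
# are the fractions 1/4 and 3/4, which are '<' and '>' with bit 7 set.
sample = b"caf\xe9 \xbcscript\xbe"

print(decode_as_windows_1252(sample))     # café ¼script¾
print(decode_ignoring_high_bit(sample))   # cafi <script>
print(decode_high_bytes_as_fffd(sample))  # caf\ufffd \ufffdscript\ufffd
```

The second line of output shows why ignoring the high bit is a
security concern: bytes that look harmless to a filter working on the
8-bit values turn into markup once the high bit is stripped.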

>> 4. Delete (0x7F) and the C1 range (0x80--0x9F) are handled quite  
>> inconsistently; [...]
>>
>
> I think the HTML5 spec does what is necessary here, but it may be
> that the encodings specs are vague still.

[For the record, HTML5 currently requires delete and C1 characters (as
well as C0 save white space) to be replaced by U+FFFD during
`pre-processing of the input stream', which effectively circumvents
the problem that character encoding specifications treat this range in
a vague and inconsistent manner.]
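
As a minimal sketch of that replacement rule as I have summarised it
(not the spec's own wording), the following Python works on already
decoded text.  The set of C0 characters counted as white space (tab,
line feed, form feed, carriage return) is my assumption, and the
function name is purely illustrative.

```python
# Assumption: the C0 characters exempted as white space.
ALLOWED_C0_WHITESPACE = "\t\n\x0c\r"


def replace_controls(text: str) -> str:
    """Replace C0 (except white space), delete and C1 with U+FFFD."""

    def is_disallowed(ch: str) -> bool:
        if ch in ALLOWED_C0_WHITESPACE:
            return False
        cp = ord(ch)
        # C0 is U+0000-U+001F; delete is U+007F; C1 is U+0080-U+009F.
        return cp <= 0x1F or 0x7F <= cp <= 0x9F

    return "".join("\ufffd" if is_disallowed(ch) else ch for ch in text)


# Byte 0x80 decoded as ISO-8859-1 yields the C1 character U+0080,
# which the rule then turns into U+FFFD regardless of how the
# encoding specification describes that byte.
decoded = b"a\x80b\x7f\tc".decode("latin-1")
print(ascii(replace_controls(decoded)))  # 'a\ufffdb\ufffd\tc'
```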

-- 
Øistein E. Andersen