[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

Ian Hickson ian at hixie.ch
Fri Oct 23 15:25:54 PDT 2009

On Fri, 23 Oct 2009, Øistein E. Andersen wrote:
> On 23 Oct 2009, at 04:20, Ian Hickson wrote:
> > On Wed, 21 Oct 2009, Øistein E. Andersen wrote:
> > >
> > > ASCII-compatibility:
> > > The note in ‘2.1.5 Character encodings’ seems to say that [...]
> > > ISO-2022[-*] are ASCII-compatible, whereas HZ-GB-2312 is not, and I
> > > cannot find anything in Section 2.1.5 that would explain this
> > > difference.
> > 
> > HZ-GB-2312 uses the byte ASCII uses for "~" as the escape character.
> > ISO-2022-* uses the control codes. That's the difference.
> '~'/0x7E is not (and should not be, as far as I can tell) relevant for HTML5's
> concept of ASCII compatibility.

Good point. Moved the encoding over to the other side.
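
For illustration, the difference between the two escape mechanisms can be seen with Python's built-in "hz" and "iso2022_jp" codecs. This is only a sketch: the sample strings are arbitrary choices and do not come from the spec or from this thread. It shows that HZ-GB-2312 stays entirely within printable ASCII bytes, while ISO-2022-JP switches character sets with the ESC control code (0x1B), yet both carry non-ASCII characters in bytes from the 0x20..0x7E range.

```python
# Illustrative sketch; the sample strings below are arbitrary assumptions.
hz_bytes = "中文".encode("hz")            # HZ-GB-2312: "~{" and "~}" switch modes
iso_bytes = "日本".encode("iso2022_jp")   # ISO-2022-JP: ESC sequences switch modes


def printable_ascii(data: bytes) -> bool:
    """True if every byte lies in the printable ASCII range 0x20..0x7E."""
    return all(0x20 <= b <= 0x7E for b in data)


# HZ never leaves the printable ASCII range, escapes included.
hz_is_printable = printable_ascii(hz_bytes)

# ISO-2022-JP uses the ESC control code, but the bytes that carry the
# actual characters are printable ASCII bytes as well.
iso_has_esc = 0x1B in iso_bytes
iso_payload_printable = printable_ascii(iso_bytes.replace(b"\x1b", b""))

print(hz_bytes, hz_is_printable)
print(iso_bytes, iso_has_esc, iso_payload_printable)
```

In both cases a decoder that ignores the declared encoding would read the character data as ordinary ASCII text, which is why the choice of escape byte does not matter here.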

> The added note certainly helps, but it is vague (does "[m]ost of these 
> encodings" mean "all the encodings mentioned above apart from UTF-32"?) 
> and inaccurate (Philip Taylor's example does not rely on "bugs").
> Given that the set of encodings is open-ended, I still think it would be 
> preferable to make the rationale (a definition of what makes an encoding 
> problematic) primary and mention actual encodings as examples. This 
> could give something like the following: "Encodings in which a series of 
> bytes in the range 0x20..0x7E may encode characters other than the 
> corresponding characters in the range U+20..U+7E represent a potential 
> security vulnerability since a browser that does not support the 
> encoding (or does not support the label used to declare the encoding, or 
> does not use the same mechanism to detect the encoding of unlabelled 
> content) might end up interpreting technically benign plain text content 
> as HTML tags and JavaScript.  In particular, this applies to encodings 
> in which the bytes corresponding to '<script>' in ASCII may encode a 
> different string. Authors should not use such encodings, which are known 
> to include....  In addition, authors should not use UTF-32 ...." 
> Alternatively, fixing the current note would help and might be 
> sufficient, albeit not ideal.

I've reworded the spec based on your suggestion. Thanks!
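
As a rough illustration of the concern described in the quoted suggestion, the sketch below decodes the same bytes twice with Python's built-in codecs: once as HZ-GB-2312 and once as plain ASCII, standing in for a browser that does not support the declared encoding. The payload is a made-up example, not one taken from the thread or the spec.

```python
# Hypothetical payload: inside the HZ "~{ ... ~}" markers, each pair of
# printable bytes is one GB 2312 character, so these bytes are plain text.
payload = b"~{<script>~}"

# A decoder that supports HZ-GB-2312 sees four CJK characters and no markup.
as_hz = payload.decode("hz")

# A decoder that falls back to an ASCII-compatible reading sees a tag.
as_ascii = payload.decode("ascii")

hz_has_tag = "<script>" in as_hz
ascii_has_tag = "<script>" in as_ascii

print(repr(as_hz), hz_has_tag)
print(repr(as_ascii), ascii_has_tag)
```

The same bytes are benign text under one interpretation and an opening script tag under the other, which is the mismatch the proposed wording warns about.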

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
