[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

Fri Oct 23 13:21:07 PDT 2009

On 23 Oct 2009, at 04:20, Ian Hickson wrote:

> On Wed, 21 Oct 2009, Øistein E. Andersen wrote:
>>
>
>> ASCII-compatibility:
>> The note in ‘2.1.5 Character encodings’ seems to say that [...]
>> ISO-2022’[-*] are ASCII-compatible, whereas HZ-GB-2312 is not, and I
>> cannot find anything in Section 2.1.5 that would explain this difference.
>
> HZ-GB-2312 uses the byte ASCII uses for "~" as the escape character.
> ISO-2022-* uses the control codes. That's the difference.

'~'/0x7E is not (and should not be, as far as I can tell) relevant for  
HTML5's concept of ASCII compatibility.

>> Discouraged encodings: [...]
>>
>>> Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212
>>> (JIS_X0212-1990), [...]
>>
>> It is not clear what this means [...]
>
> This is talking about character encodings, not character sets.
> "JIS_C6226-1983" is a registered character encoding in the IANA registry.

(This is less confusing now since HTML5 only deals with character
encodings and the strings match those in the IANA registry, as
suggested by Yui Naruse.)

>> the list of discouraged encodings seems conspicuously short if it is
>> supposed to be complete; and the lack of rationale makes it difficult to
>> understand why these encodings are considered particularly harmful
>> (JIS_C6226-1983 v. JIS_C6226-1978 or ISO-2022 v. HZ, to mention but two
>> at least initially puzzling cases).
>
> The reason for including these is to discourage encodings known to have
> security issues. I've added HZ-GB-2312, which can be used in a similarly
> dangerous fashion. (Basically the danger for user agents is in an attacker
> using an encoding that a user agent could autodetect, while a site
> interprets the bytes safely; that would allow those encodings to be used
> to smuggle <script> elements in a way that a naive whitelisting filter
> would think is safe.)
>
>> It might be better to say *why* particular encodings are better avoided,
>> whether or not the list of discouraged encodings be presented as
>> definitive.
>
> I've added a note.
>
> [...]
>
> On Thu, 22 Oct 2009, Philip Taylor wrote:
>>
>> The string "[숍訊昱穿]" encoded as ISO-2022-KR is the bytes 0e 3c 73
>> 63 72 69 70 74 3e. A UA that doesn't support ISO-2022-KR (e.g. Chrome,
>> when I last checked) will decode it as Windows-1252 and get the string
>> "<script>", which is bad. So a site that uses ISO-2022-KR is very likely
>> to expose some users to XSS attacks, which seems like a good reason to
>> discourage that encoding. The same applies to other ISO-2022 encodings.
>
> [...]
>
> On Thu, 22 Oct 2009, Øistein E. Andersen wrote:
>>
>> If that is the reason, at least HZ encoding would seem to be affected as
>> well. Explicitly discouraging a more or less random subset of the
>> problematic encodings without providing rationale makes it difficult to
>> assess whether or not other, somewhat similar, encodings should be
>> avoided as well, which was the main issue I wanted to raise.
>
> Hopefully this is somewhat addressed now.

The added note certainly helps, but it is vague (does "[m]ost of these  
encodings" mean "all the encodings mentioned above apart from  
UTF-32"?) and inaccurate (Philip Taylor's example does not rely on  
"bugs").

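To make the point about Philip Taylor's example concrete, here is a
minimal Python sketch. It assumes only that Python's "cp1252" codec is a
fair stand-in for a user agent falling back to Windows-1252; the variable
names are illustrative and not taken from any specification.

```python
# Bytes quoted in Philip Taylor's message: SO (0x0E), then the bytes that
# spell "<script>" in ASCII.
payload = bytes.fromhex("0e 3c 73 63 72 69 70 74 3e")

# Assumption: a user agent without ISO-2022-KR support falls back to
# Windows-1252, approximated here by Python's "cp1252" codec.
fallback_text = payload.decode("cp1252")

# The fallback decoding is entirely well-defined (no decoder bug involved)
# and exposes a literal script tag after the leading control character.
contains_script_tag = "<script>" in fallback_text
print(repr(fallback_text))  # '\x0e<script>'
```

Nothing in this fallback path is erroneous behaviour: the bytes are simply
valid in both encodings with different meanings.
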
Given that the set of encodings is open-ended, I still think it would  
be preferable to make the rationale (a definition of what makes an  
encoding problematic) primary and mention actual encodings as  
examples. This could give something like the following: "Encodings in  
which a series of bytes in the range 0x20..0x7E may encode characters  
other than the corresponding characters in the range U+20..U+7E  
represent a potential security vulnerability since a browser that does  
not support the encoding (or does not support the label used to  
declare the encoding, or does not use the same mechanism to detect the  
encoding of unlabelled content) might end up interpreting technically  
benign plain text content as HTML tags and JavaScript.  In particular,  
this applies to encodings in which the bytes corresponding to  
'<script>' in ASCII may encode a different string. Authors should not  
use such encodings, which are known to include....  In addition,  
authors should not use UTF-32 ...." Alternatively, fixing the current  
note would help and might be sufficient, albeit not ideal.
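
The criterion suggested above can be probed mechanically. The following
Python sketch is only an illustration under stated assumptions: the helper
name decodes_as_ascii is hypothetical, Python's built-in "hz" and "cp1252"
codecs stand in for real browser decoders, and a single probe can show that
an encoding is affected but cannot prove that one is safe.

```python
def decodes_as_ascii(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes under `encoding` to the same text that
    plain ASCII decoding would give. (Hypothetical helper for illustration.)"""
    try:
        return data.decode(encoding) == data.decode("ascii")
    except UnicodeDecodeError:
        # Undecodable input certainly does not round-trip as ASCII text.
        return False


# Every byte below lies in 0x20..0x7E. In HZ, "~{" switches to two-byte GB
# mode and "~}" switches back, so the bytes in between are not read as ASCII.
hz_payload = b"~{<script>~}"

# Under HZ, the printable bytes do not decode to the ASCII characters.
hz_is_transparent = decodes_as_ascii(hz_payload, "hz")

# A decoder that ignores HZ (here Windows-1252) sees a literal script tag.
fallback_sees_tag = "<script>" in hz_payload.decode("cp1252")

# Outside an escape sequence, HZ does read the same bytes as ASCII.
plain_is_transparent = decodes_as_ascii(b"<script>", "hz")
```

A check of this kind tests the rationale directly instead of relying on a
fixed list of encoding names.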

I think one has to realise that a comprehensive list of problematic  
encodings is an elusive goal and act accordingly.

-- 
Øistein E. Andersen

PS: The following sentence makes little sense without (curly) quotes  
and apostrophes. In case they disappeared before you read it, please  
find it repeated below with (ASCII) quotes and apostrophes:

>> It should probably be "advise against authors' using legacy encodings"
>> or better "advise authors against using legacy encodings".

(The current text in the spec is fine.)