[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

Thu Oct 22 20:20:01 PDT 2009

On Wed, 21 Oct 2009, Øistein E. Andersen wrote:
> 
> ASCII-compatibility:
> The note in ‘2.1.5 Character encodings’ seems to say that ‘variants of
> ISO-2022’ (presumably including common ones like ISO-2022-CN, ISO-2022-KR and
> ISO-2022-JP) are ASCII-compatible, whereas HZ-GB-2312 is not, and I cannot
> find anything in Section 2.1.5 that would explain this difference.

HZ-GB-2312 uses the byte ASCII uses for "~" as the escape character. 
ISO-2022-* uses the control codes. That's the difference.
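As a rough illustration of that difference, here is a minimal Python sketch. 
The sample characters are arbitrary, and Python's built-in "hz" and 
"iso2022_jp" codecs are only stand-ins for whatever a user agent implements.

```python
# Sketch: compare the bytes that signal a mode switch in HZ-GB-2312 and in
# one ISO-2022 variant. The sample characters are arbitrary illustrations.
hz_bytes = "中".encode("hz")            # HZ switches modes with "~{" and "~}"
jp_bytes = "日".encode("iso2022_jp")    # ISO-2022-JP switches with ESC sequences

ESC = 0x1B

# HZ: the switch consists of printable ASCII bytes, starting with "~" (0x7E).
hz_uses_tilde = hz_bytes.startswith(b"~{")
hz_uses_controls = any(b < 0x20 for b in hz_bytes)

# ISO-2022-JP: the switch starts with the ESC control code (0x1B).
jp_uses_esc = jp_bytes[0] == ESC

print(hz_bytes, hz_uses_tilde, hz_uses_controls)
print(jp_bytes, jp_uses_esc)
```

In the HZ output every byte is a printable ASCII byte, which is why "~" 
cannot simply mean "~" in that encoding.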

> Discouraged encodings:
> ‘4.2.5.5 Specifying the document's character encoding’ advises against
> certain encodings. In particular:
> 
> > Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212
> > (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on
> > EBCDIC.
> 
> It is not clear what this means (e.g., the character set JIS_C6226-1983 in
> any encoding, or only when encoded alone according to RFC1345 as described
> above); 

This is talking about character encodings, not character sets. 
"JIS_C6226-1983" is a registered character encoding in the IANA registry.

> the list of discouraged encodings seems conspicuously short if it is 
> supposed to be complete; and the lack of rationale makes it difficult to 
> understand why these encodings are considered particularly harmful 
> (JIS_C6226-1983 v. JIS_C6226-1978 or ISO-2022 v. HZ, to mention but two 
> at least initially puzzling cases).

The reason for including these is to discourage encodings known to have 
security issues. I've added HZ-GB-2312, which can be used in a similarly 
dangerous fashion. (Basically the danger is that a site and a user agent 
can disagree about how the same bytes decode: the site interprets them in 
a way that looks safe, while the user agent autodetects a different 
encoding. That would allow those encodings to be used to smuggle <script> 
elements past a naive whitelisting filter that thinks the input is safe.)
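To make the mismatch concrete, here is a minimal Python sketch. The filter 
function and its name are hypothetical, and windows-1252 is just one 
example of what a user agent might decode the bytes as; this is not how any 
particular site or browser is implemented.

```python
# Sketch of the mismatch: a hypothetical filter that trusts the declared
# encoding, and a user agent that decodes the same bytes differently.

def naive_filter_allows(raw: bytes, declared_encoding: str) -> bool:
    """Hypothetical whitelist check: accept input whose decoded text has no '<'."""
    text = raw.decode(declared_encoding)
    return "<" not in text

# Between "~{" and "~}", HZ treats each pair of bytes as one GB2312 character,
# so the ASCII bytes of "<script>" decode as four Chinese characters.
payload = b"~{<script>~}"

site_view = payload.decode("hz")        # what the filter sees
ua_view = payload.decode("cp1252")      # a UA that does not apply HZ

print(naive_filter_allows(payload, "hz"))   # the filter's verdict
print("<script>" in ua_view)                # whether the UA sees markup
```

The filter and the user agent are both behaving consistently; the problem 
is only that they were handed different interpretations of the same bytes.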

> It might be better to say *why* particular encodings are better avoided, 
> whether or not the list of discouraged encodings be presented as 
> definitive.

I've added a note.

> (Incidentally, this advice probably deserves not to be ‘hidden’ in a 
> section nominally reserved for character encoding *declaration* issues.)

Yeah. I considered moving it to the Writing HTML documents section, but 
that one doesn't apply to conformance checkers, so it ends up being more 
of a pain, since the advice would have to be split into multiple pieces so 
that it applied appropriately. It's not a big deal.

> Minor grammar detail in 4.2.5.5:
> > Conformance checkers may advise against authors using legacy encodings.
> 
> This is ambiguous.  It should probably be ‘advise against authors’ using
> legacy encodings’  or better ‘advise authors against using legacy
> encodings’.

Fixed.

On Fri, 23 Oct 2009, NARUSE, Yui wrote:
> >>>
> >>> Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212 
> >>> (JIS_X0212-1990), encodings based on ISO-2022, and encodings based 
> >>> on EBCDIC.
> 
> First, JIS-X-0208 and JIS-X-0212 are not in IANA Charsets, moreover 
> those correct names as spec are JIS X 0208 and JIS X 0212.

On Thu, 22 Oct 2009, Øistein E. Andersen wrote:
>
> I am not sure what you mean; they are both listed at
> <http://www.iana.org/assignments/character-sets>:
> 
> Name: JIS_C6226-1983                                     [RFC1345,KXS2]
> MIBenum: 63
> Source: ECMA registry
> Alias: iso-ir-87
> Alias: x0208
> Alias: JIS_X0208-1983
> Alias: csISO87JISX0208
> 
> Name: JIS_X0212-1990                                     [RFC1345,KXS2]
> MIBenum: 98
> Source: ECMA registry
> Alias: x0212
> Alias: iso-ir-159
> Alias: csISO159JISX02121990

On Fri, 23 Oct 2009, NARUSE, Yui wrote:
> 
> Where is the word "JIS-X-0208" ?
> Where is the word "JIS-X-0212" ?

The exact string isn't there; that's why I included the preferred MIME 
names in brackets in the spec.

On Fri, 23 Oct 2009, NARUSE, Yui wrote:
>
> Second, JIS_C6226-1983, JIS_X0212-1990, and EBCDICs are not
> ASCII compatible. So they are out of discouraged; mustn't use.

You can use non-ASCII-compatible encodings (e.g. UTF-16).

> Finally, Why ISO 2022 series is discouraged is not clear.

Hopefully this is clear now.

> Anyway, most of charsets defined RFC 1345 are not clear.
> Conversion table between Unicode is needed.

On Thu, 22 Oct 2009, Øistein E. Andersen wrote:
> 
> > moreover those correct names as spec are JIS X 0208 and JIS X 0212.
> 
> (The IANA registry is internally inconsistent and often disagrees with 
> official standards when it comes to capitalisation, dashes/hyphens, 
> underscores and spaces, so it is difficult to get this right. Please 
> excuse me for not always paying due attention to such details in 
> e-mails. Of course, the specifications should follow either IANA or the 
> official standard as appropriate, depending on what it is referring to.)
> 
> > Second, JIS_C6226-1983, JIS_X0212-1990, and EBCDICs are not ASCII 
> > compatible. So they are out of discouraged; mustn't use.
> 
> EBCDIC is clearly not ASCII-compatible and may be unique amongst the 
> character sets in the IANA registry in providing the full ASCII 
> repertoire in a different arrangement.
> 
> JIS_C6226-1983 and JIS_X0212-1990 as defined in RFC1345 (i.e., on their 
> own) do not contain basic ASCII characters at all, so it makes little 
> sense to use them for HTML documents without adding ASCII or the 
> ASCII-based JIS C 6220-1969, which would give something like EUC-JP or 
> ISO-2022-JP. JIS_C6226-1983 contains wide versions of ASCII characters, 
> but those are not interpreted as HTML mark-up (unless I am mistaken). 
> JIS_X0212-1990 does not contain ASCII, kana or basic kanji, so it is of 
> extremely limited usefulness on its own even in a plain-text setting.  
> Warning against completely useless encodings seems pointless.
> 
> Many other encodings in the IANA registry are ASCII-incompatible in 
> different ways; what I do not understand is what makes the ones 
> currently mentioned in the HTML5 draft particularly harmful.
> 
> > Finally, Why ISO 2022 series is discouraged is not clear.
> 
> We agree on this point.
> 
> > Anyway, most of charsets defined RFC 1345 are not clear. Conversion 
> > table between [those charsets and] Unicode is needed.
> 
> Quite.  Anne van Kesteren, I and several others are currently trying to 
> document how browsers handle different encodings at 
> <http://wiki.whatwg.org/wiki/Web_Encodings>, and defining mappings to 
> Unicode is one of the goals.  Your contribution would be much 
> appreciated.

Good luck with that. It's much-needed work. 

On Thu, 22 Oct 2009, Philip Taylor wrote:
> 
> The string "숍訊昱穿" encoded as ISO-2022-KR is the bytes 0e 3c 73 
> 63 72 69 70 74 3e. A UA that doesn't support ISO-2022-KR (e.g. Chrome, 
> when I last checked) will decode it as Windows-1252 and get the string 
> "<script>", which is bad. So a site that uses ISO-2022-KR is very likely 
> to expose some users to XSS attacks, which seems like a good reason to 
> discourage that encoding. The same applies to other ISO-2022 encodings.

Indeed.
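The byte sequence quoted above can be checked with a short Python sketch. 
One assumption: the quoted bytes omit the ESC $ ) C designator that a 
complete ISO-2022-KR stream carries, so it is added here (with a closing 
shift-in) to let Python's "iso2022_kr" codec decode them as KS X 1001.

```python
# The nine bytes quoted in the message, decoded two ways.
payload = bytes.fromhex("0e 3c 73 63 72 69 70 74 3e")

# Assumption: prepend the ISO-2022-KR designator (ESC $ ) C) and append
# shift-in (0x0F) so the codec treats the shifted bytes as KS X 1001 pairs.
stream = b"\x1b$)C" + payload + b"\x0f"

as_korean = stream.decode("iso2022_kr")   # a UA that supports ISO-2022-KR
as_cp1252 = payload.decode("cp1252")      # a UA that falls back to Windows-1252

print(len(as_korean), "<" in as_korean)
print(repr(as_cp1252))
```

The Windows-1252 reading keeps the shift-out byte as an invisible control 
character followed by the literal markup.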

On Thu, 22 Oct 2009, Øistein E. Andersen wrote:
> 
> If that is the reason, at least HZ encoding would seem to be affected as 
> well. Explicitly discouraging a more or less random subset of the 
> problematic encodings without providing rationale makes it difficult to 
> assess whether or not other, somewhat similar, encodings should be 
> avoided as well, which was the main issue I wanted to raise.

Hopefully this is somewhat addressed now.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'