[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
Ian Hickson
ian at hixie.ch
Thu Oct 22 20:20:01 PDT 2009
On Wed, 21 Oct 2009, Øistein E. Andersen wrote:
>
> ASCII-compatibility:
> The note in 2.1.5 Character encodings seems to say that variants of
> ISO-2022 (presumably including common ones like ISO-2022-CN, ISO-2022-KR and
> ISO-2022-JP) are ASCII-compatible, whereas HZ-GB-2312 is not, and I cannot
> find anything in Section 2.1.5 that would explain this difference.
HZ-GB-2312 uses the byte ASCII uses for "~" as the escape character.
ISO-2022-* uses the control codes. That's the difference.
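A minimal Python sketch of that difference, assuming the standard library's "hz" and "iso2022_jp" codecs behave like the encodings discussed here. The sample strings are illustrative and not taken from the thread.

```python
# HZ-GB-2312 switches modes with printable ASCII bytes: "~{" opens
# GB 2312 mode and "~}" closes it, so byte 0x7E does not simply mean "~".
hz_bytes = b"~{VPND~}"
as_hz = hz_bytes.decode("hz")        # two Chinese characters
as_ascii = hz_bytes.decode("ascii")  # eight printable ASCII characters

# ISO-2022-JP switches modes with escape sequences that begin with the
# control code ESC (0x1B), which has no printable ASCII meaning.
jp_bytes = "日本".encode("iso2022_jp")

print(repr(as_hz), repr(as_ascii))
print(jp_bytes)
```

Every byte of the HZ sample is a printable ASCII character, which is why a decoder that assumes ASCII reads it as different text rather than rejecting it.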
> Discouraged encodings:
> 4.2.5.5 Specifying the document's character encoding advises against
> certain encodings. In particular:
>
> > Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212
> > (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on
> > EBCDIC.
>
> It is not clear what this means (e.g., the character set JIS_C6226-1983 in
> any encoding, or only when encoded alone according to RFC1345 as described
> above);
This is talking about character encodings, not character sets.
"JIS_C6226-1983" is a registered character encoding in the IANA registry.
> the list of discouraged encodings seems conspicuously short if it is
> supposed to be complete; and the lack of rationale makes it difficult to
> understand why these encodings are considered particularly harmful
> (JIS_C6226-1983 v. JIS_C6226-1978 or ISO-2022 v. HZ, to mention but two
> at least initially puzzling cases).
The reason for including these is to discourage encodings known to have
security issues. I've added HZ-GB-2312, which can be used in a similarly
dangerous fashion. (Basically the danger for user agents is in an attacker
using an encoding that a user agent could autodetect, while a site
interprets the bytes safely; that would allow those encodings to be used
to smuggle <script> elements in a way that a naive whitelisting filter
would think is safe.)
> It might be better to say *why* particular encodings are better avoided,
> whether or not the list of discouraged encodings be presented as
> definitive.
I've added a note.
> (Incidentally, this advice probably deserves not to be hidden in a
> section nominally reserved for character encoding *declaration* issues.)
Yeah. I considered moving it to the Writing HTML documents section, but
that one doesn't apply to conformance checkers, so it ends up being more
of a pain, since the advice would have to be split into multiple pieces so
that it applied appropriately. It's not a big deal.
> Minor grammar detail in 4.2.5.5:
> > Conformance checkers may advise against authors using legacy encodings.
>
> This is ambiguous. It should probably be "advise against authors' using
> legacy encodings" or, better, "advise authors against using legacy
> encodings".
Fixed.
On Fri, 23 Oct 2009, NARUSE, Yui wrote:
> >>>
> >>> Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212
> >>> (JIS_X0212-1990), encodings based on ISO-2022, and encodings based
> >>> on EBCDIC.
>
> First, JIS-X-0208 and JIS-X-0212 are not in the IANA charset registry;
> moreover, their correct names per the standards are JIS X 0208 and JIS X 0212.
On Thu, 22 Oct 2009, Øistein E. Andersen wrote:
>
> I am not sure what you mean; they are both listed at
> <http://www.iana.org/assignments/character-sets>:
>
> Name: JIS_C6226-1983 [RFC1345,KXS2]
> MIBenum: 63
> Source: ECMA registry
> Alias: iso-ir-87
> Alias: x0208
> Alias: JIS_X0208-1983
> Alias: csISO87JISX0208
>
> Name: JIS_X0212-1990 [RFC1345,KXS2]
> MIBenum: 98
> Source: ECMA registry
> Alias: x0212
> Alias: iso-ir-159
> Alias: csISO159JISX02121990
On Fri, 23 Oct 2009, NARUSE, Yui wrote:
>
> Where is the word "JIS-X-0208" ?
> Where is the word "JIS-X-0212" ?
The exact string isn't there; that's why I included the preferred MIME
names in brackets in the spec.
On Fri, 23 Oct 2009, NARUSE, Yui wrote:
>
> Second, JIS_C6226-1983, JIS_X0212-1990, and the EBCDIC encodings are not
> ASCII-compatible, so they are not merely discouraged; they must not be used.
You can use non-ASCII-compatible encodings (e.g. UTF-16).
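As a rough Python illustration of what "non-ASCII-compatible" means at the byte level, here is the same markup in UTF-16 and in one EBCDIC code page. The choice of "cp037" and of the sample string is an assumption made for the example; the thread does not name a specific EBCDIC variant.

```python
markup = "<p>"

# UTF-16 (little-endian here) interleaves NUL bytes with the ASCII values.
utf16_bytes = markup.encode("utf-16-le")

# cp037 is one EBCDIC code page shipped with Python; it places these
# characters at byte values that differ from ASCII.
ebcdic_bytes = markup.encode("cp037")

for name, data in (("utf-16-le", utf16_bytes), ("cp037", ebcdic_bytes)):
    # The final column reports whether the bytes equal the ASCII encoding.
    print(name, data.hex(), data == markup.encode("ascii"))
```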
> Finally, why the ISO 2022 series is discouraged is not clear.
Hopefully this is clear now.
> Anyway, most of the charsets defined in RFC 1345 are not clear.
> A conversion table between them and Unicode is needed.
On Thu, 22 Oct 2009, Øistein E. Andersen wrote:
>
> > moreover, their correct names per the standards are JIS X 0208 and JIS X 0212.
>
> (The IANA registry is internally inconsistent and often disagrees with
> official standards when it comes to capitalisation, dashes/hyphens,
> underscores and spaces, so it is difficult to get this right. Please
> excuse me for not always paying due attention to such details in
> e-mails. Of course, the specifications should follow either IANA or the
> official standard as appropriate, depending on what it is referring to.)
>
> > Second, JIS_C6226-1983, JIS_X0212-1990, and the EBCDIC encodings are not
> > ASCII-compatible, so they are not merely discouraged; they must not be used.
>
> EBCDIC is clearly not ASCII-compatible and may be unique amongst the
> character sets in the IANA registry in providing the full ASCII
> repertoire in a different arrangement.
>
> JIS_C6226-1983 and JIS_X0212-1990 as defined in RFC1345 (i.e., on their
> own) do not contain basic ASCII characters at all, so it makes little
> sense to use them for HTML documents without adding ASCII or the
> ASCII-based JIS C 6220-1969, which would give something like EUC-JP or
> ISO-2022-JP. JIS_C6226-1983 contains wide versions of ASCII characters,
> but those are not interpreted as HTML mark-up (unless I am mistaken).
> JIS_X0212-1990 does not contain ASCII, kana or basic kanji, so it is of
> extremely limited usefulness on its own even in a plain-text setting.
> Warning against completely useless encodings seems pointless.
>
> Many other encodings in the IANA registry are ASCII-incompatible in
> different ways; what I do not understand is what makes the ones
> currently mentioned in the HTML5 draft particularly harmful.
>
> > Finally, why the ISO 2022 series is discouraged is not clear.
>
> We agree on this point.
>
> > Anyway, most of the charsets defined in RFC 1345 are not clear. A conversion
> > table between [those charsets and] Unicode is needed.
>
> Quite. Anne van Kesteren, I and several others are currently trying to
> document how browsers handle different encodings at
> <http://wiki.whatwg.org/wiki/Web_Encodings>, and defining mappings to
> Unicode is one of the goals. Your contribution would be much
> appreciated.
Good luck with that. It's much-needed work.
On Thu, 22 Oct 2009, Philip Taylor wrote:
>
> A four-character string (one hangul syllable followed by three hanja),
> encoded as ISO-2022-KR, is the bytes 0e 3c 73 63 72 69 70 74 3e. A UA
> that doesn't support ISO-2022-KR (e.g. Chrome,
> when I last checked) will decode it as Windows-1252 and get the string
> "<script>", which is bad. So a site that uses ISO-2022-KR is very likely
> to expose some users to XSS attacks, which seems like a good reason to
> discourage that encoding. The same applies to other ISO-2022 encodings.
Indeed.
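The scenario can be sketched in Python roughly as follows. The filter function is hypothetical, and the ISO-2022-KR designator ESC $ ) C is prepended as an assumption: the bytes quoted above omit it, but Python's "iso2022_kr" decoder needs it before the SO byte (0e) selects KS X 1001.

```python
# Bytes from the message above, with the designator added in front.
DESIGNATOR = b"\x1b$)C"
payload = bytes.fromhex("0e 3c 73 63 72 69 70 74 3e")
wire = DESIGNATOR + payload

def naive_filter_allows(data: bytes, encoding: str) -> bool:
    """Hypothetical filter: accept input whose decoded text has no '<'."""
    return "<" not in data.decode(encoding)

# The site decodes as ISO-2022-KR and sees four harmless characters.
site_view = wire.decode("iso2022_kr")
allowed = naive_filter_allows(wire, "iso2022_kr")

# A user agent without ISO-2022-KR support falls back to Windows-1252
# and sees the same bytes as markup.
ua_view = wire.decode("cp1252")

print(len(site_view), allowed, repr(ua_view))
```

The filter and the user agent disagree only about the encoding, which is enough for the same bytes to pass the check and still reach the user as a script element.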
On Thu, 22 Oct 2009, Øistein E. Andersen wrote:
>
> If that is the reason, at least HZ encoding would seem to be affected as
> well. Explicitly discouraging a more or less random subset of the
> problematic encodings without providing rationale makes it difficult to
> assess whether or not other, somewhat similar, encodings should be
> avoided as well, which was the main issue I wanted to raise.
Hopefully this is somewhat addressed now.
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'