[html5] r4307 - [e] (0) Reword the stuff about authors not using encodings to make more sense.

Fri Oct 23 15:26:19 PDT 2009

Author: ianh
Date: 2009-10-23 15:26:15 -0700 (Fri, 23 Oct 2009)
New Revision: 4307

Modified:
   complete.html
   index
   source
Log:
[e] (0) Reword the stuff about authors not using encodings to make more sense.

Modified: complete.html
===================================================================

--- complete.html	2009-10-23 22:16:03 UTC (rev 4306)
+++ complete.html	2009-10-23 22:26:15 UTC (rev 4307)
@@ -2075,12 +2075,11 @@
   correspond to single-byte sequences that map to the same Unicode
   characters as those bytes in ANSI_X3.4-1968 (US-ASCII). <a href=#refsRFC1345>[RFC1345]</a></p>
 
-  <p class=note>This includes such encodings as Shift_JIS and
-  variants of ISO-2022, even though it is possible in these encodings
-  for bytes like 0x70 to be part of longer sequences that are
-  unrelated to their interpretation as ASCII. It excludes such
-  encodings as UTF-7, UTF-16, HZ-GB-2312, GSM03.38, and EBCDIC
-  variants.</p>
+  <p class=note>This includes such encodings as Shift_JIS,
+  HZ-GB-2312, and variants of ISO-2022, even though it is possible in
+  these encodings for bytes like 0x70 to be part of longer sequences
+  that are unrelated to their interpretation as ASCII. It excludes
+  such encodings as UTF-7, UTF-16, GSM03.38, and EBCDIC variants.</p>
 
   <!--
    We'll have to change that if anyone comes up with a way to have a
@@ -11881,13 +11880,35 @@
   state</a>, then the character encoding used must be an
   <a href=#ascii-compatible-character-encoding>ASCII-compatible character encoding</a>.</p>
 
-  <p>Authors should not use JIS_C6226-1983<!-- aka JIS-X-0208,
-  x-JIS0208 -->, JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!--
-  has crazy handling of ASCII "~" -->, encodings based on ISO-2022<!--
+  <p>Authors are encouraged to use UTF-8. Conformance checkers may
+  advise authors against using legacy encodings.</p>
+
+  <div class=impl>
+
+  <p>Authoring tools should default to using UTF-8 for newly-created
+  documents.</p>
+
+  </div>
+
+  <p>Encodings in which a series of bytes in the range 0x20 to 0x7E
+  can encode characters other than the corresponding characters in the
+  range U+0020 to U+007E represent a potential security vulnerability:
+  a user agent that does not support the encoding (or does not support
+  the label used to declare the encoding, or does not use the same
+  mechanism to detect the encoding of unlabelled content as another
+  user agent) might end up interpreting technically benign plain text
+  content as HTML tags and JavaScript. In particular, this applies to
+  encodings in which the bytes corresponding to "<code title="">&lt;script&gt;</code>" in ASCII can encode a different
+  string. Authors should not use such encodings, which are known to
+  include JIS_C6226-1983<!-- aka JIS-X-0208, x-JIS0208 -->,
+  JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!-- has crazy
+  handling of ASCII "~" -->, encodings based on ISO-2022<!--
   http://krijnhoetmer.nl/irc-logs/whatwg/20090628#l-422 and
   http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-October/023797.html
-  -->, and encodings based on EBCDIC. Authors should not use UTF-32.
-  Authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings.
+  -->, and encodings based on EBCDIC. Furthermore, authors must not use
+  the CESU-8, UTF-7, BOCU-1 and SCSU encodings, which also fall into
+  this category, because these encodings were never intended for use
+  for Web content.
   <a href=#refsRFC1345>[RFC1345]</a><!-- for the JIS types -->
   <a href=#refsRFC1842>[RFC1842]</a><!-- HZ-GB-2312 -->
   <a href=#refsRFC1468>[RFC1468]</a><!-- ISO-2022-JP -->
@@ -11895,7 +11916,6 @@
   <a href=#refsRFC1554>[RFC1554]</a><!-- ISO-2022-JP-2 -->
   <a href=#refsRFC1922>[RFC1922]</a><!-- ISO-2022-CN and ISO-2022-CN-EXT -->
   <a href=#refsRFC1557>[RFC1557]</a><!-- ISO-2022-KR -->
-  <a href=#refsUNICODE>[UNICODE]</a>
   <a href=#refsCESU8>[CESU8]</a>
   <a href=#refsUTF7>[UTF7]</a>
   <a href=#refsBOCU1>[BOCU1]</a>
@@ -11903,26 +11923,9 @@
   <!-- no idea what to reference for EBCDIC, so... -->
   </p>
 
-  <p class=note>Most of these encodings are discouraged because of
-  security concerns. If a hostile user can contribute text to a site
-  using these encodings, bugs in the site's whitelisting filter or in
-  a user agent can easily lead to the filter interpreting the
-  contribution as "safe" while the user agent interprets the same
-  contribution as containing a <code><a href=#script>script</a></code> element. This would
-  enable cross-site scripting attacks. By avoiding these encodings,
-  and always providing a <a href=#character-encoding-declaration>character encoding declaration</a>,
-  an author is less likely to run into this kind of problem.</p>
+  <p>Authors should not use UTF-32, as the HTML5 encoding detection
+  algorithms intentionally do not distinguish it from UTF-16. <a href=#refsUNICODE>[UNICODE]</a></p>
 
-  <p>Authors are encouraged to use UTF-8. Conformance checkers may
-  advise authors against using legacy encodings.</p>
-
-  <div class=impl>
-
-  <p>Authoring tools should default to using UTF-8 for newly-created
-  documents.</p>
-
-  </div>
-
   <p class=note>Using non-UTF-8 encodings can have unexpected
   results on form submission and URL encodings, which use the
   <a href="#document's-character-encoding">document's character encoding</a> by default.</p>

Modified: index
===================================================================
--- index	2009-10-23 22:16:03 UTC (rev 4306)
+++ index	2009-10-23 22:26:15 UTC (rev 4307)
@@ -1885,12 +1885,11 @@
   correspond to single-byte sequences that map to the same Unicode
   characters as those bytes in ANSI_X3.4-1968 (US-ASCII). <a href=#refsRFC1345>[RFC1345]</a></p>
 
-  <p class=note>This includes such encodings as Shift_JIS and
-  variants of ISO-2022, even though it is possible in these encodings
-  for bytes like 0x70 to be part of longer sequences that are
-  unrelated to their interpretation as ASCII. It excludes such
-  encodings as UTF-7, UTF-16, HZ-GB-2312, GSM03.38, and EBCDIC
-  variants.</p>
+  <p class=note>This includes such encodings as Shift_JIS,
+  HZ-GB-2312, and variants of ISO-2022, even though it is possible in
+  these encodings for bytes like 0x70 to be part of longer sequences
+  that are unrelated to their interpretation as ASCII. It excludes
+  such encodings as UTF-7, UTF-16, GSM03.38, and EBCDIC variants.</p>
 
   <!--
    We'll have to change that if anyone comes up with a way to have a
@@ -11691,13 +11690,35 @@
   state</a>, then the character encoding used must be an
   <a href=#ascii-compatible-character-encoding>ASCII-compatible character encoding</a>.</p>
 
-  <p>Authors should not use JIS_C6226-1983<!-- aka JIS-X-0208,
-  x-JIS0208 -->, JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!--
-  has crazy handling of ASCII "~" -->, encodings based on ISO-2022<!--
+  <p>Authors are encouraged to use UTF-8. Conformance checkers may
+  advise authors against using legacy encodings.</p>
+
+  <div class=impl>
+
+  <p>Authoring tools should default to using UTF-8 for newly-created
+  documents.</p>
+
+  </div>
+
+  <p>Encodings in which a series of bytes in the range 0x20 to 0x7E
+  can encode characters other than the corresponding characters in the
+  range U+0020 to U+007E represent a potential security vulnerability:
+  a user agent that does not support the encoding (or does not support
+  the label used to declare the encoding, or does not use the same
+  mechanism to detect the encoding of unlabelled content as another
+  user agent) might end up interpreting technically benign plain text
+  content as HTML tags and JavaScript. In particular, this applies to
+  encodings in which the bytes corresponding to "<code title="">&lt;script&gt;</code>" in ASCII can encode a different
+  string. Authors should not use such encodings, which are known to
+  include JIS_C6226-1983<!-- aka JIS-X-0208, x-JIS0208 -->,
+  JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!-- has crazy
+  handling of ASCII "~" -->, encodings based on ISO-2022<!--
   http://krijnhoetmer.nl/irc-logs/whatwg/20090628#l-422 and
   http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-October/023797.html
-  -->, and encodings based on EBCDIC. Authors should not use UTF-32.
-  Authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings.
+  -->, and encodings based on EBCDIC. Furthermore, authors must not use
+  the CESU-8, UTF-7, BOCU-1 and SCSU encodings, which also fall into
+  this category, because these encodings were never intended for use
+  for Web content.
   <a href=#refsRFC1345>[RFC1345]</a><!-- for the JIS types -->
   <a href=#refsRFC1842>[RFC1842]</a><!-- HZ-GB-2312 -->
   <a href=#refsRFC1468>[RFC1468]</a><!-- ISO-2022-JP -->
@@ -11705,7 +11726,6 @@
   <a href=#refsRFC1554>[RFC1554]</a><!-- ISO-2022-JP-2 -->
   <a href=#refsRFC1922>[RFC1922]</a><!-- ISO-2022-CN and ISO-2022-CN-EXT -->
   <a href=#refsRFC1557>[RFC1557]</a><!-- ISO-2022-KR -->
-  <a href=#refsUNICODE>[UNICODE]</a>
   <a href=#refsCESU8>[CESU8]</a>
   <a href=#refsUTF7>[UTF7]</a>
   <a href=#refsBOCU1>[BOCU1]</a>
@@ -11713,26 +11733,9 @@
   <!-- no idea what to reference for EBCDIC, so... -->
   </p>
 
-  <p class=note>Most of these encodings are discouraged because of
-  security concerns. If a hostile user can contribute text to a site
-  using these encodings, bugs in the site's whitelisting filter or in
-  a user agent can easily lead to the filter interpreting the
-  contribution as "safe" while the user agent interprets the same
-  contribution as containing a <code><a href=#script>script</a></code> element. This would
-  enable cross-site scripting attacks. By avoiding these encodings,
-  and always providing a <a href=#character-encoding-declaration>character encoding declaration</a>,
-  an author is less likely to run into this kind of problem.</p>
+  <p>Authors should not use UTF-32, as the HTML5 encoding detection
+  algorithms intentionally do not distinguish it from UTF-16. <a href=#refsUNICODE>[UNICODE]</a></p>
 
-  <p>Authors are encouraged to use UTF-8. Conformance checkers may
-  advise authors against using legacy encodings.</p>
-
-  <div class=impl>
-
-  <p>Authoring tools should default to using UTF-8 for newly-created
-  documents.</p>
-
-  </div>
-
   <p class=note>Using non-UTF-8 encodings can have unexpected
   results on form submission and URL encodings, which use the
   <a href="#document's-character-encoding">document's character encoding</a> by default.</p>

Modified: source
===================================================================
--- source	2009-10-23 22:16:03 UTC (rev 4306)
+++ source	2009-10-23 22:26:15 UTC (rev 4307)
@@ -901,12 +901,11 @@
   characters as those bytes in ANSI_X3.4-1968 (US-ASCII). <a
   href="#refsRFC1345">[RFC1345]</a></p>
 
-  <p class="note">This includes such encodings as Shift_JIS and
-  variants of ISO-2022, even though it is possible in these encodings
-  for bytes like 0x70 to be part of longer sequences that are
-  unrelated to their interpretation as ASCII. It excludes such
-  encodings as UTF-7, UTF-16, HZ-GB-2312, GSM03.38, and EBCDIC
-  variants.</p>
+  <p class="note">This includes such encodings as Shift_JIS,
+  HZ-GB-2312, and variants of ISO-2022, even though it is possible in
+  these encodings for bytes like 0x70 to be part of longer sequences
+  that are unrelated to their interpretation as ASCII. It excludes
+  such encodings as UTF-7, UTF-16, GSM03.38, and EBCDIC variants.</p>
 
   <!--
    We'll have to change that if anyone comes up with a way to have a
@@ -12376,13 +12375,36 @@
   state</span>, then the character encoding used must be an
   <span>ASCII-compatible character encoding</span>.</p>
 
-  <p>Authors should not use JIS_C6226-1983<!-- aka JIS-X-0208,
-  x-JIS0208 -->, JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!--
-  has crazy handling of ASCII "~" -->, encodings based on ISO-2022<!--
+  <p>Authors are encouraged to use UTF-8. Conformance checkers may
+  advise authors against using legacy encodings.</p>
+
+  <div class="impl">
+
+  <p>Authoring tools should default to using UTF-8 for newly-created
+  documents.</p>
+
+  </div>
+
+  <p>Encodings in which a series of bytes in the range 0x20 to 0x7E
+  can encode characters other than the corresponding characters in the
+  range U+0020 to U+007E represent a potential security vulnerability:
+  a user agent that does not support the encoding (or does not support
+  the label used to declare the encoding, or does not use the same
+  mechanism to detect the encoding of unlabelled content as another
+  user agent) might end up interpreting technically benign plain text
+  content as HTML tags and JavaScript. In particular, this applies to
+  encodings in which the bytes corresponding to "<code
+  title="">&lt;script&gt;</code>" in ASCII can encode a different
+  string. Authors should not use such encodings, which are known to
+  include JIS_C6226-1983<!-- aka JIS-X-0208, x-JIS0208 -->,
+  JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!-- has crazy
+  handling of ASCII "~" -->, encodings based on ISO-2022<!--
   http://krijnhoetmer.nl/irc-logs/whatwg/20090628#l-422 and
   http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-October/023797.html
-  -->, and encodings based on EBCDIC. Authors should not use UTF-32.
-  Authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings.
+  -->, and encodings based on EBCDIC. Furthermore, authors must not use
+  the CESU-8, UTF-7, BOCU-1 and SCSU encodings, which also fall into
+  this category, because these encodings were never intended for use
+  for Web content.
   <a href="#refsRFC1345">[RFC1345]</a><!-- for the JIS types -->
   <a href="#refsRFC1842">[RFC1842]</a><!-- HZ-GB-2312 -->
   <a href="#refsRFC1468">[RFC1468]</a><!-- ISO-2022-JP -->
@@ -12390,7 +12412,6 @@
   <a href="#refsRFC1554">[RFC1554]</a><!-- ISO-2022-JP-2 -->
   <a href="#refsRFC1922">[RFC1922]</a><!-- ISO-2022-CN and ISO-2022-CN-EXT -->
   <a href="#refsRFC1557">[RFC1557]</a><!-- ISO-2022-KR -->
-  <a href="#refsUNICODE">[UNICODE]</a>
   <a href="#refsCESU8">[CESU8]</a>
   <a href="#refsUTF7">[UTF7]</a>
   <a href="#refsBOCU1">[BOCU1]</a>
@@ -12398,26 +12419,10 @@
   <!-- no idea what to reference for EBCDIC, so... -->
   </p>
 
-  <p class="note">Most of these encodings are discouraged because of
-  security concerns. If a hostile user can contribute text to a site
-  using these encodings, bugs in the site's whitelisting filter or in
-  a user agent can easily lead to the filter interpreting the
-  contribution as "safe" while the user agent interprets the same
-  contribution as containing a <code>script</code> element. This would
-  enable cross-site scripting attacks. By avoiding these encodings,
-  and always providing a <span>character encoding declaration</span>,
-  an author is less likely to run into this kind of problem.</p>
+  <p>Authors should not use UTF-32, as the HTML5 encoding detection
+  algorithms intentionally do not distinguish it from UTF-16. <a
+  href="#refsUNICODE">[UNICODE]</a></p>
 
-  <p>Authors are encouraged to use UTF-8. Conformance checkers may
-  advise authors against using legacy encodings.</p>
-
-  <div class="impl">
-
-  <p>Authoring tools should default to using UTF-8 for newly-created
-  documents.</p>
-
-  </div>
-
   <p class="note">Using non-UTF-8 encodings can have unexpected
   results on form submission and URL encodings, which use the
   <span>document's character encoding</span> by default.</p>
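
Illustration (not part of the commit): the reworded paragraph warns about encodings in which bytes in the range 0x20 to 0x7E can stand for characters other than their ASCII counterparts. The sketch below uses UTF-7, one of the encodings the text prohibits, to show how two readers can disagree about the same bytes. It relies on Python's built-in "utf-7" codec; the helper naive_filter_allows is hypothetical and stands in for any byte-level check that assumes ASCII semantics.

```python
# Minimal sketch, assuming a filter that inspects raw bytes as if they were ASCII.
# naive_filter_allows is a hypothetical helper, not something defined by the spec.

def naive_filter_allows(payload: bytes) -> bool:
    """Accept input that contains no '<' or '>' bytes."""
    return b"<" not in payload and b">" not in payload


# Every byte here is printable ASCII (0x20 to 0x7E), and none is '<' or '>'.
payload = b"+ADw-script+AD4-"

# A reader that treats the bytes as ASCII sees harmless-looking text.
as_ascii = payload.decode("ascii")

# A reader that treats the same bytes as UTF-7 sees markup.
as_utf7 = payload.decode("utf-7")

print(all(0x20 <= b <= 0x7E for b in payload))
print(naive_filter_allows(payload))
print(repr(as_ascii))
print(repr(as_utf7))
```

The same mismatch can run in the other direction, which is the case the new wording describes: text that is benign in the declared encoding may look like tags to a user agent that falls back to an ASCII-compatible interpretation.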