[html5] r4307 - [e] (0) Reword the stuff about authors not using encodings to make more sense.
whatwg at whatwg.org
Fri Oct 23 15:26:19 PDT 2009
Author: ianh
Date: 2009-10-23 15:26:15 -0700 (Fri, 23 Oct 2009)
New Revision: 4307
Modified:
complete.html
index
source
Log:
[e] (0) Reword the stuff about authors not using encodings to make more sense.
Modified: complete.html
===================================================================
--- complete.html 2009-10-23 22:16:03 UTC (rev 4306)
+++ complete.html 2009-10-23 22:26:15 UTC (rev 4307)
@@ -2075,12 +2075,11 @@
correspond to single-byte sequences that map to the same Unicode
characters as those bytes in ANSI_X3.4-1968 (US-ASCII). <a href=#refsRFC1345>[RFC1345]</a></p>
- <p class=note>This includes such encodings as Shift_JIS and
- variants of ISO-2022, even though it is possible in these encodings
- for bytes like 0x70 to be part of longer sequences that are
- unrelated to their interpretation as ASCII. It excludes such
- encodings as UTF-7, UTF-16, HZ-GB-2312, GSM03.38, and EBCDIC
- variants.</p>
+ <p class=note>This includes such encodings as Shift_JIS,
+ HZ-GB-2312, and variants of ISO-2022, even though it is possible in
+ these encodings for bytes like 0x70 to be part of longer sequences
+ that are unrelated to their interpretation as ASCII. It excludes
+ such encodings as UTF-7, UTF-16, GSM03.38, and EBCDIC variants.</p>
<!--
We'll have to change that if anyone comes up with a way to have a
@@ -11881,13 +11880,35 @@
state</a>, then the character encoding used must be an
<a href=#ascii-compatible-character-encoding>ASCII-compatible character encoding</a>.</p>
- <p>Authors should not use JIS_C6226-1983<!-- aka JIS-X-0208,
- x-JIS0208 -->, JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!--
- has crazy handling of ASCII "~" -->, encodings based on ISO-2022<!--
+ <p>Authors are encouraged to use UTF-8. Conformance checkers may
+ advise authors against using legacy encodings.</p>
+
+ <div class=impl>
+
+ <p>Authoring tools should default to using UTF-8 for newly-created
+ documents.</p>
+
+ </div>
+
+ <p>Encodings in which a series of bytes in the range 0x20 to 0x7E
+ can encode characters other than the corresponding characters in the
+ range U+0020 to U+007E represent a potential security vulnerability:
+ a user agent that does not support the encoding (or does not support
+ the label used to declare the encoding, or does not use the same
+ mechanism to detect the encoding of unlabelled content as another
+ user agent) might end up interpreting technically benign plain text
+ content as HTML tags and JavaScript. In particular, this applies to
+ encodings in which the bytes corresponding to "<code title=""><script></code>" in ASCII can encode a different
+ string. Authors should not use such encodings, which are known to
+ include JIS_C6226-1983<!-- aka JIS-X-0208, x-JIS0208 -->,
+ JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!-- has crazy
+ handling of ASCII "~" -->, encodings based on ISO-2022<!--
http://krijnhoetmer.nl/irc-logs/whatwg/20090628#l-422 and
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-October/023797.html
- -->, and encodings based on EBCDIC. Authors should not use UTF-32.
- Authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings.
+ -->, and encodings based on EBCDIC. Furthermore, authors must not use
+ the CESU-8, UTF-7, BOCU-1 and SCSU encodings, which also fall into
+ this category, because these encodings were never intended for use
+ for Web content.
<a href=#refsRFC1345>[RFC1345]</a><!-- for the JIS types -->
<a href=#refsRFC1842>[RFC1842]</a><!-- HZ-GB-2312 -->
<a href=#refsRFC1468>[RFC1468]</a><!-- ISO-2022-JP -->
@@ -11895,7 +11916,6 @@
<a href=#refsRFC1554>[RFC1554]</a><!-- ISO-2022-JP-2 -->
<a href=#refsRFC1922>[RFC1922]</a><!-- ISO-2022-CN and ISO-2022-CN-EXT -->
<a href=#refsRFC1557>[RFC1557]</a><!-- ISO-2022-KR -->
- <a href=#refsUNICODE>[UNICODE]</a>
<a href=#refsCESU8>[CESU8]</a>
<a href=#refsUTF7>[UTF7]</a>
<a href=#refsBOCU1>[BOCU1]</a>
@@ -11903,26 +11923,9 @@
<!-- no idea what to reference for EBCDIC, so... -->
</p>
- <p class=note>Most of these encodings are discouraged because of
- security concerns. If a hostile user can contribute text to a site
- using these encodings, bugs in the site's whitelisting filter or in
- a user agent can easily lead to the filter interpreting the
- contribution as "safe" while the user agent interprets the same
- contribution as containing a <code><a href=#script>script</a></code> element. This would
- enable cross-site scripting attacks. By avoiding these encodings,
- and always providing a <a href=#character-encoding-declaration>character encoding declaration</a>,
- an author is less likely to run into this kind of problem.</p>
+ <p>Authors should not use UTF-32, as the HTML5 encoding detection
+ algorithms intentionally do not distinguish it from UTF-16. <a href=#refsUNICODE>[UNICODE]</a></p>
- <p>Authors are encouraged to use UTF-8. Conformance checkers may
- advise authors against using legacy encodings.</p>
-
- <div class=impl>
-
- <p>Authoring tools should default to using UTF-8 for newly-created
- documents.</p>
-
- </div>
-
<p class=note>Using non-UTF-8 encodings can have unexpected
results on form submission and URL encodings, which use the
<a href="#document's-character-encoding">document's character encoding</a> by default.</p>
Modified: index
===================================================================
--- index 2009-10-23 22:16:03 UTC (rev 4306)
+++ index 2009-10-23 22:26:15 UTC (rev 4307)
@@ -1885,12 +1885,11 @@
correspond to single-byte sequences that map to the same Unicode
characters as those bytes in ANSI_X3.4-1968 (US-ASCII). <a href=#refsRFC1345>[RFC1345]</a></p>
- <p class=note>This includes such encodings as Shift_JIS and
- variants of ISO-2022, even though it is possible in these encodings
- for bytes like 0x70 to be part of longer sequences that are
- unrelated to their interpretation as ASCII. It excludes such
- encodings as UTF-7, UTF-16, HZ-GB-2312, GSM03.38, and EBCDIC
- variants.</p>
+ <p class=note>This includes such encodings as Shift_JIS,
+ HZ-GB-2312, and variants of ISO-2022, even though it is possible in
+ these encodings for bytes like 0x70 to be part of longer sequences
+ that are unrelated to their interpretation as ASCII. It excludes
+ such encodings as UTF-7, UTF-16, GSM03.38, and EBCDIC variants.</p>
<!--
We'll have to change that if anyone comes up with a way to have a
@@ -11691,13 +11690,35 @@
state</a>, then the character encoding used must be an
<a href=#ascii-compatible-character-encoding>ASCII-compatible character encoding</a>.</p>
- <p>Authors should not use JIS_C6226-1983<!-- aka JIS-X-0208,
- x-JIS0208 -->, JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!--
- has crazy handling of ASCII "~" -->, encodings based on ISO-2022<!--
+ <p>Authors are encouraged to use UTF-8. Conformance checkers may
+ advise authors against using legacy encodings.</p>
+
+ <div class=impl>
+
+ <p>Authoring tools should default to using UTF-8 for newly-created
+ documents.</p>
+
+ </div>
+
+ <p>Encodings in which a series of bytes in the range 0x20 to 0x7E
+ can encode characters other than the corresponding characters in the
+ range U+0020 to U+007E represent a potential security vulnerability:
+ a user agent that does not support the encoding (or does not support
+ the label used to declare the encoding, or does not use the same
+ mechanism to detect the encoding of unlabelled content as another
+ user agent) might end up interpreting technically benign plain text
+ content as HTML tags and JavaScript. In particular, this applies to
+ encodings in which the bytes corresponding to "<code title=""><script></code>" in ASCII can encode a different
+ string. Authors should not use such encodings, which are known to
+ include JIS_C6226-1983<!-- aka JIS-X-0208, x-JIS0208 -->,
+ JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!-- has crazy
+ handling of ASCII "~" -->, encodings based on ISO-2022<!--
http://krijnhoetmer.nl/irc-logs/whatwg/20090628#l-422 and
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-October/023797.html
- -->, and encodings based on EBCDIC. Authors should not use UTF-32.
- Authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings.
+ -->, and encodings based on EBCDIC. Furthermore, authors must not use
+ the CESU-8, UTF-7, BOCU-1 and SCSU encodings, which also fall into
+ this category, because these encodings were never intended for use
+ for Web content.
<a href=#refsRFC1345>[RFC1345]</a><!-- for the JIS types -->
<a href=#refsRFC1842>[RFC1842]</a><!-- HZ-GB-2312 -->
<a href=#refsRFC1468>[RFC1468]</a><!-- ISO-2022-JP -->
@@ -11705,7 +11726,6 @@
<a href=#refsRFC1554>[RFC1554]</a><!-- ISO-2022-JP-2 -->
<a href=#refsRFC1922>[RFC1922]</a><!-- ISO-2022-CN and ISO-2022-CN-EXT -->
<a href=#refsRFC1557>[RFC1557]</a><!-- ISO-2022-KR -->
- <a href=#refsUNICODE>[UNICODE]</a>
<a href=#refsCESU8>[CESU8]</a>
<a href=#refsUTF7>[UTF7]</a>
<a href=#refsBOCU1>[BOCU1]</a>
@@ -11713,26 +11733,9 @@
<!-- no idea what to reference for EBCDIC, so... -->
</p>
- <p class=note>Most of these encodings are discouraged because of
- security concerns. If a hostile user can contribute text to a site
- using these encodings, bugs in the site's whitelisting filter or in
- a user agent can easily lead to the filter interpreting the
- contribution as "safe" while the user agent interprets the same
- contribution as containing a <code><a href=#script>script</a></code> element. This would
- enable cross-site scripting attacks. By avoiding these encodings,
- and always providing a <a href=#character-encoding-declaration>character encoding declaration</a>,
- an author is less likely to run into this kind of problem.</p>
+ <p>Authors should not use UTF-32, as the HTML5 encoding detection
+ algorithms intentionally do not distinguish it from UTF-16. <a href=#refsUNICODE>[UNICODE]</a></p>
- <p>Authors are encouraged to use UTF-8. Conformance checkers may
- advise authors against using legacy encodings.</p>
-
- <div class=impl>
-
- <p>Authoring tools should default to using UTF-8 for newly-created
- documents.</p>
-
- </div>
-
<p class=note>Using non-UTF-8 encodings can have unexpected
results on form submission and URL encodings, which use the
<a href="#document's-character-encoding">document's character encoding</a> by default.</p>
Modified: source
===================================================================
--- source 2009-10-23 22:16:03 UTC (rev 4306)
+++ source 2009-10-23 22:26:15 UTC (rev 4307)
@@ -901,12 +901,11 @@
characters as those bytes in ANSI_X3.4-1968 (US-ASCII). <a
href="#refsRFC1345">[RFC1345]</a></p>
- <p class="note">This includes such encodings as Shift_JIS and
- variants of ISO-2022, even though it is possible in these encodings
- for bytes like 0x70 to be part of longer sequences that are
- unrelated to their interpretation as ASCII. It excludes such
- encodings as UTF-7, UTF-16, HZ-GB-2312, GSM03.38, and EBCDIC
- variants.</p>
+ <p class="note">This includes such encodings as Shift_JIS,
+ HZ-GB-2312, and variants of ISO-2022, even though it is possible in
+ these encodings for bytes like 0x70 to be part of longer sequences
+ that are unrelated to their interpretation as ASCII. It excludes
+ such encodings as UTF-7, UTF-16, GSM03.38, and EBCDIC variants.</p>
<!--
We'll have to change that if anyone comes up with a way to have a
@@ -12376,13 +12375,36 @@
state</span>, then the character encoding used must be an
<span>ASCII-compatible character encoding</span>.</p>
- <p>Authors should not use JIS_C6226-1983<!-- aka JIS-X-0208,
- x-JIS0208 -->, JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!--
- has crazy handling of ASCII "~" -->, encodings based on ISO-2022<!--
+ <p>Authors are encouraged to use UTF-8. Conformance checkers may
+ advise authors against using legacy encodings.</p>
+
+ <div class="impl">
+
+ <p>Authoring tools should default to using UTF-8 for newly-created
+ documents.</p>
+
+ </div>
+
+ <p>Encodings in which a series of bytes in the range 0x20 to 0x7E
+ can encode characters other than the corresponding characters in the
+ range U+0020 to U+007E represent a potential security vulnerability:
+ a user agent that does not support the encoding (or does not support
+ the label used to declare the encoding, or does not use the same
+ mechanism to detect the encoding of unlabelled content as another
+ user agent) might end up interpreting technically benign plain text
+ content as HTML tags and JavaScript. In particular, this applies to
+ encodings in which the bytes corresponding to "<code
+ title=""><script></code>" in ASCII can encode a different
+ string. Authors should not use such encodings, which are known to
+ include JIS_C6226-1983<!-- aka JIS-X-0208, x-JIS0208 -->,
+ JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!-- has crazy
+ handling of ASCII "~" -->, encodings based on ISO-2022<!--
http://krijnhoetmer.nl/irc-logs/whatwg/20090628#l-422 and
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-October/023797.html
- -->, and encodings based on EBCDIC. Authors should not use UTF-32.
- Authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings.
+ -->, and encodings based on EBCDIC. Furthermore, authors must not use
+ the CESU-8, UTF-7, BOCU-1 and SCSU encodings, which also fall into
+ this category, because these encodings were never intended for use
+ for Web content.
<a href="#refsRFC1345">[RFC1345]</a><!-- for the JIS types -->
<a href="#refsRFC1842">[RFC1842]</a><!-- HZ-GB-2312 -->
<a href="#refsRFC1468">[RFC1468]</a><!-- ISO-2022-JP -->
@@ -12390,7 +12412,6 @@
<a href="#refsRFC1554">[RFC1554]</a><!-- ISO-2022-JP-2 -->
<a href="#refsRFC1922">[RFC1922]</a><!-- ISO-2022-CN and ISO-2022-CN-EXT -->
<a href="#refsRFC1557">[RFC1557]</a><!-- ISO-2022-KR -->
- <a href="#refsUNICODE">[UNICODE]</a>
<a href="#refsCESU8">[CESU8]</a>
<a href="#refsUTF7">[UTF7]</a>
<a href="#refsBOCU1">[BOCU1]</a>
@@ -12398,26 +12419,10 @@
<!-- no idea what to reference for EBCDIC, so... -->
</p>
- <p class="note">Most of these encodings are discouraged because of
- security concerns. If a hostile user can contribute text to a site
- using these encodings, bugs in the site's whitelisting filter or in
- a user agent can easily lead to the filter interpreting the
- contribution as "safe" while the user agent interprets the same
- contribution as containing a <code>script</code> element. This would
- enable cross-site scripting attacks. By avoiding these encodings,
- and always providing a <span>character encoding declaration</span>,
- an author is less likely to run into this kind of problem.</p>
+ <p>Authors should not use UTF-32, as the HTML5 encoding detection
+ algorithms intentionally do not distinguish it from UTF-16. <a
+ href="#refsUNICODE">[UNICODE]</a></p>
- <p>Authors are encouraged to use UTF-8. Conformance checkers may
- advise authors against using legacy encodings.</p>
-
- <div class="impl">
-
- <p>Authoring tools should default to using UTF-8 for newly-created
- documents.</p>
-
- </div>
-
<p class="note">Using non-UTF-8 encodings can have unexpected
results on form submission and URL encodings, which use the
<span>document's character encoding</span> by default.</p>
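
Illustration (not part of the commit): the reworded paragraph says that bytes in the range 0x20 to 0x7E can encode characters other than their ASCII counterparts, so one consumer may see benign text where another sees markup. A minimal Python sketch of that mismatch follows, using UTF-7, one of the encodings the text forbids. The payload and the function naive_filter_accepts are hypothetical and exist only for this example; the decoding relies on Python's built-in "utf-7" codec.

```python
# Hypothetical user-contributed payload. Every byte is printable ASCII.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"


def all_printable_ascii(data: bytes) -> bool:
    """True if every byte lies in the range 0x20 to 0x7E."""
    return all(0x20 <= b <= 0x7E for b in data)


def naive_filter_accepts(data: bytes) -> bool:
    """Hypothetical filter: reads the bytes as ASCII and rejects
    only literal angle brackets."""
    text = data.decode("ascii")
    return "<" not in text and ">" not in text


def decode_as_utf7(data: bytes) -> str:
    """What a consumer that treats the same bytes as UTF-7 would see."""
    return data.decode("utf-7")


if __name__ == "__main__":
    print(all_printable_ascii(payload))   # bytes stay within 0x20..0x7E
    print(naive_filter_accepts(payload))  # the ASCII view contains no "<" or ">"
    print(decode_as_utf7(payload))        # the UTF-7 view contains a script element
```

The same byte string passes the ASCII-based check yet decodes to a script element under UTF-7. This is the kind of disagreement between a filter and a user agent that the paragraph describes.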