[html5] r7782 - [giow] (2) Strip a leading BOM from scripts in workers, if any. Also, use more o [...]
whatwg at whatwg.org
Fri Mar 29 11:45:28 PDT 2013
Author: ianh
Date: 2013-03-29 11:45:27 -0700 (Fri, 29 Mar 2013)
New Revision: 7782
Modified:
complete.html
index
source
Log:
[giow] (2) Strip a leading BOM from scripts in workers, if any. Also, use more of the encoding spec.
Fixing https://www.w3.org/Bugs/Public/show_bug.cgi?id=17839
Affected topics: DOM APIs, HTML, HTML Syntax and Parsing, Offline Web Applications, Workers
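Illustration (not part of the commit): the log and the diffs below rely on the difference between the "UTF-8 decoder", the "UTF-8 decode" algorithm, and the generic "decode" algorithm. The following is a minimal Python sketch of that difference. The function names utf8_decoder, utf8_decode, and decode are hypothetical names chosen here, and Python's errors="replace" only stands in for the error handling defined by the encoding spec; it is assumed, not guaranteed, to emit the same number of U+FFFD characters in every case.

```python
UTF8_BOM = b"\xef\xbb\xbf"

# BOM byte patterns and the encodings they select. The order is an assumption
# of this sketch: UTF-8 first, then big-endian and little-endian UTF-16.
BOM_TABLE = (
    (UTF8_BOM, "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
)


def utf8_decoder(data: bytes) -> str:
    """Plain decoder: invalid bytes become U+FFFD, and a leading BOM is kept
    in the output as U+FEFF."""
    return data.decode("utf-8", errors="replace")


def utf8_decode(data: bytes) -> str:
    """'UTF-8 decode': strip one leading UTF-8 BOM, if any, then run the
    plain decoder on the rest."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    return utf8_decoder(data)


def decode(data: bytes, fallback: str) -> str:
    """Generic 'decode': a leading BOM overrides the fallback encoding;
    without a BOM the fallback encoding is used."""
    for bom, encoding in BOM_TABLE:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding, errors="replace")
    return data.decode(fallback, errors="replace")
```

Under these assumptions, utf8_decode corresponds to the worker script, manifest, and event stream cases in this revision, utf8_decoder to the cookie and WebSocket close reason cases, and decode to external scripts, where the Content-Type or fallback encoding applies only when no BOM is present.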
Modified: complete.html
===================================================================
--- complete.html 2013-03-29 18:13:03 UTC (rev 7781)
+++ complete.html 2013-03-29 18:45:27 UTC (rev 7782)
@@ -3068,13 +3068,10 @@
<p class=note>This complexity results from the historical decision to define the DOM API in
terms of 16 bit (UTF-16) <a href=#code-unit title="code unit">code units</a>, rather than in terms of <a href=#unicode-character title="Unicode character">Unicode characters</a>.</p>
- <p>When a byte stream is to be <dfn id=decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error handling</dfn>, the user agent
- must return the result of running the <a href=#utf-8-decoder>utf-8 decoder</a> on that byte stream.</p>
-
<h3 id=conformance-requirements><span class=secno>2.2 </span>Conformance requirements</h3>
<p>All diagrams, examples, and notes in this specification are non-normative, as are all sections
@@ -3385,11 +3382,18 @@
<ul class=brief><li><dfn id=getting-an-encoding>Getting an encoding</dfn>
<li>The <dfn id=encoder>encoder</dfn> and <dfn id=decoder>decoder</dfn> algorithms for various encodings, including
- the <dfn id=utf-8-encoder>utf-8 encoder</dfn> and <dfn id=utf-8-decoder>utf-8 decoder</dfn>
+ the <dfn id=utf-8-encoder>UTF-8 encoder</dfn> and <dfn id=utf-8-decoder>UTF-8 decoder</dfn>
- </ul><p class=note>The <a href=#utf-8-decoder>utf-8 decoder</a> is distinct from the <i>utf-8 decode
- algorithm</i>. The latter is not used by this specification.</p>
+ <li>The generic <dfn id=decode>decode</dfn> algorithm which takes a byte stream and an encoding and
+ returns a character stream
+ <li>The <dfn id=utf-8-decode>UTF-8 decode</dfn> algorithm which takes a byte stream and returns a character
+ stream, additionally stripping one leading UTF-8 Byte Order Mark (BOM), if any
+
+ </ul><p class=note>The <a href=#utf-8-decoder>UTF-8 decoder</a> is distinct from the <i>UTF-8 decode
+ algorithm</i>. The latter first strips a Byte Order Mark (BOM), if any, and then invokes the
+ former.</p>
+
</dd>
@@ -8446,7 +8450,7 @@
<code><a href=#document>Document</a></code>'s <a href=#origin>origin</a> is not a scheme/host/port tuple, the user agent must
throw a <code><a href=#securityerror>SecurityError</a></code> exception. Otherwise, the user agent must first <a href=#obtain-the-storage-mutex>obtain
the storage mutex</a> and then return the cookie-string for <a href="#the-document's-address">the document's address</a>
- for a "non-HTTP" API, <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error handling</a>. <a href=#refsCOOKIES>[COOKIES]</a>
+ for a "non-HTTP" API, decoded using the <a href=#utf-8-decoder>UTF-8 decoder</a>. <a href=#refsCOOKIES>[COOKIES]</a>
<a class=fingerprint href=#fingerprint><img alt="(This is a fingerprinting vector.)" height=64 src=http://images.whatwg.org/fingerprint.png width=46></a>
</p>
@@ -14643,38 +14647,7 @@
<p>To obtain the Unicode string, the user agent run the following steps:</p>
- <ol><li><p>For each of the rows in the following table, starting with the first one and going
- down, if the file has as many or more bytes available than the number of bytes in the
- first column, and the first bytes of the file match the bytes given in the first column,
- then set <var title="">character encoding</var> to the encoding given in the cell in the
- second column of that row, and jump to the bottom step in this series of steps:</p>
-
- <!-- this table is present in several forms in this file; keep them in sync -->
- <table id=table-script-bom><thead><tr><th>Bytes in Hexadecimal
- <th>Encoding
- <tbody><!-- nobody uses this
- <tr>
- <td>00 00 FE FF
- <td>UTF-32BE
- <tr>
- <td>FF FE 00 00
- <td>UTF-32LE
- --><tr><td>FE FF
- <td>Big-endian UTF-16
- <tr><td>FF FE
- <td>Little-endian UTF-16
- <tr><td>EF BB BF
- <td>UTF-8
- <!-- nobody uses this
- <tr>
- <td>DD 73 66 73
- <td>UTF-EBCDIC
- -->
- </table><p class=note>This step looks for Unicode Byte Order Marks (BOMs).</p>
-
- </li>
-
- <li><p>If the resource's <a href=#content-type title=Content-Type>Content Type metadata</a>, if any,
+ <ol><li><p>If the resource's <a href=#content-type title=Content-Type>Content Type metadata</a>, if any,
specifies a character encoding, and the user agent supports that encoding, then let <var title="">character encoding</var> be that encoding, and jump to the bottom step in this
series of steps.</li>
@@ -14685,10 +14658,21 @@
<li><p>Let <var title="">character encoding</var> be <var><a href="#the-script-block's-fallback-character-encoding">the script block's fallback
character encoding</a></var>.</li>
- <li><p>Convert the file to Unicode using <var>character encoding</var>, following the
- rules for doing so given by the specification for <var><a href="#the-script-block's-type">the script block's
- type</a></var>.</li>
+ <li>
+ <p>If the specification for <var><a href="#the-script-block's-type">the script block's type</a></var> gives specific rules for
+ decoding files in that format to Unicode, follow them, using <var>character
+ encoding</var> as the character encoding specified by higher-level protocols, if
+ necessary.</p> <!-- e.g. XML -->
+
+ <p>Otherwise, <a href=#decode>decode</a> the file to Unicode, using <var>character
+ encoding</var> as the fallback encoding.</p>
+
+ <p class=note>The <a href=#decode>decode</a> algorithm overrides <var>character
+ encoding</var> if the file contains a BOM.</p>
+
+ </li>
+
</ol></dd>
<dt>If the script is from an external file and <var><a href="#the-script-block's-type">the script block's type</a></var> is an
@@ -68758,12 +68742,18 @@
<p>When a user agent is to <dfn id=parse-a-manifest>parse a manifest</dfn>, it means that the user agent must run the
following steps:</p>
- <ol><li><p>Decode the byte stream corresponding with the manifest to be parsed <a href=#decoded-as-utf-8,-with-error-handling title="decoded
- as UTF-8, with error handling">as UTF-8, with error handling</a>. <!--All U+0000 NULL
- characters must be replaced by U+FFFD REPLACEMENT CHARACTERs. (this isn't black-box testable
- since neither U+0000 nor U+FFFD are valid anywhere in the syntax and thus both will be treated
- the same anyway)--></li>
+ <ol><li>
+ <p><a href=#utf-8-decode>UTF-8 decode</a> the byte stream corresponding with the manifest to be parsed.</p>
+
+ <p class=note>The <a href=#utf-8-decode>UTF-8 decode</a> algorithm strips a leading BOM, if any.</p>
+
+ <!--All U+0000 NULL characters must be replaced by U+FFFD REPLACEMENT CHARACTERs. (this isn't
+ black-box testable since neither U+0000 nor U+FFFD are valid anywhere in the syntax and thus
+ both will be treated the same anyway)-->
+
+ </li>
+
<li><p>Let <var title="">base URL</var> be the <a href=#absolute-url>absolute URL</a> representing the
manifest.</li>
@@ -68792,9 +68782,6 @@
<li><p>Let <var title="">position</var> be a pointer into <var title="">input</var>, initially
pointing at the first character.</li>
- <li><p>If <var title="">position</var> is pointing at a U+FEFF BYTE ORDER MARK (BOM) character,
- then advance <var title="">position</var> to the next character.</li>
-
<li><p>If the characters starting from <var title="">position</var> are "CACHE", followed by a
U+0020 SPACE character, followed by "MANIFEST", then advance <var title="">position</var> to the
next character after those. Otherwise, this isn't a cache manifest; abort this algorithm with a
@@ -78794,9 +78781,8 @@
a simple event</a> named <code title=event-error>error</code> at that object. Abort these
steps.</p>
- <p>If the attempt succeeds, then let <var title="">source</var> be the script resource
- <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error handling</a>.
- </p>
+ <p>If the attempt succeeds, then let <var title="">source</var> be the result of running the
+ <a href=#utf-8-decode>UTF-8 decode</a> algorithm on the script resource.</p>
<p>Let <var title="">language</var> be JavaScript.</p>
@@ -79479,10 +79465,8 @@
<code><a href=#networkerror>NetworkError</a></code> exception and abort all these
steps.</p>
- <p>If the attempt succeeds, then let <var title="">source</var> be
- the script resource <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error
- handling</a>.
- </p>
+ <p>If the attempt succeeds, then let <var title="">source</var> be the result of running the
+ <a href=#utf-8-decode>UTF-8 decode</a> algorithm on the script resource.</p>
<p>Let <var title="">language</var> be JavaScript.</p>
@@ -80101,11 +80085,10 @@
<h4 id=event-stream-interpretation><span class=secno>10.2.5 </span>Interpreting an event stream</h4>
- <p>Streams must be <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error
- handling</a>.
- </p>
+ <p>Streams must be decoded using the <a href=#utf-8-decode>UTF-8 decode</a> algorithm.</p>
- <p>One leading U+FEFF BYTE ORDER MARK character must be ignored if any are present.</p>
+ <p class=note>The <a href=#utf-8-decode>UTF-8 decode</a> algorithm strips one leading UTF-8 Byte Order Mark
+ (BOM), if any.</p>
<p>The stream must then be parsed by reading everything line by line, with a U+000D CARRIAGE
RETURN U+000A LINE FEED (CRLF) character pair, a single U+000A LINE FEED (LF) character not
@@ -81115,9 +81098,9 @@
action, whose <code title=dom-CloseEvent-wasClean><a href=#dom-closeevent-wasclean>wasClean</a></code> attribute is initialized to
true if the connection closed <i title="">cleanly</i> and false otherwise, whose <code title=dom-CloseEvent-code><a href=#dom-closeevent-code>code</a></code> attribute is initialized to <i><a href=#the-websocket-connection-close-code>the WebSocket connection
close code</a></i>, and whose <code title=dom-CloseEvent-reason><a href=#dom-closeevent-reason>reason</a></code> attribute is
- initialized to <i><a href=#the-websocket-connection-close-reason>the WebSocket connection close reason</a></i> <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error
- handling</a>, and <a href=#concept-event-dispatch title=concept-event-dispatch>dispatch</a> the event at the
- <code><a href=#websocket>WebSocket</a></code> object. <a href=#refsWSP>[WSP]</a></li>
+ initialized to the result of applying the <a href=#utf-8-decoder>UTF-8 decoder</a> to <i><a href=#the-websocket-connection-close-reason>the WebSocket
+ connection close reason</a></i>, and <a href=#concept-event-dispatch title=concept-event-dispatch>dispatch</a> the event
+ at the <code><a href=#websocket>WebSocket</a></code> object. <a href=#refsWSP>[WSP]</a></li>
</ol><div class=warning>
@@ -84062,6 +84045,7 @@
<h4 id=the-input-byte-stream><span class=secno>12.2.2 </span>The <dfn>input byte stream</dfn></h4>
+<!--CLEANUP-->
<p>The stream of Unicode code points that comprises the input to the
tokenization stage will be initially seen by the user agent as a
stream of bytes (typically coming over the network or from the local
@@ -84079,25 +84063,22 @@
<p>Given a character encoding, the bytes in the <a href=#the-input-byte-stream>input byte
stream</a> must be converted to Unicode code points for the
tokenizer's <a href=#input-stream>input stream</a>, as described by the rules for
- that encoding, except that the leading U+FEFF BYTE ORDER MARK
- character, if any, must not be stripped by the encoding layer (it is
- stripped by the rule below).</p> <!-- this is to prevent two leading
- BOMs from being both stripped, once by the decoder, and once by the
- parser -->
+ that encoding's <a href=#decoder>decoder</a>.</p>
- <p>Bytes or sequences of bytes in the original byte stream that
- could not be converted to Unicode code points must be converted to
- U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
- UTF-8, the bytes must be <a href=#decoded-as-utf-8,-with-error-handling title="decoded as UTF-8, with error
- handling">decoded with the error handling</a> defined in this
- specification.</p>
-
<p class=note>Bytes or sequences of bytes in the original byte
stream that did not conform to the encoding specification (e.g.
invalid UTF-8 byte sequences in a UTF-8 input byte stream) are
errors that conformance checkers are expected to report.</p>
+ <p class=note>Leading Byte Order Marks (BOMs) are not stripped by the decoder algorithms, they
+ are stripped by the algorithm below.</p>
+ <p class=warning>The decoder algorithms describe how to handle invalid input; for security
+ reasons, it is imperative that those rules be followed precisely. Differences in how invalid byte
+ sequences are handled can result in, amongst other problems, script injection vulnerabilities
+ ("XSS").</p>
+
+
<h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</h5>
<p>In some cases, it might be impractical to unambiguously determine the encoding before parsing
@@ -84688,8 +84669,8 @@
UTF-32 in its algorithms; support and use of these encodings can thus lead to unexpected behavior
in implementations of this specification.</p>
- <p>When a user agent is to use the self-describing UTF-16 encoding but no BOM has been found, user
- agents must default to little-endian UTF-16.</p>
+ <p>When a user agent is to use the self-describing UTF-16 encoding but no Byte Order Mark (BOM)
+ has been found, user agents must default to little-endian UTF-16.</p>
<p class=note>The requirement to default UTF-16 to little-endian rather than big-endian is a
<a href=#willful-violation>willful violation</a> of RFC 2781, motivated by a desire for compatibility with legacy
Modified: index
===================================================================
--- index 2013-03-29 18:13:03 UTC (rev 7781)
+++ index 2013-03-29 18:45:27 UTC (rev 7782)
@@ -3068,13 +3068,10 @@
<p class=note>This complexity results from the historical decision to define the DOM API in
terms of 16 bit (UTF-16) <a href=#code-unit title="code unit">code units</a>, rather than in terms of <a href=#unicode-character title="Unicode character">Unicode characters</a>.</p>
- <p>When a byte stream is to be <dfn id=decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error handling</dfn>, the user agent
- must return the result of running the <a href=#utf-8-decoder>utf-8 decoder</a> on that byte stream.</p>
-
<h3 id=conformance-requirements><span class=secno>2.2 </span>Conformance requirements</h3>
<p>All diagrams, examples, and notes in this specification are non-normative, as are all sections
@@ -3385,11 +3382,18 @@
<ul class=brief><li><dfn id=getting-an-encoding>Getting an encoding</dfn>
<li>The <dfn id=encoder>encoder</dfn> and <dfn id=decoder>decoder</dfn> algorithms for various encodings, including
- the <dfn id=utf-8-encoder>utf-8 encoder</dfn> and <dfn id=utf-8-decoder>utf-8 decoder</dfn>
+ the <dfn id=utf-8-encoder>UTF-8 encoder</dfn> and <dfn id=utf-8-decoder>UTF-8 decoder</dfn>
- </ul><p class=note>The <a href=#utf-8-decoder>utf-8 decoder</a> is distinct from the <i>utf-8 decode
- algorithm</i>. The latter is not used by this specification.</p>
+ <li>The generic <dfn id=decode>decode</dfn> algorithm which takes a byte stream and an encoding and
+ returns a character stream
+ <li>The <dfn id=utf-8-decode>UTF-8 decode</dfn> algorithm which takes a byte stream and returns a character
+ stream, additionally stripping one leading UTF-8 Byte Order Mark (BOM), if any
+
+ </ul><p class=note>The <a href=#utf-8-decoder>UTF-8 decoder</a> is distinct from the <i>UTF-8 decode
+ algorithm</i>. The latter first strips a Byte Order Mark (BOM), if any, and then invokes the
+ former.</p>
+
</dd>
@@ -8446,7 +8450,7 @@
<code><a href=#document>Document</a></code>'s <a href=#origin>origin</a> is not a scheme/host/port tuple, the user agent must
throw a <code><a href=#securityerror>SecurityError</a></code> exception. Otherwise, the user agent must first <a href=#obtain-the-storage-mutex>obtain
the storage mutex</a> and then return the cookie-string for <a href="#the-document's-address">the document's address</a>
- for a "non-HTTP" API, <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error handling</a>. <a href=#refsCOOKIES>[COOKIES]</a>
+ for a "non-HTTP" API, decoded using the <a href=#utf-8-decoder>UTF-8 decoder</a>. <a href=#refsCOOKIES>[COOKIES]</a>
<a class=fingerprint href=#fingerprint><img alt="(This is a fingerprinting vector.)" height=64 src=http://images.whatwg.org/fingerprint.png width=46></a>
</p>
@@ -14643,38 +14647,7 @@
<p>To obtain the Unicode string, the user agent run the following steps:</p>
- <ol><li><p>For each of the rows in the following table, starting with the first one and going
- down, if the file has as many or more bytes available than the number of bytes in the
- first column, and the first bytes of the file match the bytes given in the first column,
- then set <var title="">character encoding</var> to the encoding given in the cell in the
- second column of that row, and jump to the bottom step in this series of steps:</p>
-
- <!-- this table is present in several forms in this file; keep them in sync -->
- <table id=table-script-bom><thead><tr><th>Bytes in Hexadecimal
- <th>Encoding
- <tbody><!-- nobody uses this
- <tr>
- <td>00 00 FE FF
- <td>UTF-32BE
- <tr>
- <td>FF FE 00 00
- <td>UTF-32LE
- --><tr><td>FE FF
- <td>Big-endian UTF-16
- <tr><td>FF FE
- <td>Little-endian UTF-16
- <tr><td>EF BB BF
- <td>UTF-8
- <!-- nobody uses this
- <tr>
- <td>DD 73 66 73
- <td>UTF-EBCDIC
- -->
- </table><p class=note>This step looks for Unicode Byte Order Marks (BOMs).</p>
-
- </li>
-
- <li><p>If the resource's <a href=#content-type title=Content-Type>Content Type metadata</a>, if any,
+ <ol><li><p>If the resource's <a href=#content-type title=Content-Type>Content Type metadata</a>, if any,
specifies a character encoding, and the user agent supports that encoding, then let <var title="">character encoding</var> be that encoding, and jump to the bottom step in this
series of steps.</li>
@@ -14685,10 +14658,21 @@
<li><p>Let <var title="">character encoding</var> be <var><a href="#the-script-block's-fallback-character-encoding">the script block's fallback
character encoding</a></var>.</li>
- <li><p>Convert the file to Unicode using <var>character encoding</var>, following the
- rules for doing so given by the specification for <var><a href="#the-script-block's-type">the script block's
- type</a></var>.</li>
+ <li>
+ <p>If the specification for <var><a href="#the-script-block's-type">the script block's type</a></var> gives specific rules for
+ decoding files in that format to Unicode, follow them, using <var>character
+ encoding</var> as the character encoding specified by higher-level protocols, if
+ necessary.</p> <!-- e.g. XML -->
+
+ <p>Otherwise, <a href=#decode>decode</a> the file to Unicode, using <var>character
+ encoding</var> as the fallback encoding.</p>
+
+ <p class=note>The <a href=#decode>decode</a> algorithm overrides <var>character
+ encoding</var> if the file contains a BOM.</p>
+
+ </li>
+
</ol></dd>
<dt>If the script is from an external file and <var><a href="#the-script-block's-type">the script block's type</a></var> is an
@@ -68758,12 +68742,18 @@
<p>When a user agent is to <dfn id=parse-a-manifest>parse a manifest</dfn>, it means that the user agent must run the
following steps:</p>
- <ol><li><p>Decode the byte stream corresponding with the manifest to be parsed <a href=#decoded-as-utf-8,-with-error-handling title="decoded
- as UTF-8, with error handling">as UTF-8, with error handling</a>. <!--All U+0000 NULL
- characters must be replaced by U+FFFD REPLACEMENT CHARACTERs. (this isn't black-box testable
- since neither U+0000 nor U+FFFD are valid anywhere in the syntax and thus both will be treated
- the same anyway)--></li>
+ <ol><li>
+ <p><a href=#utf-8-decode>UTF-8 decode</a> the byte stream corresponding with the manifest to be parsed.</p>
+
+ <p class=note>The <a href=#utf-8-decode>UTF-8 decode</a> algorithm strips a leading BOM, if any.</p>
+
+ <!--All U+0000 NULL characters must be replaced by U+FFFD REPLACEMENT CHARACTERs. (this isn't
+ black-box testable since neither U+0000 nor U+FFFD are valid anywhere in the syntax and thus
+ both will be treated the same anyway)-->
+
+ </li>
+
<li><p>Let <var title="">base URL</var> be the <a href=#absolute-url>absolute URL</a> representing the
manifest.</li>
@@ -68792,9 +68782,6 @@
<li><p>Let <var title="">position</var> be a pointer into <var title="">input</var>, initially
pointing at the first character.</li>
- <li><p>If <var title="">position</var> is pointing at a U+FEFF BYTE ORDER MARK (BOM) character,
- then advance <var title="">position</var> to the next character.</li>
-
<li><p>If the characters starting from <var title="">position</var> are "CACHE", followed by a
U+0020 SPACE character, followed by "MANIFEST", then advance <var title="">position</var> to the
next character after those. Otherwise, this isn't a cache manifest; abort this algorithm with a
@@ -78794,9 +78781,8 @@
a simple event</a> named <code title=event-error>error</code> at that object. Abort these
steps.</p>
- <p>If the attempt succeeds, then let <var title="">source</var> be the script resource
- <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error handling</a>.
- </p>
+ <p>If the attempt succeeds, then let <var title="">source</var> be the result of running the
+ <a href=#utf-8-decode>UTF-8 decode</a> algorithm on the script resource.</p>
<p>Let <var title="">language</var> be JavaScript.</p>
@@ -79479,10 +79465,8 @@
<code><a href=#networkerror>NetworkError</a></code> exception and abort all these
steps.</p>
- <p>If the attempt succeeds, then let <var title="">source</var> be
- the script resource <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error
- handling</a>.
- </p>
+ <p>If the attempt succeeds, then let <var title="">source</var> be the result of running the
+ <a href=#utf-8-decode>UTF-8 decode</a> algorithm on the script resource.</p>
<p>Let <var title="">language</var> be JavaScript.</p>
@@ -80101,11 +80085,10 @@
<h4 id=event-stream-interpretation><span class=secno>10.2.5 </span>Interpreting an event stream</h4>
- <p>Streams must be <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error
- handling</a>.
- </p>
+ <p>Streams must be decoded using the <a href=#utf-8-decode>UTF-8 decode</a> algorithm.</p>
- <p>One leading U+FEFF BYTE ORDER MARK character must be ignored if any are present.</p>
+ <p class=note>The <a href=#utf-8-decode>UTF-8 decode</a> algorithm strips one leading UTF-8 Byte Order Mark
+ (BOM), if any.</p>
<p>The stream must then be parsed by reading everything line by line, with a U+000D CARRIAGE
RETURN U+000A LINE FEED (CRLF) character pair, a single U+000A LINE FEED (LF) character not
@@ -81115,9 +81098,9 @@
action, whose <code title=dom-CloseEvent-wasClean><a href=#dom-closeevent-wasclean>wasClean</a></code> attribute is initialized to
true if the connection closed <i title="">cleanly</i> and false otherwise, whose <code title=dom-CloseEvent-code><a href=#dom-closeevent-code>code</a></code> attribute is initialized to <i><a href=#the-websocket-connection-close-code>the WebSocket connection
close code</a></i>, and whose <code title=dom-CloseEvent-reason><a href=#dom-closeevent-reason>reason</a></code> attribute is
- initialized to <i><a href=#the-websocket-connection-close-reason>the WebSocket connection close reason</a></i> <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error
- handling</a>, and <a href=#concept-event-dispatch title=concept-event-dispatch>dispatch</a> the event at the
- <code><a href=#websocket>WebSocket</a></code> object. <a href=#refsWSP>[WSP]</a></li>
+ initialized to the result of applying the <a href=#utf-8-decoder>UTF-8 decoder</a> to <i><a href=#the-websocket-connection-close-reason>the WebSocket
+ connection close reason</a></i>, and <a href=#concept-event-dispatch title=concept-event-dispatch>dispatch</a> the event
+ at the <code><a href=#websocket>WebSocket</a></code> object. <a href=#refsWSP>[WSP]</a></li>
</ol><div class=warning>
@@ -84062,6 +84045,7 @@
<h4 id=the-input-byte-stream><span class=secno>12.2.2 </span>The <dfn>input byte stream</dfn></h4>
+<!--CLEANUP-->
<p>The stream of Unicode code points that comprises the input to the
tokenization stage will be initially seen by the user agent as a
stream of bytes (typically coming over the network or from the local
@@ -84079,25 +84063,22 @@
<p>Given a character encoding, the bytes in the <a href=#the-input-byte-stream>input byte
stream</a> must be converted to Unicode code points for the
tokenizer's <a href=#input-stream>input stream</a>, as described by the rules for
- that encoding, except that the leading U+FEFF BYTE ORDER MARK
- character, if any, must not be stripped by the encoding layer (it is
- stripped by the rule below).</p> <!-- this is to prevent two leading
- BOMs from being both stripped, once by the decoder, and once by the
- parser -->
+ that encoding's <a href=#decoder>decoder</a>.</p>
- <p>Bytes or sequences of bytes in the original byte stream that
- could not be converted to Unicode code points must be converted to
- U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
- UTF-8, the bytes must be <a href=#decoded-as-utf-8,-with-error-handling title="decoded as UTF-8, with error
- handling">decoded with the error handling</a> defined in this
- specification.</p>
-
<p class=note>Bytes or sequences of bytes in the original byte
stream that did not conform to the encoding specification (e.g.
invalid UTF-8 byte sequences in a UTF-8 input byte stream) are
errors that conformance checkers are expected to report.</p>
+ <p class=note>Leading Byte Order Marks (BOMs) are not stripped by the decoder algorithms, they
+ are stripped by the algorithm below.</p>
+ <p class=warning>The decoder algorithms describe how to handle invalid input; for security
+ reasons, it is imperative that those rules be followed precisely. Differences in how invalid byte
+ sequences are handled can result in, amongst other problems, script injection vulnerabilities
+ ("XSS").</p>
+
+
<h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</h5>
<p>In some cases, it might be impractical to unambiguously determine the encoding before parsing
@@ -84688,8 +84669,8 @@
UTF-32 in its algorithms; support and use of these encodings can thus lead to unexpected behavior
in implementations of this specification.</p>
- <p>When a user agent is to use the self-describing UTF-16 encoding but no BOM has been found, user
- agents must default to little-endian UTF-16.</p>
+ <p>When a user agent is to use the self-describing UTF-16 encoding but no Byte Order Mark (BOM)
+ has been found, user agents must default to little-endian UTF-16.</p>
<p class=note>The requirement to default UTF-16 to little-endian rather than big-endian is a
<a href=#willful-violation>willful violation</a> of RFC 2781, motivated by a desire for compatibility with legacy
Modified: source
===================================================================
--- source 2013-03-29 18:13:03 UTC (rev 7781)
+++ source 2013-03-29 18:45:27 UTC (rev 7782)
@@ -1856,11 +1856,8 @@
terms of 16 bit (UTF-16) <span title="code unit">code units</span>, rather than in terms of <span
title="Unicode character">Unicode characters</span>.</p>
- <p>When a byte stream is to be <dfn>decoded as UTF-8, with error handling</dfn>, the user agent
- must return the result of running the <span>utf-8 decoder</span> on that byte stream.</p>
-
<!--END dev-html-->
<h3>Conformance requirements</h3>
@@ -2189,12 +2186,19 @@
<li><dfn>Getting an encoding</dfn>
<li>The <dfn>encoder</dfn> and <dfn>decoder</dfn> algorithms for various encodings, including
- the <dfn>utf-8 encoder</dfn> and <dfn>utf-8 decoder</dfn>
+ the <dfn>UTF-8 encoder</dfn> and <dfn>UTF-8 decoder</dfn>
+ <li>The generic <dfn>decode</dfn> algorithm which takes a byte stream and an encoding and
+ returns a character stream
+
+ <li>The <dfn>UTF-8 decode</dfn> algorithm which takes a byte stream and returns a character
+ stream, additionally stripping one leading UTF-8 Byte Order Mark (BOM), if any
+
</ul>
- <p class="note">The <span>utf-8 decoder</span> is distinct from the <i>utf-8 decode
- algorithm</i>. The latter is not used by this specification.</p>
+ <p class="note">The <span>UTF-8 decoder</span> is distinct from the <i>UTF-8 decode
+ algorithm</i>. The latter first strips a Byte Order Mark (BOM), if any, and then invokes the
+ former.</p>
</dd>
@@ -8172,7 +8176,7 @@
<code>Document</code>'s <span>origin</span> is not a scheme/host/port tuple, the user agent must
throw a <code>SecurityError</code> exception. Otherwise, the user agent must first <span>obtain
the storage mutex</span> and then return the cookie-string for <span>the document's address</span>
- for a "non-HTTP" API, <span>decoded as UTF-8, with error handling</span>. <a
+ for a "non-HTTP" API, decoded using the <span>UTF-8 decoder</span>. <a
href="#refsCOOKIES">[COOKIES]</a>
<!--INSERT FINGERPRINT-->
</p>
@@ -15219,47 +15223,6 @@
<ol>
- <li><p>For each of the rows in the following table, starting with the first one and going
- down, if the file has as many or more bytes available than the number of bytes in the
- first column, and the first bytes of the file match the bytes given in the first column,
- then set <var title="">character encoding</var> to the encoding given in the cell in the
- second column of that row, and jump to the bottom step in this series of steps:</p>
-
- <!-- this table is present in several forms in this file; keep them in sync -->
- <table id="table-script-bom">
- <thead>
- <tr>
- <th>Bytes in Hexadecimal
- <th>Encoding
- <tbody>
- <!-- nobody uses this
- <tr>
- <td>00 00 FE FF
- <td>UTF-32BE
- <tr>
- <td>FF FE 00 00
- <td>UTF-32LE
- -->
- <tr>
- <td>FE FF
- <td>Big-endian UTF-16
- <tr>
- <td>FF FE
- <td>Little-endian UTF-16
- <tr>
- <td>EF BB BF
- <td>UTF-8
- <!-- nobody uses this
- <tr>
- <td>DD 73 66 73
- <td>UTF-EBCDIC
- -->
- </table>
-
- <p class="note">This step looks for Unicode Byte Order Marks (BOMs).</p>
-
- </li>
-
<li><p>If the resource's <span title="Content-Type">Content Type metadata</span>, if any,
specifies a character encoding, and the user agent supports that encoding, then let <var
title="">character encoding</var> be that encoding, and jump to the bottom step in this
@@ -15272,10 +15235,21 @@
<li><p>Let <var title="">character encoding</var> be <var>the script block's fallback
character encoding</var>.</p></li>
- <li><p>Convert the file to Unicode using <var>character encoding</var>, following the
- rules for doing so given by the specification for <var>the script block's
- type</var>.</p></li>
+ <li>
+ <p>If the specification for <var>the script block's type</var> gives specific rules for
+ decoding files in that format to Unicode, follow them, using <var>character
+ encoding</var> as the character encoding specified by higher-level protocols, if
+ necessary.</p> <!-- e.g. XML -->
+
+ <p>Otherwise, <span>decode</span> the file to Unicode, using <var>character
+ encoding</var> as the fallback encoding.</p>
+
+ <p class="note">The <span>decode</span> algorithm overrides <var>character
+ encoding</var> if the file contains a BOM.</p>
+
+ </li>
+
</ol>
</dd>
@@ -81672,12 +81646,18 @@
<ol>
- <li><p>Decode the byte stream corresponding with the manifest to be parsed <span title="decoded
- as UTF-8, with error handling">as UTF-8, with error handling</span>. <!--All U+0000 NULL
- characters must be replaced by U+FFFD REPLACEMENT CHARACTERs. (this isn't black-box testable
- since neither U+0000 nor U+FFFD are valid anywhere in the syntax and thus both will be treated
- the same anyway)--></p></li>
+ <li>
+ <p><span>UTF-8 decode</span> the byte stream corresponding with the manifest to be parsed.</p>
+
+ <p class="note">The <span>UTF-8 decode</span> algorithm strips a leading BOM, if any.</p>
+
+ <!--All U+0000 NULL characters must be replaced by U+FFFD REPLACEMENT CHARACTERs. (this isn't
+ black-box testable since neither U+0000 nor U+FFFD are valid anywhere in the syntax and thus
+ both will be treated the same anyway)-->
+
+ </li>
+
<li><p>Let <var title="">base URL</var> be the <span>absolute URL</span> representing the
manifest.</p></li>
@@ -81709,9 +81689,6 @@
<li><p>Let <var title="">position</var> be a pointer into <var title="">input</var>, initially
pointing at the first character.</p></li>
- <li><p>If <var title="">position</var> is pointing at a U+FEFF BYTE ORDER MARK (BOM) character,
- then advance <var title="">position</var> to the next character.</p></li>
-
<li><p>If the characters starting from <var title="">position</var> are "CACHE", followed by a
U+0020 SPACE character, followed by "MANIFEST", then advance <var title="">position</var> to the
next character after those. Otherwise, this isn't a cache manifest; abort this algorithm with a
@@ -92603,12 +92580,8 @@
a simple event</span> named <code title="event-error">error</code> at that object. Abort these
steps.</p>
- <p>If the attempt succeeds, then let <var title="">source</var> be the script resource
- <span>decoded as UTF-8, with error handling</span>.
- <!--END complete-->
- <a href="#refsHTML">[HTML]</a>
- <!--START complete-->
- </p>
+ <p>If the attempt succeeds, then let <var title="">source</var> be the result of running the
+ <span>UTF-8 decode</span> algorithm on the script resource.</p>
<p>Let <var title="">language</var> be JavaScript.</p>
@@ -93409,13 +93382,8 @@
<code>NetworkError</code> exception and abort all these
steps.</p>
- <p>If the attempt succeeds, then let <var title="">source</var> be
- the script resource <span>decoded as UTF-8, with error
- handling</span>.
- <!--END complete-->
- <a href="#refsHTML">[HTML]</a>
- <!--START complete-->
- </p>
+ <p>If the attempt succeeds, then let <var title="">source</var> be the result of running the
+ <span>UTF-8 decode</span> algorithm on the script resource.</p>
<p>Let <var title="">language</var> be JavaScript.</p>
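
Reviewer note on the two worker hunks above: both now route the script resource through the "UTF-8 decode" algorithm, which, per the notes elsewhere in this diff, strips one leading BOM. A rough Python approximation is sketched below, assuming that Python's `utf-8-sig` codec with `errors="replace"` is close enough for illustration. The function name `utf8_decode` is hypothetical, and exact U+FFFD placement for malformed input may differ from the Encoding specification.

```python
def utf8_decode(data: bytes) -> str:
    """Approximate 'UTF-8 decode': drop one leading BOM, replace bad bytes."""
    # "utf-8-sig" removes a single leading EF BB BF if present;
    # errors="replace" turns malformed sequences into U+FFFD.
    return data.decode("utf-8-sig", errors="replace")
```

Only the first BOM is removed: a second EF BB BF sequence survives as U+FEFF in the result.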
@@ -94148,14 +94116,10 @@
<h4 id="event-stream-interpretation">Interpreting an event stream</h4>
- <p>Streams must be <span>decoded as UTF-8, with error
- handling</span>.
- <!--END complete-->
- <a href="#refsHTML">[HTML]</a>
- <!--START complete-->
- </p>
+ <p>Streams must be decoded using the <span>UTF-8 decode</span> algorithm.</p>
- <p>One leading U+FEFF BYTE ORDER MARK character must be ignored if any are present.</p>
+ <p class="note">The <span>UTF-8 decode</span> algorithm strips one leading UTF-8 Byte Order Mark
+ (BOM), if any.</p>
<p>The stream must then be parsed by reading everything line by line, with a U+000D CARRIAGE
RETURN U+000A LINE FEED (CRLF) character pair, a single U+000A LINE FEED (LF) character not
@@ -95353,9 +95317,9 @@
true if the connection closed <i title="">cleanly</i> and false otherwise, whose <code
title="dom-CloseEvent-code">code</code> attribute is initialized to <i>the WebSocket connection
close code</i>, and whose <code title="dom-CloseEvent-reason">reason</code> attribute is
- initialized to <i>the WebSocket connection close reason</i> <span>decoded as UTF-8, with error
- handling</span>, and <span title="concept-event-dispatch">dispatch</span> the event at the
- <code>WebSocket</code> object. <a href="#refsWSP">[WSP]</a></p></li>
+ initialized to the result of applying the <span>UTF-8 decoder</span> to <i>the WebSocket
+ connection close reason</i>, and <span title="concept-event-dispatch">dispatch</span> the event
+ at the <code>WebSocket</code> object. <a href="#refsWSP">[WSP]</a></p></li>
</ol>
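
Reviewer note on the WebSocket hunk above: the close reason goes through the "UTF-8 decoder" rather than "UTF-8 decode". A later hunk in this diff notes that the decoder algorithms do not strip BOMs, so a leading EF BB BF would remain in the result as U+FEFF. The sketch below illustrates that difference using Python's plain `utf-8` codec as a stand-in; the function name `utf8_decoder` is hypothetical.

```python
def utf8_decoder(data: bytes) -> str:
    """Approximate the bare 'UTF-8 decoder': no BOM handling at all."""
    # The plain "utf-8" codec keeps a leading BOM as U+FEFF;
    # errors="replace" turns malformed sequences into U+FFFD.
    return data.decode("utf-8", errors="replace")
```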
@@ -98691,6 +98655,7 @@
<h4>The <dfn>input byte stream</dfn></h4>
+<!--CLEANUP-->
<p>The stream of Unicode code points that comprises the input to the
tokenization stage will be initially seen by the user agent as a
stream of bytes (typically coming over the network or from the local
@@ -98709,25 +98674,22 @@
<p>Given a character encoding, the bytes in the <span>input byte
stream</span> must be converted to Unicode code points for the
tokenizer's <span>input stream</span>, as described by the rules for
- that encoding, except that the leading U+FEFF BYTE ORDER MARK
- character, if any, must not be stripped by the encoding layer (it is
- stripped by the rule below).</p> <!-- this is to prevent two leading
- BOMs from being both stripped, once by the decoder, and once by the
- parser -->
+ that encoding's <span>decoder</span>.</p>
- <p>Bytes or sequences of bytes in the original byte stream that
- could not be converted to Unicode code points must be converted to
- U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
- UTF-8, the bytes must be <span title="decoded as UTF-8, with error
- handling">decoded with the error handling</span> defined in this
- specification.</p>
-
<p class="note">Bytes or sequences of bytes in the original byte
stream that did not conform to the encoding specification (e.g.
invalid UTF-8 byte sequences in a UTF-8 input byte stream) are
errors that conformance checkers are expected to report.</p>
+ <p class="note">Leading Byte Order Marks (BOMs) are not stripped by the decoder algorithms, they
+ are stripped by the algorithm below.</p>
+ <p class="warning">The decoder algorithms describe how to handle invalid input; for security
+ reasons, it is imperative that those rules be followed precisely. Differences in how invalid byte
+ sequences are handled can result in, amongst other problems, script injection vulnerabilities
+ ("XSS").</p>
+
+
<h5>Determining the character encoding</h5>
<p>In some cases, it might be impractical to unambiguously determine the encoding before parsing
@@ -99452,8 +99414,8 @@
UTF-32 in its algorithms; support and use of these encodings can thus lead to unexpected behavior
in implementations of this specification.</p>
- <p>When a user agent is to use the self-describing UTF-16 encoding but no BOM has been found, user
- agents must default to little-endian UTF-16.</p>
+ <p>When a user agent is to use the self-describing UTF-16 encoding but no Byte Order Mark (BOM)
+ has been found, user agents must default to little-endian UTF-16.</p>
<p class="note">The requirement to default UTF-16 to little-endian rather than big-endian is a
<span>willful violation</span> of RFC 2781, motivated by a desire for compatibility with legacy