[html5] r7782 - [giow] (2) Strip a leading BOM from scripts in workers, if any. Also, use more o [...]

whatwg at whatwg.org
Fri Mar 29 11:45:28 PDT 2013


Author: ianh
Date: 2013-03-29 11:45:27 -0700 (Fri, 29 Mar 2013)
New Revision: 7782

Modified:
   complete.html
   index
   source
Log:
[giow] (2) Strip a leading BOM from scripts in workers, if any. Also, use more of the encoding spec.
Fixing https://www.w3.org/Bugs/Public/show_bug.cgi?id=17839
Affected topics: DOM APIs, HTML, HTML Syntax and Parsing, Offline Web Applications, Workers
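
Illustration (not part of the commit): the log above relies on the difference between the "UTF-8 decoder" (no BOM handling), the "UTF-8 decode" algorithm (strips one leading BOM, then invokes the decoder), and the generic "decode" algorithm (a BOM overrides the fallback encoding). The Python sketch below is only a rough model of that distinction. The function names are hypothetical, Python's built-in codecs stand in for the Encoding spec's decoders and may differ from them in edge cases of error handling, and consuming the BOM inside decode() is an assumption of this sketch.

```python
import codecs

# BOM byte sequences, in the order of the table removed by this commit.
BOMS = (
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
)


def utf8_decoder(data: bytes) -> str:
    """Model of the 'UTF-8 decoder': invalid bytes become U+FFFD.

    A leading BOM is NOT stripped; it survives as U+FEFF.
    """
    return data.decode("utf-8", errors="replace")


def utf8_decode(data: bytes) -> str:
    """Model of 'UTF-8 decode': strip one leading UTF-8 BOM, if any,
    then invoke the decoder."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return utf8_decoder(data)


def decode(data: bytes, fallback: str) -> str:
    """Model of the generic 'decode': a BOM overrides the fallback encoding.

    Assumption of this sketch: the BOM bytes are consumed, not decoded.
    """
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding, errors="replace")
    return data.decode(fallback, errors="replace")


if __name__ == "__main__":
    worker_script = codecs.BOM_UTF8 + b"postMessage('hi');"
    print(repr(utf8_decoder(worker_script)))  # BOM kept as U+FEFF
    print(repr(utf8_decode(worker_script)))   # BOM stripped
    print(repr(decode(b"\xff\xfea\x00", "windows-1252")))  # BOM wins over fallback
```

In this model only the first BOM is removed by utf8_decode(), matching the "one leading" wording in the log; callers such as document.cookie and the WebSocket close reason, which the commit points at the decoder instead, would correspond to utf8_decoder().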

Modified: complete.html
===================================================================
--- complete.html	2013-03-29 18:13:03 UTC (rev 7781)
+++ complete.html	2013-03-29 18:45:27 UTC (rev 7782)
@@ -3068,13 +3068,10 @@
   <p class=note>This complexity results from the historical decision to define the DOM API in
   terms of 16 bit (UTF-16) <a href=#code-unit title="code unit">code units</a>, rather than in terms of <a href=#unicode-character title="Unicode character">Unicode characters</a>.</p>
 
-  <p>When a byte stream is to be <dfn id=decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error handling</dfn>, the user agent
-  must return the result of running the <a href=#utf-8-decoder>utf-8 decoder</a> on that byte stream.</p>
 
 
 
 
-
   <h3 id=conformance-requirements><span class=secno>2.2 </span>Conformance requirements</h3>
 
   <p>All diagrams, examples, and notes in this specification are non-normative, as are all sections
@@ -3385,11 +3382,18 @@
     <ul class=brief><li><dfn id=getting-an-encoding>Getting an encoding</dfn>
 
      <li>The <dfn id=encoder>encoder</dfn> and <dfn id=decoder>decoder</dfn> algorithms for various encodings, including
-     the <dfn id=utf-8-encoder>utf-8 encoder</dfn> and <dfn id=utf-8-decoder>utf-8 decoder</dfn>
+     the <dfn id=utf-8-encoder>UTF-8 encoder</dfn> and <dfn id=utf-8-decoder>UTF-8 decoder</dfn>
 
-    </ul><p class=note>The <a href=#utf-8-decoder>utf-8 decoder</a> is distinct from the <i>utf-8 decode
-    algorithm</i>. The latter is not used by this specification.</p>
+     <li>The generic <dfn id=decode>decode</dfn> algorithm which takes a byte stream and an encoding and
+     returns a character stream
 
+     <li>The <dfn id=utf-8-decode>UTF-8 decode</dfn> algorithm which takes a byte stream and returns a character
+     stream, additionally stripping one leading UTF-8 Byte Order Mark (BOM), if any
+
+    </ul><p class=note>The <a href=#utf-8-decoder>UTF-8 decoder</a> is distinct from the <i>UTF-8 decode
+    algorithm</i>. The latter first strips a Byte Order Mark (BOM), if any, and then invokes the
+    former.</p>
+
    </dd>
 
 
@@ -8446,7 +8450,7 @@
   <code><a href=#document>Document</a></code>'s <a href=#origin>origin</a> is not a scheme/host/port tuple, the user agent must
   throw a <code><a href=#securityerror>SecurityError</a></code> exception. Otherwise, the user agent must first <a href=#obtain-the-storage-mutex>obtain
   the storage mutex</a> and then return the cookie-string for <a href="#the-document's-address">the document's address</a>
-  for a "non-HTTP" API, <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error handling</a>. <a href=#refsCOOKIES>[COOKIES]</a>
+  for a "non-HTTP" API, decoded using the <a href=#utf-8-decoder>UTF-8 decoder</a>. <a href=#refsCOOKIES>[COOKIES]</a>
   <a class=fingerprint href=#fingerprint><img alt="(This is a fingerprinting vector.)" height=64 src=http://images.whatwg.org/fingerprint.png width=46></a>
   </p>
 
@@ -14643,38 +14647,7 @@
 
           <p>To obtain the Unicode string, the user agent run the following steps:</p>
 
-          <ol><li><p>For each of the rows in the following table, starting with the first one and going
-           down, if the file has as many or more bytes available than the number of bytes in the
-           first column, and the first bytes of the file match the bytes given in the first column,
-           then set <var title="">character encoding</var> to the encoding given in the cell in the
-           second column of that row, and jump to the bottom step in this series of steps:</p>
-
-            <!-- this table is present in several forms in this file; keep them in sync -->
-            <table id=table-script-bom><thead><tr><th>Bytes in Hexadecimal
-               <th>Encoding
-             <tbody><!-- nobody uses this
-              <tr>
-               <td>00 00 FE FF
-               <td>UTF-32BE
-              <tr>
-               <td>FF FE 00 00
-               <td>UTF-32LE
-    --><tr><td>FE FF
-               <td>Big-endian UTF-16
-              <tr><td>FF FE
-               <td>Little-endian UTF-16
-              <tr><td>EF BB BF
-               <td>UTF-8
-    <!-- nobody uses this
-              <tr>
-               <td>DD 73 66 73
-               <td>UTF-EBCDIC
-    -->
-            </table><p class=note>This step looks for Unicode Byte Order Marks (BOMs).</p>
-
-           </li>
-
-           <li><p>If the resource's <a href=#content-type title=Content-Type>Content Type metadata</a>, if any,
+          <ol><li><p>If the resource's <a href=#content-type title=Content-Type>Content Type metadata</a>, if any,
            specifies a character encoding, and the user agent supports that encoding, then let <var title="">character encoding</var> be that encoding, and jump to the bottom step in this
            series of steps.</li>
 
@@ -14685,10 +14658,21 @@
            <li><p>Let <var title="">character encoding</var> be <var><a href="#the-script-block's-fallback-character-encoding">the script block's fallback
            character encoding</a></var>.</li>
 
-           <li><p>Convert the file to Unicode using <var>character encoding</var>, following the
-           rules for doing so given by the specification for <var><a href="#the-script-block's-type">the script block's
-           type</a></var>.</li>
+           <li>
 
+            <p>If the specification for <var><a href="#the-script-block's-type">the script block's type</a></var> gives specific rules for
+            decoding files in that format to Unicode, follow them, using <var>character
+            encoding</var> as the character encoding specified by higher-level protocols, if
+            necessary.</p> <!-- e.g. XML -->
+
+            <p>Otherwise, <a href=#decode>decode</a> the file to Unicode, using <var>character
+            encoding</var> as the fallback encoding.</p>
+
+            <p class=note>The <a href=#decode>decode</a> algorithm overrides <var>character
+            encoding</var> if the file contains a BOM.</p>
+
+           </li>
+
           </ol></dd>
 
          <dt>If the script is from an external file and <var><a href="#the-script-block's-type">the script block's type</a></var> is an
@@ -68758,12 +68742,18 @@
   <p>When a user agent is to <dfn id=parse-a-manifest>parse a manifest</dfn>, it means that the user agent must run the
   following steps:</p>
 
-  <ol><li><p>Decode the byte stream corresponding with the manifest to be parsed <a href=#decoded-as-utf-8,-with-error-handling title="decoded
-   as UTF-8, with error handling">as UTF-8, with error handling</a>. <!--All U+0000 NULL
-   characters must be replaced by U+FFFD REPLACEMENT CHARACTERs. (this isn't black-box testable
-   since neither U+0000 nor U+FFFD are valid anywhere in the syntax and thus both will be treated
-   the same anyway)--></li>
+  <ol><li>
 
+    <p><a href=#utf-8-decode>UTF-8 decode</a> the byte stream corresponding with the manifest to be parsed.</p>
+
+    <p class=note>The <a href=#utf-8-decode>UTF-8 decode</a> algorithm strips a leading BOM, if any.</p>
+
+    <!--All U+0000 NULL characters must be replaced by U+FFFD REPLACEMENT CHARACTERs. (this isn't
+    black-box testable since neither U+0000 nor U+FFFD are valid anywhere in the syntax and thus
+    both will be treated the same anyway)-->
+
+   </li>
+
    <li><p>Let <var title="">base URL</var> be the <a href=#absolute-url>absolute URL</a> representing the
    manifest.</li>
 
@@ -68792,9 +68782,6 @@
    <li><p>Let <var title="">position</var> be a pointer into <var title="">input</var>, initially
    pointing at the first character.</li>
 
-   <li><p>If <var title="">position</var> is pointing at a U+FEFF BYTE ORDER MARK (BOM) character,
-   then advance <var title="">position</var> to the next character.</li>
-
    <li><p>If the characters starting from <var title="">position</var> are "CACHE", followed by a
    U+0020 SPACE character, followed by "MANIFEST", then advance <var title="">position</var> to the
    next character after those. Otherwise, this isn't a cache manifest; abort this algorithm with a
@@ -78794,9 +78781,8 @@
     a simple event</a> named <code title=event-error>error</code> at that object. Abort these
     steps.</p>
 
-    <p>If the attempt succeeds, then let <var title="">source</var> be the script resource
-    <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error handling</a>.
-    </p>
+    <p>If the attempt succeeds, then let <var title="">source</var> be the result of running the
+    <a href=#utf-8-decode>UTF-8 decode</a> algorithm on the script resource.</p>
 
     <p>Let <var title="">language</var> be JavaScript.</p>
 
@@ -79479,10 +79465,8 @@
       <code><a href=#networkerror>NetworkError</a></code> exception and abort all these
       steps.</p>
 
-      <p>If the attempt succeeds, then let <var title="">source</var> be
-      the script resource <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error
-      handling</a>.
-      </p>
+      <p>If the attempt succeeds, then let <var title="">source</var> be the result of running the
+      <a href=#utf-8-decode>UTF-8 decode</a> algorithm on the script resource.</p>
 
       <p>Let <var title="">language</var> be JavaScript.</p>
 
@@ -80101,11 +80085,10 @@
 
   <h4 id=event-stream-interpretation><span class=secno>10.2.5 </span>Interpreting an event stream</h4>
 
-  <p>Streams must be <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error
-  handling</a>.
-  </p>
+  <p>Streams must be decoded using the <a href=#utf-8-decode>UTF-8 decode</a> algorithm.</p>
 
-  <p>One leading U+FEFF BYTE ORDER MARK character must be ignored if any are present.</p>
+  <p class=note>The <a href=#utf-8-decode>UTF-8 decode</a> algorithm strips one leading UTF-8 Byte Order Mark
+  (BOM), if any.</p>
 
   <p>The stream must then be parsed by reading everything line by line, with a U+000D CARRIAGE
   RETURN U+000A LINE FEED (CRLF) character pair, a single U+000A LINE FEED (LF) character not
@@ -81115,9 +81098,9 @@
    action, whose <code title=dom-CloseEvent-wasClean><a href=#dom-closeevent-wasclean>wasClean</a></code> attribute is initialized to
    true if the connection closed <i title="">cleanly</i> and false otherwise, whose <code title=dom-CloseEvent-code><a href=#dom-closeevent-code>code</a></code> attribute is initialized to <i><a href=#the-websocket-connection-close-code>the WebSocket connection
    close code</a></i>, and whose <code title=dom-CloseEvent-reason><a href=#dom-closeevent-reason>reason</a></code> attribute is
-   initialized to <i><a href=#the-websocket-connection-close-reason>the WebSocket connection close reason</a></i> <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error
-   handling</a>, and <a href=#concept-event-dispatch title=concept-event-dispatch>dispatch</a> the event at the
-   <code><a href=#websocket>WebSocket</a></code> object. <a href=#refsWSP>[WSP]</a></li>
+   initialized to the result of applying the <a href=#utf-8-decoder>UTF-8 decoder</a> to <i><a href=#the-websocket-connection-close-reason>the WebSocket
+   connection close reason</a></i>, and <a href=#concept-event-dispatch title=concept-event-dispatch>dispatch</a> the event
+   at the <code><a href=#websocket>WebSocket</a></code> object. <a href=#refsWSP>[WSP]</a></li>
 
   </ol><div class=warning>
 
@@ -84062,6 +84045,7 @@
 
   <h4 id=the-input-byte-stream><span class=secno>12.2.2 </span>The <dfn>input byte stream</dfn></h4>
 
+<!--CLEANUP-->
   <p>The stream of Unicode code points that comprises the input to the
   tokenization stage will be initially seen by the user agent as a
   stream of bytes (typically coming over the network or from the local
@@ -84079,25 +84063,22 @@
   <p>Given a character encoding, the bytes in the <a href=#the-input-byte-stream>input byte
   stream</a> must be converted to Unicode code points for the
   tokenizer's <a href=#input-stream>input stream</a>, as described by the rules for
-  that encoding, except that the leading U+FEFF BYTE ORDER MARK
-  character, if any, must not be stripped by the encoding layer (it is
-  stripped by the rule below).</p> <!-- this is to prevent two leading
-  BOMs from being both stripped, once by the decoder, and once by the
-  parser -->
+  that encoding's <a href=#decoder>decoder</a>.</p>
 
-  <p>Bytes or sequences of bytes in the original byte stream that
-  could not be converted to Unicode code points must be converted to
-  U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
-  UTF-8, the bytes must be <a href=#decoded-as-utf-8,-with-error-handling title="decoded as UTF-8, with error
-  handling">decoded with the error handling</a> defined in this
-  specification.</p>
-
   <p class=note>Bytes or sequences of bytes in the original byte
   stream that did not conform to the encoding specification (e.g.
   invalid UTF-8 byte sequences in a UTF-8 input byte stream) are
   errors that conformance checkers are expected to report.</p>
 
+  <p class=note>Leading Byte Order Marks (BOMs) are not stripped by the decoder algorithms, they
+  are stripped by the algorithm below.</p>
 
+  <p class=warning>The decoder algorithms describe how to handle invalid input; for security
+  reasons, it is imperative that those rules be followed precisely. Differences in how invalid byte
+  sequences are handled can result in, amongst other problems, script injection vulnerabilities
+  ("XSS").</p>
+
+
   <h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</h5>
 
   <p>In some cases, it might be impractical to unambiguously determine the encoding before parsing
@@ -84688,8 +84669,8 @@
   UTF-32 in its algorithms; support and use of these encodings can thus lead to unexpected behavior
   in implementations of this specification.</p>
 
-  <p>When a user agent is to use the self-describing UTF-16 encoding but no BOM has been found, user
-  agents must default to little-endian UTF-16.</p>
+  <p>When a user agent is to use the self-describing UTF-16 encoding but no Byte Order Mark (BOM)
+  has been found, user agents must default to little-endian UTF-16.</p>
 
   <p class=note>The requirement to default UTF-16 to little-endian rather than big-endian is a
   <a href=#willful-violation>willful violation</a> of RFC 2781, motivated by a desire for compatibility with legacy

Modified: index
===================================================================
--- index	2013-03-29 18:13:03 UTC (rev 7781)
+++ index	2013-03-29 18:45:27 UTC (rev 7782)
@@ -3068,13 +3068,10 @@
   <p class=note>This complexity results from the historical decision to define the DOM API in
   terms of 16 bit (UTF-16) <a href=#code-unit title="code unit">code units</a>, rather than in terms of <a href=#unicode-character title="Unicode character">Unicode characters</a>.</p>
 
-  <p>When a byte stream is to be <dfn id=decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error handling</dfn>, the user agent
-  must return the result of running the <a href=#utf-8-decoder>utf-8 decoder</a> on that byte stream.</p>
 
 
 
 
-
   <h3 id=conformance-requirements><span class=secno>2.2 </span>Conformance requirements</h3>
 
   <p>All diagrams, examples, and notes in this specification are non-normative, as are all sections
@@ -3385,11 +3382,18 @@
     <ul class=brief><li><dfn id=getting-an-encoding>Getting an encoding</dfn>
 
      <li>The <dfn id=encoder>encoder</dfn> and <dfn id=decoder>decoder</dfn> algorithms for various encodings, including
-     the <dfn id=utf-8-encoder>utf-8 encoder</dfn> and <dfn id=utf-8-decoder>utf-8 decoder</dfn>
+     the <dfn id=utf-8-encoder>UTF-8 encoder</dfn> and <dfn id=utf-8-decoder>UTF-8 decoder</dfn>
 
-    </ul><p class=note>The <a href=#utf-8-decoder>utf-8 decoder</a> is distinct from the <i>utf-8 decode
-    algorithm</i>. The latter is not used by this specification.</p>
+     <li>The generic <dfn id=decode>decode</dfn> algorithm which takes a byte stream and an encoding and
+     returns a character stream
 
+     <li>The <dfn id=utf-8-decode>UTF-8 decode</dfn> algorithm which takes a byte stream and returns a character
+     stream, additionally stripping one leading UTF-8 Byte Order Mark (BOM), if any
+
+    </ul><p class=note>The <a href=#utf-8-decoder>UTF-8 decoder</a> is distinct from the <i>UTF-8 decode
+    algorithm</i>. The latter first strips a Byte Order Mark (BOM), if any, and then invokes the
+    former.</p>
+
    </dd>
 
 
@@ -8446,7 +8450,7 @@
   <code><a href=#document>Document</a></code>'s <a href=#origin>origin</a> is not a scheme/host/port tuple, the user agent must
   throw a <code><a href=#securityerror>SecurityError</a></code> exception. Otherwise, the user agent must first <a href=#obtain-the-storage-mutex>obtain
   the storage mutex</a> and then return the cookie-string for <a href="#the-document's-address">the document's address</a>
-  for a "non-HTTP" API, <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error handling</a>. <a href=#refsCOOKIES>[COOKIES]</a>
+  for a "non-HTTP" API, decoded using the <a href=#utf-8-decoder>UTF-8 decoder</a>. <a href=#refsCOOKIES>[COOKIES]</a>
   <a class=fingerprint href=#fingerprint><img alt="(This is a fingerprinting vector.)" height=64 src=http://images.whatwg.org/fingerprint.png width=46></a>
   </p>
 
@@ -14643,38 +14647,7 @@
 
           <p>To obtain the Unicode string, the user agent run the following steps:</p>
 
-          <ol><li><p>For each of the rows in the following table, starting with the first one and going
-           down, if the file has as many or more bytes available than the number of bytes in the
-           first column, and the first bytes of the file match the bytes given in the first column,
-           then set <var title="">character encoding</var> to the encoding given in the cell in the
-           second column of that row, and jump to the bottom step in this series of steps:</p>
-
-            <!-- this table is present in several forms in this file; keep them in sync -->
-            <table id=table-script-bom><thead><tr><th>Bytes in Hexadecimal
-               <th>Encoding
-             <tbody><!-- nobody uses this
-              <tr>
-               <td>00 00 FE FF
-               <td>UTF-32BE
-              <tr>
-               <td>FF FE 00 00
-               <td>UTF-32LE
-    --><tr><td>FE FF
-               <td>Big-endian UTF-16
-              <tr><td>FF FE
-               <td>Little-endian UTF-16
-              <tr><td>EF BB BF
-               <td>UTF-8
-    <!-- nobody uses this
-              <tr>
-               <td>DD 73 66 73
-               <td>UTF-EBCDIC
-    -->
-            </table><p class=note>This step looks for Unicode Byte Order Marks (BOMs).</p>
-
-           </li>
-
-           <li><p>If the resource's <a href=#content-type title=Content-Type>Content Type metadata</a>, if any,
+          <ol><li><p>If the resource's <a href=#content-type title=Content-Type>Content Type metadata</a>, if any,
            specifies a character encoding, and the user agent supports that encoding, then let <var title="">character encoding</var> be that encoding, and jump to the bottom step in this
            series of steps.</li>
 
@@ -14685,10 +14658,21 @@
            <li><p>Let <var title="">character encoding</var> be <var><a href="#the-script-block's-fallback-character-encoding">the script block's fallback
            character encoding</a></var>.</li>
 
-           <li><p>Convert the file to Unicode using <var>character encoding</var>, following the
-           rules for doing so given by the specification for <var><a href="#the-script-block's-type">the script block's
-           type</a></var>.</li>
+           <li>
 
+            <p>If the specification for <var><a href="#the-script-block's-type">the script block's type</a></var> gives specific rules for
+            decoding files in that format to Unicode, follow them, using <var>character
+            encoding</var> as the character encoding specified by higher-level protocols, if
+            necessary.</p> <!-- e.g. XML -->
+
+            <p>Otherwise, <a href=#decode>decode</a> the file to Unicode, using <var>character
+            encoding</var> as the fallback encoding.</p>
+
+            <p class=note>The <a href=#decode>decode</a> algorithm overrides <var>character
+            encoding</var> if the file contains a BOM.</p>
+
+           </li>
+
           </ol></dd>
 
          <dt>If the script is from an external file and <var><a href="#the-script-block's-type">the script block's type</a></var> is an
@@ -68758,12 +68742,18 @@
   <p>When a user agent is to <dfn id=parse-a-manifest>parse a manifest</dfn>, it means that the user agent must run the
   following steps:</p>
 
-  <ol><li><p>Decode the byte stream corresponding with the manifest to be parsed <a href=#decoded-as-utf-8,-with-error-handling title="decoded
-   as UTF-8, with error handling">as UTF-8, with error handling</a>. <!--All U+0000 NULL
-   characters must be replaced by U+FFFD REPLACEMENT CHARACTERs. (this isn't black-box testable
-   since neither U+0000 nor U+FFFD are valid anywhere in the syntax and thus both will be treated
-   the same anyway)--></li>
+  <ol><li>
 
+    <p><a href=#utf-8-decode>UTF-8 decode</a> the byte stream corresponding with the manifest to be parsed.</p>
+
+    <p class=note>The <a href=#utf-8-decode>UTF-8 decode</a> algorithm strips a leading BOM, if any.</p>
+
+    <!--All U+0000 NULL characters must be replaced by U+FFFD REPLACEMENT CHARACTERs. (this isn't
+    black-box testable since neither U+0000 nor U+FFFD are valid anywhere in the syntax and thus
+    both will be treated the same anyway)-->
+
+   </li>
+
    <li><p>Let <var title="">base URL</var> be the <a href=#absolute-url>absolute URL</a> representing the
    manifest.</li>
 
@@ -68792,9 +68782,6 @@
    <li><p>Let <var title="">position</var> be a pointer into <var title="">input</var>, initially
    pointing at the first character.</li>
 
-   <li><p>If <var title="">position</var> is pointing at a U+FEFF BYTE ORDER MARK (BOM) character,
-   then advance <var title="">position</var> to the next character.</li>
-
    <li><p>If the characters starting from <var title="">position</var> are "CACHE", followed by a
    U+0020 SPACE character, followed by "MANIFEST", then advance <var title="">position</var> to the
    next character after those. Otherwise, this isn't a cache manifest; abort this algorithm with a
@@ -78794,9 +78781,8 @@
     a simple event</a> named <code title=event-error>error</code> at that object. Abort these
     steps.</p>
 
-    <p>If the attempt succeeds, then let <var title="">source</var> be the script resource
-    <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error handling</a>.
-    </p>
+    <p>If the attempt succeeds, then let <var title="">source</var> be the result of running the
+    <a href=#utf-8-decode>UTF-8 decode</a> algorithm on the script resource.</p>
 
     <p>Let <var title="">language</var> be JavaScript.</p>
 
@@ -79479,10 +79465,8 @@
       <code><a href=#networkerror>NetworkError</a></code> exception and abort all these
       steps.</p>
 
-      <p>If the attempt succeeds, then let <var title="">source</var> be
-      the script resource <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error
-      handling</a>.
-      </p>
+      <p>If the attempt succeeds, then let <var title="">source</var> be the result of running the
+      <a href=#utf-8-decode>UTF-8 decode</a> algorithm on the script resource.</p>
 
       <p>Let <var title="">language</var> be JavaScript.</p>
 
@@ -80101,11 +80085,10 @@
 
   <h4 id=event-stream-interpretation><span class=secno>10.2.5 </span>Interpreting an event stream</h4>
 
-  <p>Streams must be <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error
-  handling</a>.
-  </p>
+  <p>Streams must be decoded using the <a href=#utf-8-decode>UTF-8 decode</a> algorithm.</p>
 
-  <p>One leading U+FEFF BYTE ORDER MARK character must be ignored if any are present.</p>
+  <p class=note>The <a href=#utf-8-decode>UTF-8 decode</a> algorithm strips one leading UTF-8 Byte Order Mark
+  (BOM), if any.</p>
 
   <p>The stream must then be parsed by reading everything line by line, with a U+000D CARRIAGE
   RETURN U+000A LINE FEED (CRLF) character pair, a single U+000A LINE FEED (LF) character not
@@ -81115,9 +81098,9 @@
    action, whose <code title=dom-CloseEvent-wasClean><a href=#dom-closeevent-wasclean>wasClean</a></code> attribute is initialized to
    true if the connection closed <i title="">cleanly</i> and false otherwise, whose <code title=dom-CloseEvent-code><a href=#dom-closeevent-code>code</a></code> attribute is initialized to <i><a href=#the-websocket-connection-close-code>the WebSocket connection
    close code</a></i>, and whose <code title=dom-CloseEvent-reason><a href=#dom-closeevent-reason>reason</a></code> attribute is
-   initialized to <i><a href=#the-websocket-connection-close-reason>the WebSocket connection close reason</a></i> <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error
-   handling</a>, and <a href=#concept-event-dispatch title=concept-event-dispatch>dispatch</a> the event at the
-   <code><a href=#websocket>WebSocket</a></code> object. <a href=#refsWSP>[WSP]</a></li>
+   initialized to the result of applying the <a href=#utf-8-decoder>UTF-8 decoder</a> to <i><a href=#the-websocket-connection-close-reason>the WebSocket
+   connection close reason</a></i>, and <a href=#concept-event-dispatch title=concept-event-dispatch>dispatch</a> the event
+   at the <code><a href=#websocket>WebSocket</a></code> object. <a href=#refsWSP>[WSP]</a></li>
 
   </ol><div class=warning>
 
@@ -84062,6 +84045,7 @@
 
   <h4 id=the-input-byte-stream><span class=secno>12.2.2 </span>The <dfn>input byte stream</dfn></h4>
 
+<!--CLEANUP-->
   <p>The stream of Unicode code points that comprises the input to the
   tokenization stage will be initially seen by the user agent as a
   stream of bytes (typically coming over the network or from the local
@@ -84079,25 +84063,22 @@
   <p>Given a character encoding, the bytes in the <a href=#the-input-byte-stream>input byte
   stream</a> must be converted to Unicode code points for the
   tokenizer's <a href=#input-stream>input stream</a>, as described by the rules for
-  that encoding, except that the leading U+FEFF BYTE ORDER MARK
-  character, if any, must not be stripped by the encoding layer (it is
-  stripped by the rule below).</p> <!-- this is to prevent two leading
-  BOMs from being both stripped, once by the decoder, and once by the
-  parser -->
+  that encoding's <a href=#decoder>decoder</a>.</p>
 
-  <p>Bytes or sequences of bytes in the original byte stream that
-  could not be converted to Unicode code points must be converted to
-  U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
-  UTF-8, the bytes must be <a href=#decoded-as-utf-8,-with-error-handling title="decoded as UTF-8, with error
-  handling">decoded with the error handling</a> defined in this
-  specification.</p>
-
   <p class=note>Bytes or sequences of bytes in the original byte
   stream that did not conform to the encoding specification (e.g.
   invalid UTF-8 byte sequences in a UTF-8 input byte stream) are
   errors that conformance checkers are expected to report.</p>
 
+  <p class=note>Leading Byte Order Marks (BOMs) are not stripped by the decoder algorithms, they
+  are stripped by the algorithm below.</p>
 
+  <p class=warning>The decoder algorithms describe how to handle invalid input; for security
+  reasons, it is imperative that those rules be followed precisely. Differences in how invalid byte
+  sequences are handled can result in, amongst other problems, script injection vulnerabilities
+  ("XSS").</p>
+
+
   <h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</h5>
 
   <p>In some cases, it might be impractical to unambiguously determine the encoding before parsing
@@ -84688,8 +84669,8 @@
   UTF-32 in its algorithms; support and use of these encodings can thus lead to unexpected behavior
   in implementations of this specification.</p>
 
-  <p>When a user agent is to use the self-describing UTF-16 encoding but no BOM has been found, user
-  agents must default to little-endian UTF-16.</p>
+  <p>When a user agent is to use the self-describing UTF-16 encoding but no Byte Order Mark (BOM)
+  has been found, user agents must default to little-endian UTF-16.</p>
 
   <p class=note>The requirement to default UTF-16 to little-endian rather than big-endian is a
   <a href=#willful-violation>willful violation</a> of RFC 2781, motivated by a desire for compatibility with legacy

Modified: source
===================================================================
--- source	2013-03-29 18:13:03 UTC (rev 7781)
+++ source	2013-03-29 18:45:27 UTC (rev 7782)
@@ -1856,11 +1856,8 @@
   terms of 16 bit (UTF-16) <span title="code unit">code units</span>, rather than in terms of <span
   title="Unicode character">Unicode characters</span>.</p>
 
-  <p>When a byte stream is to be <dfn>decoded as UTF-8, with error handling</dfn>, the user agent
-  must return the result of running the <span>utf-8 decoder</span> on that byte stream.</p>
 
 
-
 <!--END dev-html-->
 
   <h3>Conformance requirements</h3>
@@ -2189,12 +2186,19 @@
      <li><dfn>Getting an encoding</dfn>
 
      <li>The <dfn>encoder</dfn> and <dfn>decoder</dfn> algorithms for various encodings, including
-     the <dfn>utf-8 encoder</dfn> and <dfn>utf-8 decoder</dfn>
+     the <dfn>UTF-8 encoder</dfn> and <dfn>UTF-8 decoder</dfn>
 
+     <li>The generic <dfn>decode</dfn> algorithm which takes a byte stream and an encoding and
+     returns a character stream
+
+     <li>The <dfn>UTF-8 decode</dfn> algorithm which takes a byte stream and returns a character
+     stream, additionally stripping one leading UTF-8 Byte Order Mark (BOM), if any
+
     </ul>
 
-    <p class="note">The <span>utf-8 decoder</span> is distinct from the <i>utf-8 decode
-    algorithm</i>. The latter is not used by this specification.</p>
+    <p class="note">The <span>UTF-8 decoder</span> is distinct from the <i>UTF-8 decode
+    algorithm</i>. The latter first strips a Byte Order Mark (BOM), if any, and then invokes the
+    former.</p>
 
    </dd>
 
@@ -8172,7 +8176,7 @@
   <code>Document</code>'s <span>origin</span> is not a scheme/host/port tuple, the user agent must
   throw a <code>SecurityError</code> exception. Otherwise, the user agent must first <span>obtain
   the storage mutex</span> and then return the cookie-string for <span>the document's address</span>
-  for a "non-HTTP" API, <span>decoded as UTF-8, with error handling</span>. <a
+  for a "non-HTTP" API, decoded using the <span>UTF-8 decoder</span>. <a
   href="#refsCOOKIES">[COOKIES]</a>
   <!--INSERT FINGERPRINT-->
   </p>
@@ -15219,47 +15223,6 @@
 
           <ol>
 
-           <li><p>For each of the rows in the following table, starting with the first one and going
-           down, if the file has as many or more bytes available than the number of bytes in the
-           first column, and the first bytes of the file match the bytes given in the first column,
-           then set <var title="">character encoding</var> to the encoding given in the cell in the
-           second column of that row, and jump to the bottom step in this series of steps:</p>
-
-            <!-- this table is present in several forms in this file; keep them in sync -->
-            <table id="table-script-bom">
-             <thead>
-              <tr>
-               <th>Bytes in Hexadecimal
-               <th>Encoding
-             <tbody>
-    <!-- nobody uses this
-              <tr>
-               <td>00 00 FE FF
-               <td>UTF-32BE
-              <tr>
-               <td>FF FE 00 00
-               <td>UTF-32LE
-    -->
-              <tr>
-               <td>FE FF
-               <td>Big-endian UTF-16
-              <tr>
-               <td>FF FE
-               <td>Little-endian UTF-16
-              <tr>
-               <td>EF BB BF
-               <td>UTF-8
-    <!-- nobody uses this
-              <tr>
-               <td>DD 73 66 73
-               <td>UTF-EBCDIC
-    -->
-            </table>
-
-            <p class="note">This step looks for Unicode Byte Order Marks (BOMs).</p>
-
-           </li>
-
            <li><p>If the resource's <span title="Content-Type">Content Type metadata</span>, if any,
            specifies a character encoding, and the user agent supports that encoding, then let <var
            title="">character encoding</var> be that encoding, and jump to the bottom step in this
@@ -15272,10 +15235,21 @@
            <li><p>Let <var title="">character encoding</var> be <var>the script block's fallback
            character encoding</var>.</p></li>
 
-           <li><p>Convert the file to Unicode using <var>character encoding</var>, following the
-           rules for doing so given by the specification for <var>the script block's
-           type</var>.</p></li>
+           <li>
 
+            <p>If the specification for <var>the script block's type</var> gives specific rules for
+            decoding files in that format to Unicode, follow them, using <var>character
+            encoding</var> as the character encoding specified by higher-level protocols, if
+            necessary.</p> <!-- e.g. XML -->
+
+            <p>Otherwise, <span>decode</span> the file to Unicode, using <var>character
+            encoding</var> as the fallback encoding.</p>
+
+            <p class="note">The <span>decode</span> algorithm overrides <var>character
+            encoding</var> if the file contains a BOM.</p>
+
+           </li>
+
           </ol>
 
          </dd>
@@ -81672,12 +81646,18 @@
 
   <ol>
 
-   <li><p>Decode the byte stream corresponding with the manifest to be parsed <span title="decoded
-   as UTF-8, with error handling">as UTF-8, with error handling</span>. <!--All U+0000 NULL
-   characters must be replaced by U+FFFD REPLACEMENT CHARACTERs. (this isn't black-box testable
-   since neither U+0000 nor U+FFFD are valid anywhere in the syntax and thus both will be treated
-   the same anyway)--></p></li>
+   <li>
 
+    <p><span>UTF-8 decode</span> the byte stream corresponding with the manifest to be parsed.</p>
+
+    <p class="note">The <span>UTF-8 decode</span> algorithm strips a leading BOM, if any.</p>
+
+    <!--All U+0000 NULL characters must be replaced by U+FFFD REPLACEMENT CHARACTERs. (this isn't
+    black-box testable since neither U+0000 nor U+FFFD are valid anywhere in the syntax and thus
+    both will be treated the same anyway)-->
+
+   </li>
+
    <li><p>Let <var title="">base URL</var> be the <span>absolute URL</span> representing the
    manifest.</p></li>
 
@@ -81709,9 +81689,6 @@
    <li><p>Let <var title="">position</var> be a pointer into <var title="">input</var>, initially
    pointing at the first character.</p></li>
 
-   <li><p>If <var title="">position</var> is pointing at a U+FEFF BYTE ORDER MARK (BOM) character,
-   then advance <var title="">position</var> to the next character.</p></li>
-
    <li><p>If the characters starting from <var title="">position</var> are "CACHE", followed by a
    U+0020 SPACE character, followed by "MANIFEST", then advance <var title="">position</var> to the
    next character after those. Otherwise, this isn't a cache manifest; abort this algorithm with a
@@ -92603,12 +92580,8 @@
     a simple event</span> named <code title="event-error">error</code> at that object. Abort these
     steps.</p>
 
-    <p>If the attempt succeeds, then let <var title="">source</var> be the script resource
-    <span>decoded as UTF-8, with error handling</span>.
-    <!--END complete-->
-    <a href="#refsHTML">[HTML]</a>
-    <!--START complete-->
-    </p>
+    <p>If the attempt succeeds, then let <var title="">source</var> be the result of running the
+    <span>UTF-8 decode</span> algorithm on the script resource.</p>
 
     <p>Let <var title="">language</var> be JavaScript.</p>
 
@@ -93409,13 +93382,8 @@
       <code>NetworkError</code> exception and abort all these
       steps.</p>
 
-      <p>If the attempt succeeds, then let <var title="">source</var> be
-      the script resource <span>decoded as UTF-8, with error
-      handling</span>.
-      <!--END complete-->
-      <a href="#refsHTML">[HTML]</a>
-      <!--START complete-->
-      </p>
+      <p>If the attempt succeeds, then let <var title="">source</var> be the result of running the
+      <span>UTF-8 decode</span> algorithm on the script resource.</p>
 
       <p>Let <var title="">language</var> be JavaScript.</p>
 
@@ -94148,14 +94116,10 @@
 
   <h4 id="event-stream-interpretation">Interpreting an event stream</h4>
 
-  <p>Streams must be <span>decoded as UTF-8, with error
-  handling</span>.
-  <!--END complete-->
-  <a href="#refsHTML">[HTML]</a>
-  <!--START complete-->
-  </p>
+  <p>Streams must be decoded using the <span>UTF-8 decode</span> algorithm.</p>
 
-  <p>One leading U+FEFF BYTE ORDER MARK character must be ignored if any are present.</p>
+  <p class="note">The <span>UTF-8 decode</span> algorithm strips one leading UTF-8 Byte Order Mark
+  (BOM), if any.</p>
 
   <p>The stream must then be parsed by reading everything line by line, with a U+000D CARRIAGE
   RETURN U+000A LINE FEED (CRLF) character pair, a single U+000A LINE FEED (LF) character not
@@ -95353,9 +95317,9 @@
    true if the connection closed <i title="">cleanly</i> and false otherwise, whose <code
    title="dom-CloseEvent-code">code</code> attribute is initialized to <i>the WebSocket connection
    close code</i>, and whose <code title="dom-CloseEvent-reason">reason</code> attribute is
-   initialized to <i>the WebSocket connection close reason</i> <span>decoded as UTF-8, with error
-   handling</span>, and <span title="concept-event-dispatch">dispatch</span> the event at the
-   <code>WebSocket</code> object. <a href="#refsWSP">[WSP]</a></p></li>
+   initialized to the result of applying the <span>UTF-8 decoder</span> to <i>the WebSocket
+   connection close reason</i>, and <span title="concept-event-dispatch">dispatch</span> the event
+   at the <code>WebSocket</code> object. <a href="#refsWSP">[WSP]</a></p></li>
 
   </ol>
 
@@ -98691,6 +98655,7 @@
 
   <h4>The <dfn>input byte stream</dfn></h4>
 
+<!--CLEANUP-->
   <p>The stream of Unicode code points that comprises the input to the
   tokenization stage will be initially seen by the user agent as a
   stream of bytes (typically coming over the network or from the local
@@ -98709,25 +98674,22 @@
   <p>Given a character encoding, the bytes in the <span>input byte
   stream</span> must be converted to Unicode code points for the
   tokenizer's <span>input stream</span>, as described by the rules for
-  that encoding, except that the leading U+FEFF BYTE ORDER MARK
-  character, if any, must not be stripped by the encoding layer (it is
-  stripped by the rule below).</p> <!-- this is to prevent two leading
-  BOMs from being both stripped, once by the decoder, and once by the
-  parser -->
+  that encoding's <span>decoder</span>.</p>
 
-  <p>Bytes or sequences of bytes in the original byte stream that
-  could not be converted to Unicode code points must be converted to
-  U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
-  UTF-8, the bytes must be <span title="decoded as UTF-8, with error
-  handling">decoded with the error handling</span> defined in this
-  specification.</p>
-
   <p class="note">Bytes or sequences of bytes in the original byte
   stream that did not conform to the encoding specification (e.g.
   invalid UTF-8 byte sequences in a UTF-8 input byte stream) are
   errors that conformance checkers are expected to report.</p>
 
+  <p class="note">Leading Byte Order Marks (BOMs) are not stripped by the decoder algorithms, they
+  are stripped by the algorithm below.</p>
 
+  <p class="warning">The decoder algorithms describe how to handle invalid input; for security
+  reasons, it is imperative that those rules be followed precisely. Differences in how invalid byte
+  sequences are handled can result in, amongst other problems, script injection vulnerabilities
+  ("XSS").</p>
+
+
   <h5>Determining the character encoding</h5>
 
   <p>In some cases, it might be impractical to unambiguously determine the encoding before parsing
@@ -99452,8 +99414,8 @@
   UTF-32 in its algorithms; support and use of these encodings can thus lead to unexpected behavior
   in implementations of this specification.</p>
 
-  <p>When a user agent is to use the self-describing UTF-16 encoding but no BOM has been found, user
-  agents must default to little-endian UTF-16.</p>
+  <p>When a user agent is to use the self-describing UTF-16 encoding but no Byte Order Mark (BOM)
+  has been found, user agents must default to little-endian UTF-16.</p>
 
   <p class="note">The requirement to default UTF-16 to little-endian rather than big-endian is a
   <span>willful violation</span> of RFC 2781, motivated by a desire for compatibility with legacy
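
Illustration (not part of the commit): the diff above distinguishes three operations, namely the plain "UTF-8 decoder", the "UTF-8 decode" algorithm that strips one leading BOM before invoking the decoder, and the generic "decode" algorithm in which a BOM overrides the fallback encoding. The following is a minimal Python sketch of that distinction. The function names are hypothetical, and Python's built-in codecs with errors="replace" are assumed to be a close enough stand-in for the Encoding Standard's decoders; they are not guaranteed to match it byte for byte on malformed input.

```python
import codecs

# BOMs recognised by the generic "decode" sketch. None of these is a
# prefix of another, so the order of the checks does not matter.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def utf8_decoder(data: bytes) -> str:
    """Plain decoder: no BOM handling; invalid bytes become U+FFFD."""
    return data.decode("utf-8", errors="replace")


def utf8_decode(data: bytes) -> str:
    """Strip one leading UTF-8 BOM, if any, then run the decoder."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return utf8_decoder(data)


def decode(data: bytes, fallback: str) -> str:
    """Generic decode: a leading BOM overrides the fallback encoding."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(name, errors="replace")
    return data.decode(fallback, errors="replace")


# A worker script served with a leading BOM: the decoder keeps U+FEFF,
# whereas "UTF-8 decode" removes exactly one.
script = codecs.BOM_UTF8 + b"postMessage('hi');"
print(repr(utf8_decoder(script)[:1]))
print(repr(utf8_decode(script)[:1]))
```

Under these assumptions, the worker, cache manifest and event stream call sites in the diff correspond to utf8_decode, while document.cookie and the WebSocket close reason correspond to utf8_decoder, which leaves any leading U+FEFF in place.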



