[html5] r8073 - [e] (0) Provide a hook for XHR and web components to incrementally decode with a [...]

Fri Jul 19 11:35:43 PDT 2013

Author: ianh
Date: 2013-07-19 11:35:41 -0700 (Fri, 19 Jul 2013)
New Revision: 8073

Modified:
   complete.html
   index
   source
Log:
[e] (0) Provide a hook for XHR and web components to incrementally decode with a known encoding
Affected topics: HTML Syntax and Parsing
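
Illustration of the log message above: a minimal sketch of how a caller such as XHR might feed the parser bytes in a known definite encoding and decode them incrementally. It assumes Python's codecs module as a stand-in decoder; the spec relies on the decoders of the WHATWG Encoding standard, which can differ in labels and error handling. The names EncodingState and decode_with_known_encoding are hypothetical and do not appear in the spec.

```python
import codecs
from dataclasses import dataclass


@dataclass
class EncodingState:
    """Hypothetical holder for the parser's encoding and confidence."""
    encoding: str
    confidence: str  # "tentative", "certain", or "irrelevant"


def decode_with_known_encoding(chunks, encoding):
    """Decode byte chunks as they arrive, using a known definite encoding.

    Because the encoding is known up front, the confidence is "certain",
    so no sniffing step runs and the parser would never change the encoding.
    """
    state = EncodingState(encoding=encoding, confidence="certain")
    # Assumption: Python's incremental decoder stands in for the real one.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    pieces = []
    for chunk in chunks:
        # A multi-byte sequence split across chunks is buffered by the
        # decoder until the remaining bytes arrive.
        pieces.append(decoder.decode(chunk))
    # Flush any trailing incomplete sequence at end of stream.
    pieces.append(decoder.decode(b"", final=True))
    return state, "".join(pieces)
```

For example, a caller could pass the chunks b"caf\xc3" and b"\xa9 ok" with the encoding "utf-8"; the two bytes of the split character are joined across the chunk boundary rather than decoded separately.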

Modified: complete.html
===================================================================

--- complete.html	2013-07-18 17:19:12 UTC (rev 8072)
+++ complete.html	2013-07-19 18:35:41 UTC (rev 8073)
@@ -256,7 +256,7 @@
 
   <header class=head id=head><p><a href=http://www.whatwg.org/ class=logo><img width=101 src=/images/logo alt=WHATWG height=101></a></p>
    <hgroup><h1 class=allcaps>HTML</h1>
-    <h2 class="no-num no-toc">Living Standard — Last Updated 18 July 2013</h2>
+    <h2 class="no-num no-toc">Living Standard — Last Updated 19 July 2013</h2>
    </hgroup><dl><dt><strong>Web developer edition:</strong></dt>
     <dd><strong><a href=http://developers.whatwg.org/>http://developers.whatwg.org/</a></strong></dd>
     <dt>Multiple-page version:</dt>
@@ -1180,10 +1180,11 @@
      <li><a href=#overview-of-the-parsing-model><span class=secno>12.2.1 </span>Overview of the parsing model</a></li>
      <li><a href=#the-input-byte-stream><span class=secno>12.2.2 </span>The input byte stream</a>
       <ol>
-       <li><a href=#determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</a></li>
-       <li><a href=#character-encodings><span class=secno>12.2.2.2 </span>Character encodings</a></li>
-       <li><a href=#changing-the-encoding-while-parsing><span class=secno>12.2.2.3 </span>Changing the encoding while parsing</a></li>
-       <li><a href=#preprocessing-the-input-stream><span class=secno>12.2.2.4 </span>Preprocessing the input stream</a></ol></li>
+       <li><a href=#parsing-with-a-known-character-encoding><span class=secno>12.2.2.1 </span>Parsing with a known character encoding</a></li>
+       <li><a href=#determining-the-character-encoding><span class=secno>12.2.2.2 </span>Determining the character encoding</a></li>
+       <li><a href=#character-encodings><span class=secno>12.2.2.3 </span>Character encodings</a></li>
+       <li><a href=#changing-the-encoding-while-parsing><span class=secno>12.2.2.4 </span>Changing the encoding while parsing</a></li>
+       <li><a href=#preprocessing-the-input-stream><span class=secno>12.2.2.5 </span>Preprocessing the input stream</a></ol></li>
      <li><a href=#parse-state><span class=secno>12.2.3 </span>Parse state</a>
       <ol>
        <li><a href=#the-insertion-mode><span class=secno>12.2.3.1 </span>The insertion mode</a></li>
@@ -85813,14 +85814,14 @@
   tokenization stage will be initially seen by the user agent as a
   stream of bytes (typically coming over the network or from the local
   file system). The bytes encode the actual characters according to a
-  particular <i>character encoding</i>, which the user agent must use
+  particular <i>character encoding</i>, which the user agent uses
   to decode the bytes into characters.</p>
 
   <p class=note>For XML documents, the algorithm user agents must
   use to determine the character encoding is given by the XML
   specification. This section does not apply to XML documents. <a href=#refsXML>[XML]</a></p>
 
-  <p>The <a href=#encoding-sniffing-algorithm>encoding sniffing algorithm</a> defined below is
+  <p>Usually, the <a href=#encoding-sniffing-algorithm>encoding sniffing algorithm</a> defined below is
   used to determine the character encoding.</p>
 
   <p>Given a character encoding, the bytes in the <a href=#the-input-byte-stream>input byte
@@ -85841,9 +85842,26 @@
   sequences are handled can result in, amongst other problems, script injection vulnerabilities
   ("XSS").</p>
 
+  <p>When the HTML parser is decoding an input byte stream, it uses a character encoding and a <dfn id=concept-encoding-confidence title=concept-encoding-confidence>confidence</dfn>. The confidence is either <i>tentative</i>,
+  <i>certain</i>, or <i>irrelevant</i>. The encoding used, and whether the confidence in that
+  encoding is <i>tentative</i> or <i>certain</i>, is <a href=#meta-charset-during-parse>used
+  during the parsing</a> to determine whether to <a href=#change-the-encoding>change the encoding</a>. If no encoding is
+  necessary, e.g. because the parser is operating on a Unicode stream and doesn't have to use a
+  character encoding at all, then the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is
+  <i>irrelevant</i>.</p>
 
-  <h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</h5>
+  <p class=note>Some algorithms feed the parser by directly adding characters to the <a href=#input-stream>input
+  stream</a> rather than adding bytes to the <a href=#the-input-byte-stream>input byte stream</a>.</p>
 
+
+  <h5 id=parsing-with-a-known-character-encoding><span class=secno>12.2.2.1 </span>Parsing with a known character encoding</h5>
+
+  <p>When the HTML parser is to operate on an input byte stream that has <dfn id=a-known-definite-encoding>a known definite
+  encoding</dfn>, then the character encoding is that encoding and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is <i>certain</i>.</p>
+
+
+  <h5 id=determining-the-character-encoding><span class=secno>12.2.2.2 </span>Determining the character encoding</h5>
+
   <p>In some cases, it might be impractical to unambiguously determine the encoding before parsing
   the document. Because of this, this specification provides for a two-pass mechanism with an
   optional pre-scan. Implementations are allowed, as described below, to apply a simplified parsing
@@ -85857,13 +85875,8 @@
   sniffing algorithm</dfn>, to determine the character encoding to use when decoding a document in
   the first pass. This algorithm takes as input any out-of-band metadata available to the user agent
   (e.g. the <a href=#content-type title=Content-Type>Content-Type metadata</a> of the document) and all the
-  bytes available so far, and returns a character encoding and a <dfn id=concept-encoding-confidence title=concept-encoding-confidence>confidence</dfn>. The confidence is either <i>tentative</i>,
-  <i>certain</i>, or <i>irrelevant</i>. The encoding used, and whether the confidence in that
-  encoding is <i>tentative</i> or <i>certain</i>, is <a href=#meta-charset-during-parse>used
-  during the parsing</a> to determine whether to <a href=#change-the-encoding>change the encoding</a>. If no encoding is
-  necessary, e.g. because the parser is operating on a Unicode stream and doesn't have to use a
-  character encoding at all, then the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is
-  <i>irrelevant</i>.</p>
+  bytes available so far, and returns a character encoding and a <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> that is either <i>tentative</i> or
+  <i>certain</i>.</p>
 
   <ol><li>
 
@@ -86649,7 +86662,7 @@
 
 
 
-  <h5 id=character-encodings><span class=secno>12.2.2.2 </span>Character encodings</h5>
+  <h5 id=character-encodings><span class=secno>12.2.2.3 </span>Character encodings</h5>
 
   <p>User agents must support the encodings defined in the WHATWG Encoding standard. User agents
   should not support other encodings.</p>
@@ -86672,7 +86685,7 @@
   content. <a href=#refsRFC2781>[RFC2781]</a></p>
 
 
-  <h5 id=changing-the-encoding-while-parsing><span class=secno>12.2.2.3 </span>Changing the encoding while parsing</h5>
+  <h5 id=changing-the-encoding-while-parsing><span class=secno>12.2.2.4 </span>Changing the encoding while parsing</h5>
 
   <p>When the parser requires the user agent to <dfn id=change-the-encoding>change the encoding</dfn>, it must run the
   following steps. This might happen if the <a href=#encoding-sniffing-algorithm>encoding sniffing algorithm</a> described above
@@ -86723,7 +86736,7 @@
    misinterpreted. User agents may notify the user of the situation,
    to aid in application development.</li>
 
-  </ol><h5 id=preprocessing-the-input-stream><span class=secno>12.2.2.4 </span>Preprocessing the input stream</h5>
+  </ol><h5 id=preprocessing-the-input-stream><span class=secno>12.2.2.5 </span>Preprocessing the input stream</h5>
 
   <p>The <dfn id=input-stream>input stream</dfn> consists of the characters pushed
   into it as the <a href=#the-input-byte-stream>input byte stream</a> is decoded or from the

Modified: index
===================================================================
--- index	2013-07-18 17:19:12 UTC (rev 8072)
+++ index	2013-07-19 18:35:41 UTC (rev 8073)
@@ -256,7 +256,7 @@
 
   <header class=head id=head><p><a href=http://www.whatwg.org/ class=logo><img width=101 src=/images/logo alt=WHATWG height=101></a></p>
    <hgroup><h1 class=allcaps>HTML</h1>
-    <h2 class="no-num no-toc">Living Standard — Last Updated 18 July 2013</h2>
+    <h2 class="no-num no-toc">Living Standard — Last Updated 19 July 2013</h2>
    </hgroup><dl><dt><strong>Web developer edition:</strong></dt>
     <dd><strong><a href=http://developers.whatwg.org/>http://developers.whatwg.org/</a></strong></dd>
     <dt>Multiple-page version:</dt>
@@ -1180,10 +1180,11 @@
      <li><a href=#overview-of-the-parsing-model><span class=secno>12.2.1 </span>Overview of the parsing model</a></li>
      <li><a href=#the-input-byte-stream><span class=secno>12.2.2 </span>The input byte stream</a>
       <ol>
-       <li><a href=#determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</a></li>
-       <li><a href=#character-encodings><span class=secno>12.2.2.2 </span>Character encodings</a></li>
-       <li><a href=#changing-the-encoding-while-parsing><span class=secno>12.2.2.3 </span>Changing the encoding while parsing</a></li>
-       <li><a href=#preprocessing-the-input-stream><span class=secno>12.2.2.4 </span>Preprocessing the input stream</a></ol></li>
+       <li><a href=#parsing-with-a-known-character-encoding><span class=secno>12.2.2.1 </span>Parsing with a known character encoding</a></li>
+       <li><a href=#determining-the-character-encoding><span class=secno>12.2.2.2 </span>Determining the character encoding</a></li>
+       <li><a href=#character-encodings><span class=secno>12.2.2.3 </span>Character encodings</a></li>
+       <li><a href=#changing-the-encoding-while-parsing><span class=secno>12.2.2.4 </span>Changing the encoding while parsing</a></li>
+       <li><a href=#preprocessing-the-input-stream><span class=secno>12.2.2.5 </span>Preprocessing the input stream</a></ol></li>
      <li><a href=#parse-state><span class=secno>12.2.3 </span>Parse state</a>
       <ol>
        <li><a href=#the-insertion-mode><span class=secno>12.2.3.1 </span>The insertion mode</a></li>
@@ -85813,14 +85814,14 @@
   tokenization stage will be initially seen by the user agent as a
   stream of bytes (typically coming over the network or from the local
   file system). The bytes encode the actual characters according to a
-  particular <i>character encoding</i>, which the user agent must use
+  particular <i>character encoding</i>, which the user agent uses
   to decode the bytes into characters.</p>
 
   <p class=note>For XML documents, the algorithm user agents must
   use to determine the character encoding is given by the XML
   specification. This section does not apply to XML documents. <a href=#refsXML>[XML]</a></p>
 
-  <p>The <a href=#encoding-sniffing-algorithm>encoding sniffing algorithm</a> defined below is
+  <p>Usually, the <a href=#encoding-sniffing-algorithm>encoding sniffing algorithm</a> defined below is
   used to determine the character encoding.</p>
 
   <p>Given a character encoding, the bytes in the <a href=#the-input-byte-stream>input byte
@@ -85841,9 +85842,26 @@
   sequences are handled can result in, amongst other problems, script injection vulnerabilities
   ("XSS").</p>
 
+  <p>When the HTML parser is decoding an input byte stream, it uses a character encoding and a <dfn id=concept-encoding-confidence title=concept-encoding-confidence>confidence</dfn>. The confidence is either <i>tentative</i>,
+  <i>certain</i>, or <i>irrelevant</i>. The encoding used, and whether the confidence in that
+  encoding is <i>tentative</i> or <i>certain</i>, is <a href=#meta-charset-during-parse>used
+  during the parsing</a> to determine whether to <a href=#change-the-encoding>change the encoding</a>. If no encoding is
+  necessary, e.g. because the parser is operating on a Unicode stream and doesn't have to use a
+  character encoding at all, then the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is
+  <i>irrelevant</i>.</p>
 
-  <h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</h5>
+  <p class=note>Some algorithms feed the parser by directly adding characters to the <a href=#input-stream>input
+  stream</a> rather than adding bytes to the <a href=#the-input-byte-stream>input byte stream</a>.</p>
 
+
+  <h5 id=parsing-with-a-known-character-encoding><span class=secno>12.2.2.1 </span>Parsing with a known character encoding</h5>
+
+  <p>When the HTML parser is to operate on an input byte stream that has <dfn id=a-known-definite-encoding>a known definite
+  encoding</dfn>, then the character encoding is that encoding and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is <i>certain</i>.</p>
+
+
+  <h5 id=determining-the-character-encoding><span class=secno>12.2.2.2 </span>Determining the character encoding</h5>
+
   <p>In some cases, it might be impractical to unambiguously determine the encoding before parsing
   the document. Because of this, this specification provides for a two-pass mechanism with an
   optional pre-scan. Implementations are allowed, as described below, to apply a simplified parsing
@@ -85857,13 +85875,8 @@
   sniffing algorithm</dfn>, to determine the character encoding to use when decoding a document in
   the first pass. This algorithm takes as input any out-of-band metadata available to the user agent
   (e.g. the <a href=#content-type title=Content-Type>Content-Type metadata</a> of the document) and all the
-  bytes available so far, and returns a character encoding and a <dfn id=concept-encoding-confidence title=concept-encoding-confidence>confidence</dfn>. The confidence is either <i>tentative</i>,
-  <i>certain</i>, or <i>irrelevant</i>. The encoding used, and whether the confidence in that
-  encoding is <i>tentative</i> or <i>certain</i>, is <a href=#meta-charset-during-parse>used
-  during the parsing</a> to determine whether to <a href=#change-the-encoding>change the encoding</a>. If no encoding is
-  necessary, e.g. because the parser is operating on a Unicode stream and doesn't have to use a
-  character encoding at all, then the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is
-  <i>irrelevant</i>.</p>
+  bytes available so far, and returns a character encoding and a <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> that is either <i>tentative</i> or
+  <i>certain</i>.</p>
 
   <ol><li>
 
@@ -86649,7 +86662,7 @@
 
 
 
-  <h5 id=character-encodings><span class=secno>12.2.2.2 </span>Character encodings</h5>
+  <h5 id=character-encodings><span class=secno>12.2.2.3 </span>Character encodings</h5>
 
   <p>User agents must support the encodings defined in the WHATWG Encoding standard. User agents
   should not support other encodings.</p>
@@ -86672,7 +86685,7 @@
   content. <a href=#refsRFC2781>[RFC2781]</a></p>
 
 
-  <h5 id=changing-the-encoding-while-parsing><span class=secno>12.2.2.3 </span>Changing the encoding while parsing</h5>
+  <h5 id=changing-the-encoding-while-parsing><span class=secno>12.2.2.4 </span>Changing the encoding while parsing</h5>
 
   <p>When the parser requires the user agent to <dfn id=change-the-encoding>change the encoding</dfn>, it must run the
   following steps. This might happen if the <a href=#encoding-sniffing-algorithm>encoding sniffing algorithm</a> described above
@@ -86723,7 +86736,7 @@
    misinterpreted. User agents may notify the user of the situation,
    to aid in application development.</li>
 
-  </ol><h5 id=preprocessing-the-input-stream><span class=secno>12.2.2.4 </span>Preprocessing the input stream</h5>
+  </ol><h5 id=preprocessing-the-input-stream><span class=secno>12.2.2.5 </span>Preprocessing the input stream</h5>
 
   <p>The <dfn id=input-stream>input stream</dfn> consists of the characters pushed
   into it as the <a href=#the-input-byte-stream>input byte stream</a> is decoded or from the

Modified: source
===================================================================
--- source	2013-07-18 17:19:12 UTC (rev 8072)
+++ source	2013-07-19 18:35:41 UTC (rev 8073)
@@ -95751,7 +95751,7 @@
   tokenization stage will be initially seen by the user agent as a
   stream of bytes (typically coming over the network or from the local
   file system). The bytes encode the actual characters according to a
-  particular <i>character encoding</i>, which the user agent must use
+  particular <i>character encoding</i>, which the user agent uses
   to decode the bytes into characters.</p>
 
   <p class="note">For XML documents, the algorithm user agents must
@@ -95759,7 +95759,7 @@
   specification. This section does not apply to XML documents. <a
   href="#refsXML">[XML]</a></p>
 
-  <p>The <span>encoding sniffing algorithm</span> defined below is
+  <p>Usually, the <span>encoding sniffing algorithm</span> defined below is
   used to determine the character encoding.</p>
 
   <p>Given a character encoding, the bytes in the <span>input byte
@@ -95780,7 +95780,26 @@
   sequences are handled can result in, amongst other problems, script injection vulnerabilities
   ("XSS").</p>
 
+  <p>When the HTML parser is decoding an input byte stream, it uses a character encoding and a <dfn
+  title="concept-encoding-confidence">confidence</dfn>. The confidence is either <i>tentative</i>,
+  <i>certain</i>, or <i>irrelevant</i>. The encoding used, and whether the confidence in that
+  encoding is <i>tentative</i> or <i>certain</i>, is <a href="#meta-charset-during-parse">used
+  during the parsing</a> to determine whether to <span>change the encoding</span>. If no encoding is
+  necessary, e.g. because the parser is operating on a Unicode stream and doesn't have to use a
+  character encoding at all, then the <span title="concept-encoding-confidence">confidence</span> is
+  <i>irrelevant</i>.</p>
 
+  <p class="note">Some algorithms feed the parser by directly adding characters to the <span>input
+  stream</span> rather than adding bytes to the <span>input byte stream</span>.</p>
+
+
+  <h5>Parsing with a known character encoding</h5>
+
+  <p>When the HTML parser is to operate on an input byte stream that has <dfn>a known definite
+  encoding</dfn>, then the character encoding is that encoding and the <span
+  title="concept-encoding-confidence">confidence</span> is <i>certain</i>.</p>
+
+
   <h5>Determining the character encoding</h5>
 
   <p>In some cases, it might be impractical to unambiguously determine the encoding before parsing
@@ -95796,14 +95815,9 @@
   sniffing algorithm</dfn>, to determine the character encoding to use when decoding a document in
   the first pass. This algorithm takes as input any out-of-band metadata available to the user agent
   (e.g. the <span title="Content-Type">Content-Type metadata</span> of the document) and all the
-  bytes available so far, and returns a character encoding and a <dfn
-  title="concept-encoding-confidence">confidence</dfn>. The confidence is either <i>tentative</i>,
-  <i>certain</i>, or <i>irrelevant</i>. The encoding used, and whether the confidence in that
-  encoding is <i>tentative</i> or <i>certain</i>, is <a href="#meta-charset-during-parse">used
-  during the parsing</a> to determine whether to <span>change the encoding</span>. If no encoding is
-  necessary, e.g. because the parser is operating on a Unicode stream and doesn't have to use a
-  character encoding at all, then the <span title="concept-encoding-confidence">confidence</span> is
-  <i>irrelevant</i>.</p>
+  bytes available so far, and returns a character encoding and a <span
+  title="concept-encoding-confidence">confidence</span> that is either <i>tentative</i> or
+  <i>certain</i>.</p>
 
   <ol>