[html5] r8073 - [e] (0) Provide a hook for XHR and web components to incrementally decode with a [...]
whatwg at whatwg.org
Fri Jul 19 11:35:43 PDT 2013
Author: ianh
Date: 2013-07-19 11:35:41 -0700 (Fri, 19 Jul 2013)
New Revision: 8073
Modified:
complete.html
index
source
Log:
[e] (0) Provide a hook for XHR and web components to incrementally decode with a known encoding
Affected topics: HTML Syntax and Parsing
Modified: complete.html
===================================================================
--- complete.html 2013-07-18 17:19:12 UTC (rev 8072)
+++ complete.html 2013-07-19 18:35:41 UTC (rev 8073)
@@ -256,7 +256,7 @@
<header class=head id=head><p><a href=http://www.whatwg.org/ class=logo><img width=101 src=/images/logo alt=WHATWG height=101></a></p>
<hgroup><h1 class=allcaps>HTML</h1>
- <h2 class="no-num no-toc">Living Standard — Last Updated 18 July 2013</h2>
+ <h2 class="no-num no-toc">Living Standard — Last Updated 19 July 2013</h2>
</hgroup><dl><dt><strong>Web developer edition:</strong></dt>
<dd><strong><a href=http://developers.whatwg.org/>http://developers.whatwg.org/</a></strong></dd>
<dt>Multiple-page version:</dt>
@@ -1180,10 +1180,11 @@
<li><a href=#overview-of-the-parsing-model><span class=secno>12.2.1 </span>Overview of the parsing model</a></li>
<li><a href=#the-input-byte-stream><span class=secno>12.2.2 </span>The input byte stream</a>
<ol>
- <li><a href=#determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</a></li>
- <li><a href=#character-encodings><span class=secno>12.2.2.2 </span>Character encodings</a></li>
- <li><a href=#changing-the-encoding-while-parsing><span class=secno>12.2.2.3 </span>Changing the encoding while parsing</a></li>
- <li><a href=#preprocessing-the-input-stream><span class=secno>12.2.2.4 </span>Preprocessing the input stream</a></ol></li>
+ <li><a href=#parsing-with-a-known-character-encoding><span class=secno>12.2.2.1 </span>Parsing with a known character encoding</a></li>
+ <li><a href=#determining-the-character-encoding><span class=secno>12.2.2.2 </span>Determining the character encoding</a></li>
+ <li><a href=#character-encodings><span class=secno>12.2.2.3 </span>Character encodings</a></li>
+ <li><a href=#changing-the-encoding-while-parsing><span class=secno>12.2.2.4 </span>Changing the encoding while parsing</a></li>
+ <li><a href=#preprocessing-the-input-stream><span class=secno>12.2.2.5 </span>Preprocessing the input stream</a></ol></li>
<li><a href=#parse-state><span class=secno>12.2.3 </span>Parse state</a>
<ol>
<li><a href=#the-insertion-mode><span class=secno>12.2.3.1 </span>The insertion mode</a></li>
@@ -85813,14 +85814,14 @@
tokenization stage will be initially seen by the user agent as a
stream of bytes (typically coming over the network or from the local
file system). The bytes encode the actual characters according to a
- particular <i>character encoding</i>, which the user agent must use
+ particular <i>character encoding</i>, which the user agent uses
to decode the bytes into characters.</p>
<p class=note>For XML documents, the algorithm user agents must
use to determine the character encoding is given by the XML
specification. This section does not apply to XML documents. <a href=#refsXML>[XML]</a></p>
- <p>The <a href=#encoding-sniffing-algorithm>encoding sniffing algorithm</a> defined below is
+ <p>Usually, the <a href=#encoding-sniffing-algorithm>encoding sniffing algorithm</a> defined below is
used to determine the character encoding.</p>
<p>Given a character encoding, the bytes in the <a href=#the-input-byte-stream>input byte
@@ -85841,9 +85842,26 @@
sequences are handled can result in, amongst other problems, script injection vulnerabilities
("XSS").</p>
+ <p>When the HTML parser is decoding an input byte stream, it uses a character encoding and a <dfn id=concept-encoding-confidence title=concept-encoding-confidence>confidence</dfn>. The confidence is either <i>tentative</i>,
+ <i>certain</i>, or <i>irrelevant</i>. The encoding used, and whether the confidence in that
+ encoding is <i>tentative</i> or <i>certain</i>, is <a href=#meta-charset-during-parse>used
+ during the parsing</a> to determine whether to <a href=#change-the-encoding>change the encoding</a>. If no encoding is
+ necessary, e.g. because the parser is operating on a Unicode stream and doesn't have to use a
+ character encoding at all, then the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is
+ <i>irrelevant</i>.</p>
- <h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</h5>
+ <p class=note>Some algorithms feed the parser by directly adding characters to the <a href=#input-stream>input
+ stream</a> rather than adding bytes to the <a href=#the-input-byte-stream>input byte stream</a>.</p>
+
+ <h5 id=parsing-with-a-known-character-encoding><span class=secno>12.2.2.1 </span>Parsing with a known character encoding</h5>
+
+ <p>When the HTML parser is to operate on an input byte stream that has <dfn id=a-known-definite-encoding>a known definite
+ encoding</dfn>, then the character encoding is that encoding and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is <i>certain</i>.</p>
+
+
+ <h5 id=determining-the-character-encoding><span class=secno>12.2.2.2 </span>Determining the character encoding</h5>
+
<p>In some cases, it might be impractical to unambiguously determine the encoding before parsing
the document. Because of this, this specification provides for a two-pass mechanism with an
optional pre-scan. Implementations are allowed, as described below, to apply a simplified parsing
@@ -85857,13 +85875,8 @@
sniffing algorithm</dfn>, to determine the character encoding to use when decoding a document in
the first pass. This algorithm takes as input any out-of-band metadata available to the user agent
(e.g. the <a href=#content-type title=Content-Type>Content-Type metadata</a> of the document) and all the
- bytes available so far, and returns a character encoding and a <dfn id=concept-encoding-confidence title=concept-encoding-confidence>confidence</dfn>. The confidence is either <i>tentative</i>,
- <i>certain</i>, or <i>irrelevant</i>. The encoding used, and whether the confidence in that
- encoding is <i>tentative</i> or <i>certain</i>, is <a href=#meta-charset-during-parse>used
- during the parsing</a> to determine whether to <a href=#change-the-encoding>change the encoding</a>. If no encoding is
- necessary, e.g. because the parser is operating on a Unicode stream and doesn't have to use a
- character encoding at all, then the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is
- <i>irrelevant</i>.</p>
+ bytes available so far, and returns a character encoding and a <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> that is either <i>tentative</i> or
+ <i>certain</i>.</p>
<ol><li>
@@ -86649,7 +86662,7 @@
- <h5 id=character-encodings><span class=secno>12.2.2.2 </span>Character encodings</h5>
+ <h5 id=character-encodings><span class=secno>12.2.2.3 </span>Character encodings</h5>
<p>User agents must support the encodings defined in the WHATWG Encoding standard. User agents
should not support other encodings.</p>
@@ -86672,7 +86685,7 @@
content. <a href=#refsRFC2781>[RFC2781]</a></p>
- <h5 id=changing-the-encoding-while-parsing><span class=secno>12.2.2.3 </span>Changing the encoding while parsing</h5>
+ <h5 id=changing-the-encoding-while-parsing><span class=secno>12.2.2.4 </span>Changing the encoding while parsing</h5>
<p>When the parser requires the user agent to <dfn id=change-the-encoding>change the encoding</dfn>, it must run the
following steps. This might happen if the <a href=#encoding-sniffing-algorithm>encoding sniffing algorithm</a> described above
@@ -86723,7 +86736,7 @@
misinterpreted. User agents may notify the user of the situation,
to aid in application development.</li>
- </ol><h5 id=preprocessing-the-input-stream><span class=secno>12.2.2.4 </span>Preprocessing the input stream</h5>
+ </ol><h5 id=preprocessing-the-input-stream><span class=secno>12.2.2.5 </span>Preprocessing the input stream</h5>
<p>The <dfn id=input-stream>input stream</dfn> consists of the characters pushed
into it as the <a href=#the-input-byte-stream>input byte stream</a> is decoded or from the
Modified: index
===================================================================
--- index 2013-07-18 17:19:12 UTC (rev 8072)
+++ index 2013-07-19 18:35:41 UTC (rev 8073)
@@ -256,7 +256,7 @@
<header class=head id=head><p><a href=http://www.whatwg.org/ class=logo><img width=101 src=/images/logo alt=WHATWG height=101></a></p>
<hgroup><h1 class=allcaps>HTML</h1>
- <h2 class="no-num no-toc">Living Standard — Last Updated 18 July 2013</h2>
+ <h2 class="no-num no-toc">Living Standard — Last Updated 19 July 2013</h2>
</hgroup><dl><dt><strong>Web developer edition:</strong></dt>
<dd><strong><a href=http://developers.whatwg.org/>http://developers.whatwg.org/</a></strong></dd>
<dt>Multiple-page version:</dt>
@@ -1180,10 +1180,11 @@
<li><a href=#overview-of-the-parsing-model><span class=secno>12.2.1 </span>Overview of the parsing model</a></li>
<li><a href=#the-input-byte-stream><span class=secno>12.2.2 </span>The input byte stream</a>
<ol>
- <li><a href=#determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</a></li>
- <li><a href=#character-encodings><span class=secno>12.2.2.2 </span>Character encodings</a></li>
- <li><a href=#changing-the-encoding-while-parsing><span class=secno>12.2.2.3 </span>Changing the encoding while parsing</a></li>
- <li><a href=#preprocessing-the-input-stream><span class=secno>12.2.2.4 </span>Preprocessing the input stream</a></ol></li>
+ <li><a href=#parsing-with-a-known-character-encoding><span class=secno>12.2.2.1 </span>Parsing with a known character encoding</a></li>
+ <li><a href=#determining-the-character-encoding><span class=secno>12.2.2.2 </span>Determining the character encoding</a></li>
+ <li><a href=#character-encodings><span class=secno>12.2.2.3 </span>Character encodings</a></li>
+ <li><a href=#changing-the-encoding-while-parsing><span class=secno>12.2.2.4 </span>Changing the encoding while parsing</a></li>
+ <li><a href=#preprocessing-the-input-stream><span class=secno>12.2.2.5 </span>Preprocessing the input stream</a></ol></li>
<li><a href=#parse-state><span class=secno>12.2.3 </span>Parse state</a>
<ol>
<li><a href=#the-insertion-mode><span class=secno>12.2.3.1 </span>The insertion mode</a></li>
@@ -85813,14 +85814,14 @@
tokenization stage will be initially seen by the user agent as a
stream of bytes (typically coming over the network or from the local
file system). The bytes encode the actual characters according to a
- particular <i>character encoding</i>, which the user agent must use
+ particular <i>character encoding</i>, which the user agent uses
to decode the bytes into characters.</p>
<p class=note>For XML documents, the algorithm user agents must
use to determine the character encoding is given by the XML
specification. This section does not apply to XML documents. <a href=#refsXML>[XML]</a></p>
- <p>The <a href=#encoding-sniffing-algorithm>encoding sniffing algorithm</a> defined below is
+ <p>Usually, the <a href=#encoding-sniffing-algorithm>encoding sniffing algorithm</a> defined below is
used to determine the character encoding.</p>
<p>Given a character encoding, the bytes in the <a href=#the-input-byte-stream>input byte
@@ -85841,9 +85842,26 @@
sequences are handled can result in, amongst other problems, script injection vulnerabilities
("XSS").</p>
+ <p>When the HTML parser is decoding an input byte stream, it uses a character encoding and a <dfn id=concept-encoding-confidence title=concept-encoding-confidence>confidence</dfn>. The confidence is either <i>tentative</i>,
+ <i>certain</i>, or <i>irrelevant</i>. The encoding used, and whether the confidence in that
+ encoding is <i>tentative</i> or <i>certain</i>, is <a href=#meta-charset-during-parse>used
+ during the parsing</a> to determine whether to <a href=#change-the-encoding>change the encoding</a>. If no encoding is
+ necessary, e.g. because the parser is operating on a Unicode stream and doesn't have to use a
+ character encoding at all, then the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is
+ <i>irrelevant</i>.</p>
- <h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</h5>
+ <p class=note>Some algorithms feed the parser by directly adding characters to the <a href=#input-stream>input
+ stream</a> rather than adding bytes to the <a href=#the-input-byte-stream>input byte stream</a>.</p>
+
+ <h5 id=parsing-with-a-known-character-encoding><span class=secno>12.2.2.1 </span>Parsing with a known character encoding</h5>
+
+ <p>When the HTML parser is to operate on an input byte stream that has <dfn id=a-known-definite-encoding>a known definite
+ encoding</dfn>, then the character encoding is that encoding and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is <i>certain</i>.</p>
+
+
+ <h5 id=determining-the-character-encoding><span class=secno>12.2.2.2 </span>Determining the character encoding</h5>
+
<p>In some cases, it might be impractical to unambiguously determine the encoding before parsing
the document. Because of this, this specification provides for a two-pass mechanism with an
optional pre-scan. Implementations are allowed, as described below, to apply a simplified parsing
@@ -85857,13 +85875,8 @@
sniffing algorithm</dfn>, to determine the character encoding to use when decoding a document in
the first pass. This algorithm takes as input any out-of-band metadata available to the user agent
(e.g. the <a href=#content-type title=Content-Type>Content-Type metadata</a> of the document) and all the
- bytes available so far, and returns a character encoding and a <dfn id=concept-encoding-confidence title=concept-encoding-confidence>confidence</dfn>. The confidence is either <i>tentative</i>,
- <i>certain</i>, or <i>irrelevant</i>. The encoding used, and whether the confidence in that
- encoding is <i>tentative</i> or <i>certain</i>, is <a href=#meta-charset-during-parse>used
- during the parsing</a> to determine whether to <a href=#change-the-encoding>change the encoding</a>. If no encoding is
- necessary, e.g. because the parser is operating on a Unicode stream and doesn't have to use a
- character encoding at all, then the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is
- <i>irrelevant</i>.</p>
+ bytes available so far, and returns a character encoding and a <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> that is either <i>tentative</i> or
+ <i>certain</i>.</p>
<ol><li>
@@ -86649,7 +86662,7 @@
- <h5 id=character-encodings><span class=secno>12.2.2.2 </span>Character encodings</h5>
+ <h5 id=character-encodings><span class=secno>12.2.2.3 </span>Character encodings</h5>
<p>User agents must support the encodings defined in the WHATWG Encoding standard. User agents
should not support other encodings.</p>
@@ -86672,7 +86685,7 @@
content. <a href=#refsRFC2781>[RFC2781]</a></p>
- <h5 id=changing-the-encoding-while-parsing><span class=secno>12.2.2.3 </span>Changing the encoding while parsing</h5>
+ <h5 id=changing-the-encoding-while-parsing><span class=secno>12.2.2.4 </span>Changing the encoding while parsing</h5>
<p>When the parser requires the user agent to <dfn id=change-the-encoding>change the encoding</dfn>, it must run the
following steps. This might happen if the <a href=#encoding-sniffing-algorithm>encoding sniffing algorithm</a> described above
@@ -86723,7 +86736,7 @@
misinterpreted. User agents may notify the user of the situation,
to aid in application development.</li>
- </ol><h5 id=preprocessing-the-input-stream><span class=secno>12.2.2.4 </span>Preprocessing the input stream</h5>
+ </ol><h5 id=preprocessing-the-input-stream><span class=secno>12.2.2.5 </span>Preprocessing the input stream</h5>
<p>The <dfn id=input-stream>input stream</dfn> consists of the characters pushed
into it as the <a href=#the-input-byte-stream>input byte stream</a> is decoded or from the
Modified: source
===================================================================
--- source 2013-07-18 17:19:12 UTC (rev 8072)
+++ source 2013-07-19 18:35:41 UTC (rev 8073)
@@ -95751,7 +95751,7 @@
tokenization stage will be initially seen by the user agent as a
stream of bytes (typically coming over the network or from the local
file system). The bytes encode the actual characters according to a
- particular <i>character encoding</i>, which the user agent must use
+ particular <i>character encoding</i>, which the user agent uses
to decode the bytes into characters.</p>
<p class="note">For XML documents, the algorithm user agents must
@@ -95759,7 +95759,7 @@
specification. This section does not apply to XML documents. <a
href="#refsXML">[XML]</a></p>
- <p>The <span>encoding sniffing algorithm</span> defined below is
+ <p>Usually, the <span>encoding sniffing algorithm</span> defined below is
used to determine the character encoding.</p>
<p>Given a character encoding, the bytes in the <span>input byte
@@ -95780,7 +95780,26 @@
sequences are handled can result in, amongst other problems, script injection vulnerabilities
("XSS").</p>
+ <p>When the HTML parser is decoding an input byte stream, it uses a character encoding and a <dfn
+ title="concept-encoding-confidence">confidence</dfn>. The confidence is either <i>tentative</i>,
+ <i>certain</i>, or <i>irrelevant</i>. The encoding used, and whether the confidence in that
+ encoding is <i>tentative</i> or <i>certain</i>, is <a href="#meta-charset-during-parse">used
+ during the parsing</a> to determine whether to <span>change the encoding</span>. If no encoding is
+ necessary, e.g. because the parser is operating on a Unicode stream and doesn't have to use a
+ character encoding at all, then the <span title="concept-encoding-confidence">confidence</span> is
+ <i>irrelevant</i>.</p>
+ <p class="note">Some algorithms feed the parser by directly adding characters to the <span>input
+ stream</span> rather than adding bytes to the <span>input byte stream</span>.</p>
+
+
+ <h5>Parsing with a known character encoding</h5>
+
+ <p>When the HTML parser is to operate on an input byte stream that has <dfn>a known definite
+ encoding</dfn>, then the character encoding is that encoding and the <span
+ title="concept-encoding-confidence">confidence</span> is <i>certain</i>.</p>
+
+
<h5>Determining the character encoding</h5>
<p>In some cases, it might be impractical to unambiguously determine the encoding before parsing
@@ -95796,14 +95815,9 @@
sniffing algorithm</dfn>, to determine the character encoding to use when decoding a document in
the first pass. This algorithm takes as input any out-of-band metadata available to the user agent
(e.g. the <span title="Content-Type">Content-Type metadata</span> of the document) and all the
- bytes available so far, and returns a character encoding and a <dfn
- title="concept-encoding-confidence">confidence</dfn>. The confidence is either <i>tentative</i>,
- <i>certain</i>, or <i>irrelevant</i>. The encoding used, and whether the confidence in that
- encoding is <i>tentative</i> or <i>certain</i>, is <a href="#meta-charset-during-parse">used
- during the parsing</a> to determine whether to <span>change the encoding</span>. If no encoding is
- necessary, e.g. because the parser is operating on a Unicode stream and doesn't have to use a
- character encoding at all, then the <span title="concept-encoding-confidence">confidence</span> is
- <i>irrelevant</i>.</p>
+ bytes available so far, and returns a character encoding and a <span
+ title="concept-encoding-confidence">confidence</span> that is either <i>tentative</i> or
+ <i>certain</i>.</p>
<ol>
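
Illustrative note (not part of the commit): the change above separates the "confidence" concept from the encoding sniffing algorithm, so that a caller which already knows the encoding can hand the parser a byte stream with confidence "certain". The TypeScript sketch below shows one way that distinction could be modeled. All type and function names (ParserInput, initialDecodingState, decodeIncrementally, the sniff callback) are hypothetical and do not appear in the specification; the only real API used is TextDecoder from the WHATWG Encoding standard, which is assumed to be available as a global.

```typescript
// Hypothetical model of the three confidence values named in the spec text.
type Confidence = "tentative" | "certain" | "irrelevant";

// Hypothetical description of how the parser is being fed.
type ParserInput =
  | { kind: "bytes-known-encoding"; encoding: string } // e.g. a caller that already knows the encoding
  | { kind: "bytes-unknown-encoding" }                 // the usual case: sniffing is needed
  | { kind: "characters" };                            // characters added directly to the input stream

interface DecodingState {
  encoding: string | null; // null when no character encoding is needed
  confidence: Confidence;
}

// Stand-in for the encoding sniffing algorithm; per the revised text it
// returns a confidence that is either tentative or certain.
type Sniffer = () => { encoding: string; confidence: "tentative" | "certain" };

function initialDecodingState(input: ParserInput, sniff: Sniffer): DecodingState {
  switch (input.kind) {
    case "bytes-known-encoding":
      // A known definite encoding: use it, and the confidence is certain.
      return { encoding: input.encoding, confidence: "certain" };
    case "characters":
      // No decoding takes place, so the confidence is irrelevant.
      return { encoding: null, confidence: "irrelevant" };
    case "bytes-unknown-encoding":
      return sniff();
  }
}

// Incremental decoding with a known encoding. The `stream: true` option
// makes the decoder hold back an incomplete byte sequence at the end of a
// chunk until the following chunk arrives.
function decodeIncrementally(encoding: string, chunks: Uint8Array[]): string {
  const decoder = new TextDecoder(encoding);
  let text = "";
  for (const chunk of chunks) {
    text += decoder.decode(chunk, { stream: true });
  }
  text += decoder.decode(); // flush any buffered bytes at end of stream
  return text;
}
```

In this sketch a caller with a known encoding never invokes the sniffer, which mirrors the new "Parsing with a known character encoding" subsection; whether real callers are structured this way is an assumption.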