[html5] r6991 - / images

whatwg at whatwg.org
Mon Feb 13 14:48:12 PST 2012


Author: ianh
Date: 2012-02-13 14:48:10 -0800 (Mon, 13 Feb 2012)
New Revision: 6991

Modified:
   complete.html
   images/parsing-model-overview.png
   index
   source
Log:
[giow] (2) Rejig the wording of the character encoding section to make it more precise and in particular to not make CR processing require look-ahead.
Affected topics: HTML, HTML Syntax and Parsing
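
To illustrate the "no look-ahead" point in the log message, here is a minimal Python sketch of the reworded CR/LF rule from the diff: every CR is converted to LF as soon as it is seen, and an LF that immediately follows a CR is ignored. This is not code from the specification or from any user agent. The class name NewlineNormalizer, the method name feed, and the idea of feeding the input in string chunks are assumptions made for this example only.

```python
class NewlineNormalizer:
    """Streaming CR/LF normalization without look-ahead (illustrative sketch).

    Hypothetical helper, not part of the specification: it only needs to
    remember whether the previous character was a CR, so a CR at the very
    end of a chunk can be emitted as LF without waiting for more input.
    """

    def __init__(self) -> None:
        # True when the last character seen was a CR, possibly in an
        # earlier chunk.
        self._last_was_cr = False

    def feed(self, chunk: str) -> str:
        out = []
        for ch in chunk:
            if ch == "\r":
                # All CR characters are converted to LF immediately.
                out.append("\n")
                self._last_was_cr = True
            elif ch == "\n" and self._last_was_cr:
                # An LF that immediately follows a CR is ignored.
                self._last_was_cr = False
            else:
                out.append(ch)
                self._last_was_cr = False
        return "".join(out)


normalizer = NewlineNormalizer()
first = normalizer.feed("a\r")    # the CR is emitted as LF right away
second = normalizer.feed("\nb")   # the LF completing the CRLF pair is dropped
```

Under the previous wording (remove a CR that is followed by LF, otherwise convert it), a CR at the end of a chunk could not be handled until the next character arrived. With the new wording the only state carried across chunks is a single flag.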

Modified: complete.html
===================================================================
--- complete.html	2012-02-13 21:06:58 UTC (rev 6990)
+++ complete.html	2012-02-13 22:48:10 UTC (rev 6991)
@@ -1115,7 +1115,7 @@
    <li><a href=#parsing><span class=secno>12.2 </span>Parsing HTML documents</a>
     <ol>
      <li><a href=#overview-of-the-parsing-model><span class=secno>12.2.1 </span>Overview of the parsing model</a></li>
-     <li><a href=#the-input-stream><span class=secno>12.2.2 </span>The input stream</a>
+     <li><a href=#the-input-byte-stream><span class=secno>12.2.2 </span>The input byte stream</a>
       <ol>
        <li><a href=#determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</a></li>
        <li><a href=#character-encodings-0><span class=secno>12.2.2.2 </span>Character encodings</a></li>
@@ -13639,7 +13639,7 @@
 
     <p>If the document has an <a href=#active-parser>active parser</a> that isn't a
     <a href=#script-created-parser>script-created parser</a>, and the <a href=#insertion-point>insertion
-    point</a> associated with that parser's <a href=#the-input-stream>input
+    point</a> associated with that parser's <a href=#input-stream>input
     stream</a> is not undefined (that is, it <em>does</em> point to
     somewhere in the input stream), then the method does
     nothing. Abort these steps and return the <code><a href=#document>Document</a></code>
@@ -13783,7 +13783,7 @@
    entry.</li>
 
    <li><p>Finally, set the <a href=#insertion-point>insertion point</a> to point at
-   just before the end of the <a href=#the-input-stream>input stream</a> (which at this
+   just before the end of the <a href=#input-stream>input stream</a> (which at this
    point will be empty).</li>
 
    <li><p>Return the <code><a href=#document>Document</a></code> on which the method was
@@ -13833,7 +13833,7 @@
    with the document, then abort these steps.</li>
 
    <li><p>Insert an <a href=#explicit-eof-character>explicit "EOF" character</a> at the end
-   of the parser's <a href=#the-input-stream>input stream</a>.</li>
+   of the parser's <a href=#input-stream>input stream</a>.</li>
 
    <li><p>If there is a <a href=#pending-parsing-blocking-script>pending parsing-blocking script</a>,
    then abort these steps.</li>
@@ -13922,14 +13922,14 @@
     the user <a href=#refused-to-allow-the-document-to-be-unloaded>refused to allow the document to be
     unloaded</a>, then abort these steps. Otherwise, the
     <a href=#insertion-point>insertion point</a> will point at just before the end of
-    the (empty) <a href=#the-input-stream>input stream</a>.</p>
+    the (empty) <a href=#input-stream>input stream</a>.</p>
 
    </li>
 
    <li>
 
     <p>Insert the string consisting of the concatenation of all the
-    arguments to the method into the <a href=#the-input-stream>input stream</a> just
+    arguments to the method into the <a href=#input-stream>input stream</a> just
     before the <a href=#insertion-point>insertion point</a>.</p>
 
    </li>
@@ -64273,12 +64273,12 @@
   an <a href=#html-documents title="HTML documents">HTML document</a>, set its <a href=#concept-document-content-type title=concept-document-content-type>content type</a> to "<code title="">text/html</code>", create an <a href=#html-parser>HTML parser</a>, and
   associate it with the document. Each <a href=#concept-task title=concept-task>task</a> that the <a href=#networking-task-source>networking task
   source</a> places on the <a href=#task-queue>task queue</a> while the <a href=#fetch title=fetch>fetching algorithm</a> runs must then fill the
-  parser's <a href=#the-input-stream>input stream</a> with the fetched bytes and cause
-  the <a href=#html-parser>HTML parser</a> to perform the appropriate processing
-  of the input stream.</p>
+  parser's <a href=#the-input-byte-stream>input byte stream</a> with the fetched bytes and
+  cause the <a href=#html-parser>HTML parser</a> to perform the appropriate
+  processing of the input stream.</p>
 
-  <p class=note>The <a href=#the-input-stream>input stream</a> converts bytes into
-  characters for use in the <a href=#tokenization title=tokenization>tokenizer</a>. This process relies, in part,
+  <p class=note>The <a href=#the-input-byte-stream>input byte stream</a> converts bytes
+  into characters for use in the <a href=#tokenization title=tokenization>tokenizer</a>. This process relies, in part,
   on character encoding information found in the real <a href=#content-type title=Content-Type>Content-Type metadata</a> of the resource;
   the "sniffed type" is not used for this purpose.</p>
 
@@ -64377,9 +64377,9 @@
   state</a>. Each <a href=#concept-task title=concept-task>task</a> that the
   <a href=#networking-task-source>networking task source</a> places on the <a href=#task-queue>task
   queue</a> while the <a href=#fetch title=fetch>fetching algorithm</a>
-  runs must then fill the parser's <a href=#the-input-stream>input stream</a> with the
-  fetched bytes and cause the <a href=#html-parser>HTML parser</a> to perform the
-  appropriate processing of the input stream.</p>
+  runs must then fill the parser's <a href=#the-input-byte-stream>input byte stream</a> with
+  the fetched bytes and cause the <a href=#html-parser>HTML parser</a> to perform
+  the appropriate processing of the input stream.</p>
 
   <p>The rules for how to convert the bytes of the plain text document
   into actual characters, and the rules for actually rendering the
@@ -81111,13 +81111,13 @@
 
   <h4 id=overview-of-the-parsing-model><span class=secno>12.2.1 </span>Overview of the parsing model</h4>
 
-  <p class=overview><object data=images/parsing-model-overview.svg height=450 width=345><img alt="" height=450 src=http://images.whatwg.org/parsing-model-overview.png width=345></object></p>
+  <p class=overview><object data=images/parsing-model-overview.svg height=535 width=345><img alt="" height=450 src=http://images.whatwg.org/parsing-model-overview.png width=345></object></p>
 
   <p>The input to the HTML parsing process consists of a stream of
-  Unicode code points, which is passed through a
-  <a href=#tokenization>tokenization</a> stage followed by a <a href=#tree-construction>tree
-  construction</a> stage. The output is a <code><a href=#document>Document</a></code>
-  object.</p>
+  <a href=#unicode-code-point title="Unicode code point">Unicode code points</a>, which
+  is passed through a <a href=#tokenization>tokenization</a> stage followed by a
+  <a href=#tree-construction>tree construction</a> stage. The output is a
+  <code><a href=#document>Document</a></code> object.</p>
 
   <p class=note>Implementations that <a href=#non-scripted>do not
   support scripting</a> do not have to actually create a DOM
@@ -81157,22 +81157,51 @@
   </div>
 
 
+
   <div class=impl>
 
-  <h4 id=the-input-stream><span class=secno>12.2.2 </span>The <dfn>input stream</dfn></h4>
+  <h4 id=the-input-byte-stream><span class=secno>12.2.2 </span>The <dfn>input byte stream</dfn></h4>
 
   <p>The stream of Unicode code points that comprises the input to the
   tokenization stage will be initially seen by the user agent as a
   stream of bytes (typically coming over the network or from the local
   file system). The bytes encode the actual characters according to a
-  particular <em>character encoding</em>, which the user agent must
-  use to decode the bytes into characters.</p>
+  particular <i>character encoding</i>, which the user agent must use
+  to decode the bytes into characters.</p>
 
   <p class=note>For XML documents, the algorithm user agents must
   use to determine the character encoding is given by the XML
   specification. This section does not apply to XML documents. <a href=#refsXML>[XML]</a></p>
 
+  <p>The <a href=#encoding-sniffing-algorithm>encoding sniffing algorithm</a> defined below is
+  used to determine the character encoding.</p>
 
+  <p>Given an encoding, the bytes in the <a href=#the-input-byte-stream>input byte
+  stream</a> must be converted to Unicode code points for the
+  tokenizer's <a href=#input-stream>input stream</a>, as described by the rules for
+  that encoding, except that the leading U+FEFF BYTE ORDER MARK
+  character, if any, must not be stripped by the encoding layer (it is
+  stripped by the rule below).</p> <!-- this is to prevent two leading
+  BOMs from being both stripped, once by the decoder, and once by the
+  parser -->
+
+  <p>Bytes or sequences of bytes in the original byte stream that
+  could not be converted to Unicode code points must be converted to
+  U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
+  UTF-8, the bytes must be <a href=#decoded-as-utf-8,-with-error-handling title="decoded as UTF-8, with error
+  handling">decoded with the error handling</a> defined in this
+  specification.</p>
+
+  <p class=note>Bytes or sequences of bytes in the original byte
+  stream that did not conform to the encoding specification (e.g.
+  invalid UTF-8 byte sequences in a UTF-8 input byte stream) are
+  errors that conformance checkers are expected to report.</p>
+
+  <p>Any byte or sequence of bytes in the original byte stream that is
+  <a href=#misinterpreted-for-compatibility>misinterpreted for compatibility</a> is a <a href=#parse-error>parse
+  error</a>.</p>
+
+
   <h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</h5>
 
   <p>In some cases, it might be impractical to unambiguously determine
@@ -81428,7 +81457,7 @@
   </ol><p>The <a href="#document's-character-encoding">document's character encoding</a> must immediately
   be set to the value returned from this algorithm, at the same time
   as the user agent uses the returned value to select the decoder to
-  use for the input stream.</p>
+  use for the input byte stream.</p>
 
   <hr><p>When an algorithm requires a user agent to <dfn id=prescan-a-byte-stream-to-determine-its-encoding>prescan a byte
   stream to determine its encoding</dfn>, given some defined <var title="">end condition</var>, then it must run the following steps.
@@ -81438,7 +81467,7 @@
   <ol><li>
 
     <p>Let <var title="">position</var> be a pointer to a byte in the
-    input stream, initially pointing at the first byte. If at any
+    input byte stream, initially pointing at the first byte. If at any
     point during these steps the user agent either runs out of bytes
     or reaches its <var title="">end condition</var>, then abort the
     <a href=#prescan-a-byte-stream-to-determine-its-encoding>prescan a byte stream to determine its encoding</a>
@@ -81575,8 +81604,8 @@
     </dl></li>
 
    <li><i>Next byte</i>: Move <var title="">position</var> so it
-   points at the next byte in the input stream, and return to the step
-   above labeld <i>loop</i>.</li>
+   points at the next byte in the input byte stream, and return to the
+   step above labeld <i>loop</i>.</li>
 
   </ol><p>When the <a href=#prescan-a-byte-stream-to-determine-its-encoding>prescan a byte stream to determine its
   encoding</a> algorithm says to <dfn id=concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an attribute</dfn>,
@@ -81851,32 +81880,12 @@
 
   <h5 id=preprocessing-the-input-stream><span class=secno>12.2.2.3 </span>Preprocessing the input stream</h5>
 
-  <p>Given an encoding, the bytes in the input stream must be
-  converted to Unicode code points for the tokenizer, as described by
-  the rules for that encoding, except that the leading U+FEFF BYTE
-  ORDER MARK character, if any, must not be stripped by the encoding
-  layer (it is stripped by the rule below).</p> <!-- this is to
-  prevent two leading BOMs from being both stripped, once by the
-  decoder, and once by the parser -->
+  <p>The <dfn id=input-stream>input stream</dfn> consists of the characters pushed
+  into it as the <a href=#the-input-byte-stream>input byte stream</a> is decoded or from the
+  various APIs that directly manipulate the input stream.</p>
 
-  <p>Bytes or sequences of bytes in the original byte stream that
-  could not be converted to Unicode code points must be converted to
-  U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
-  UTF-8, the bytes must be <a href=#decoded-as-utf-8,-with-error-handling title="decoded as UTF-8, with error
-  handling">decoded with the error handling</a> defined in this
-  specification.</p>
-
-  <p class=note>Bytes or sequences of bytes in the original byte
-  stream that did not conform to the encoding specification
-  (e.g. invalid UTF-8 byte sequences in a UTF-8 input stream) are
-  errors that conformance checkers are expected to report.</p>
-
-  <p>Any byte or sequence of bytes in the original byte stream that is
-  <a href=#misinterpreted-for-compatibility>misinterpreted for compatibility</a> is a <a href=#parse-error>parse
-  error</a>.</p>
-
   <p>One leading U+FEFF BYTE ORDER MARK character must be ignored if
-  any are present.</p>
+  any are present in the <a href=#input-stream>input stream</a>.</p>
 
   <p class=note>The requirement to strip a U+FEFF BYTE ORDER MARK
   character regardless of whether that character was used to determine
@@ -81898,18 +81907,18 @@
   undefined Unicode characters (noncharacters).</p>
 
   <p>U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
-  characters are treated specially. Any CR characters that are
-  followed by LF characters must be removed, and any CR characters not
-  followed by LF characters must be converted to LF characters. Thus,
-  newlines in HTML DOMs are represented by LF characters, and there
-  are never any CR characters in the input to the
-  <a href=#tokenization>tokenization</a> stage.</p>
+  characters are treated specially. All CR characters must be
+  converted to LF characters, and any LF characters that immediately
+  follow a CR character must be ignored. Thus, newlines in HTML DOMs
+  are represented by LF characters, and there are never any CR
+  characters in the input to the <a href=#tokenization>tokenization</a> stage.</p>
 
   <p>The <dfn id=next-input-character>next input character</dfn> is the first character in the
-  input stream that has not yet been <dfn id=consumed>consumed</dfn>. Initially,
-  the <i><a href=#next-input-character>next input character</a></i> is the first character in the
-  input. The <dfn id=current-input-character>current input character</dfn> is the last character
-  to have been <i><a href=#consumed>consumed</a></i>.</p>
+  <a href=#input-stream>input stream</a> that has not yet been <dfn id=consumed>consumed</dfn>
+  or explicit ignored by the requirements in this section. Initially,
+  the <i><a href=#next-input-character>next input character</a></i> is the first character in the input.
+  The <dfn id=current-input-character>current input character</dfn> is the last character to have
+  been <i><a href=#consumed>consumed</a></i>.</p>
 
   <p>The <dfn id=insertion-point>insertion point</dfn> is the position (just before a
   character or just before the end of the input stream) where content
@@ -81920,9 +81929,9 @@
   undefined.</p>
 
   <p>The "EOF" character in the tables below is a conceptual character
-  representing the end of the <a href=#the-input-stream>input stream</a>. If the parser
+  representing the end of the <a href=#input-stream>input stream</a>. If the parser
   is a <a href=#script-created-parser>script-created parser</a>, then the end of the
-  <a href=#the-input-stream>input stream</a> is reached when an <dfn id=explicit-eof-character>explicit "EOF"
+  <a href=#input-stream>input stream</a> is reached when an <dfn id=explicit-eof-character>explicit "EOF"
   character</dfn> (inserted by the <code title=dom-document-close><a href=#dom-document-close>document.close()</a></code> method) is
   consumed. Otherwise, the "EOF" character is not a real character in
   the stream, but rather the lack of any further characters.</p>
@@ -88477,7 +88486,7 @@
   </ol><p>When the user agent is to <dfn id=abort-a-parser>abort a parser</dfn>, it must run
   the following steps:</p>
 
-  <ol><li><p>Throw away any pending content in the <a href=#the-input-stream>input
+  <ol><li><p>Throw away any pending content in the <a href=#input-stream>input
    stream</a>, and discard any future content that would have been
    added to it.</li>
 
@@ -89291,7 +89300,7 @@
 
    <li>
 
-    <p>Place into the <a href=#the-input-stream>input stream</a> for the <a href=#html-parser>HTML
+    <p>Place into the <a href=#input-stream>input stream</a> for the <a href=#html-parser>HTML
     parser</a> just created the <var title="">input</var>. The
     encoding <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is
     <i>irrelevant</i>.</p>

Modified: images/parsing-model-overview.png
===================================================================
(Binary files differ)

Modified: index
===================================================================
--- index	2012-02-13 21:06:58 UTC (rev 6990)
+++ index	2012-02-13 22:48:10 UTC (rev 6991)
@@ -1115,7 +1115,7 @@
    <li><a href=#parsing><span class=secno>12.2 </span>Parsing HTML documents</a>
     <ol>
      <li><a href=#overview-of-the-parsing-model><span class=secno>12.2.1 </span>Overview of the parsing model</a></li>
-     <li><a href=#the-input-stream><span class=secno>12.2.2 </span>The input stream</a>
+     <li><a href=#the-input-byte-stream><span class=secno>12.2.2 </span>The input byte stream</a>
       <ol>
        <li><a href=#determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</a></li>
        <li><a href=#character-encodings-0><span class=secno>12.2.2.2 </span>Character encodings</a></li>
@@ -13639,7 +13639,7 @@
 
     <p>If the document has an <a href=#active-parser>active parser</a> that isn't a
     <a href=#script-created-parser>script-created parser</a>, and the <a href=#insertion-point>insertion
-    point</a> associated with that parser's <a href=#the-input-stream>input
+    point</a> associated with that parser's <a href=#input-stream>input
     stream</a> is not undefined (that is, it <em>does</em> point to
     somewhere in the input stream), then the method does
     nothing. Abort these steps and return the <code><a href=#document>Document</a></code>
@@ -13783,7 +13783,7 @@
    entry.</li>
 
    <li><p>Finally, set the <a href=#insertion-point>insertion point</a> to point at
-   just before the end of the <a href=#the-input-stream>input stream</a> (which at this
+   just before the end of the <a href=#input-stream>input stream</a> (which at this
    point will be empty).</li>
 
    <li><p>Return the <code><a href=#document>Document</a></code> on which the method was
@@ -13833,7 +13833,7 @@
    with the document, then abort these steps.</li>
 
    <li><p>Insert an <a href=#explicit-eof-character>explicit "EOF" character</a> at the end
-   of the parser's <a href=#the-input-stream>input stream</a>.</li>
+   of the parser's <a href=#input-stream>input stream</a>.</li>
 
    <li><p>If there is a <a href=#pending-parsing-blocking-script>pending parsing-blocking script</a>,
    then abort these steps.</li>
@@ -13922,14 +13922,14 @@
     the user <a href=#refused-to-allow-the-document-to-be-unloaded>refused to allow the document to be
     unloaded</a>, then abort these steps. Otherwise, the
     <a href=#insertion-point>insertion point</a> will point at just before the end of
-    the (empty) <a href=#the-input-stream>input stream</a>.</p>
+    the (empty) <a href=#input-stream>input stream</a>.</p>
 
    </li>
 
    <li>
 
     <p>Insert the string consisting of the concatenation of all the
-    arguments to the method into the <a href=#the-input-stream>input stream</a> just
+    arguments to the method into the <a href=#input-stream>input stream</a> just
     before the <a href=#insertion-point>insertion point</a>.</p>
 
    </li>
@@ -64273,12 +64273,12 @@
   an <a href=#html-documents title="HTML documents">HTML document</a>, set its <a href=#concept-document-content-type title=concept-document-content-type>content type</a> to "<code title="">text/html</code>", create an <a href=#html-parser>HTML parser</a>, and
   associate it with the document. Each <a href=#concept-task title=concept-task>task</a> that the <a href=#networking-task-source>networking task
   source</a> places on the <a href=#task-queue>task queue</a> while the <a href=#fetch title=fetch>fetching algorithm</a> runs must then fill the
-  parser's <a href=#the-input-stream>input stream</a> with the fetched bytes and cause
-  the <a href=#html-parser>HTML parser</a> to perform the appropriate processing
-  of the input stream.</p>
+  parser's <a href=#the-input-byte-stream>input byte stream</a> with the fetched bytes and
+  cause the <a href=#html-parser>HTML parser</a> to perform the appropriate
+  processing of the input stream.</p>
 
-  <p class=note>The <a href=#the-input-stream>input stream</a> converts bytes into
-  characters for use in the <a href=#tokenization title=tokenization>tokenizer</a>. This process relies, in part,
+  <p class=note>The <a href=#the-input-byte-stream>input byte stream</a> converts bytes
+  into characters for use in the <a href=#tokenization title=tokenization>tokenizer</a>. This process relies, in part,
   on character encoding information found in the real <a href=#content-type title=Content-Type>Content-Type metadata</a> of the resource;
   the "sniffed type" is not used for this purpose.</p>
 
@@ -64377,9 +64377,9 @@
   state</a>. Each <a href=#concept-task title=concept-task>task</a> that the
   <a href=#networking-task-source>networking task source</a> places on the <a href=#task-queue>task
   queue</a> while the <a href=#fetch title=fetch>fetching algorithm</a>
-  runs must then fill the parser's <a href=#the-input-stream>input stream</a> with the
-  fetched bytes and cause the <a href=#html-parser>HTML parser</a> to perform the
-  appropriate processing of the input stream.</p>
+  runs must then fill the parser's <a href=#the-input-byte-stream>input byte stream</a> with
+  the fetched bytes and cause the <a href=#html-parser>HTML parser</a> to perform
+  the appropriate processing of the input stream.</p>
 
   <p>The rules for how to convert the bytes of the plain text document
   into actual characters, and the rules for actually rendering the
@@ -81111,13 +81111,13 @@
 
   <h4 id=overview-of-the-parsing-model><span class=secno>12.2.1 </span>Overview of the parsing model</h4>
 
-  <p class=overview><object data=images/parsing-model-overview.svg height=450 width=345><img alt="" height=450 src=http://images.whatwg.org/parsing-model-overview.png width=345></object></p>
+  <p class=overview><object data=images/parsing-model-overview.svg height=535 width=345><img alt="" height=450 src=http://images.whatwg.org/parsing-model-overview.png width=345></object></p>
 
   <p>The input to the HTML parsing process consists of a stream of
-  Unicode code points, which is passed through a
-  <a href=#tokenization>tokenization</a> stage followed by a <a href=#tree-construction>tree
-  construction</a> stage. The output is a <code><a href=#document>Document</a></code>
-  object.</p>
+  <a href=#unicode-code-point title="Unicode code point">Unicode code points</a>, which
+  is passed through a <a href=#tokenization>tokenization</a> stage followed by a
+  <a href=#tree-construction>tree construction</a> stage. The output is a
+  <code><a href=#document>Document</a></code> object.</p>
 
   <p class=note>Implementations that <a href=#non-scripted>do not
   support scripting</a> do not have to actually create a DOM
@@ -81157,22 +81157,51 @@
   </div>
 
 
+
   <div class=impl>
 
-  <h4 id=the-input-stream><span class=secno>12.2.2 </span>The <dfn>input stream</dfn></h4>
+  <h4 id=the-input-byte-stream><span class=secno>12.2.2 </span>The <dfn>input byte stream</dfn></h4>
 
   <p>The stream of Unicode code points that comprises the input to the
   tokenization stage will be initially seen by the user agent as a
   stream of bytes (typically coming over the network or from the local
   file system). The bytes encode the actual characters according to a
-  particular <em>character encoding</em>, which the user agent must
-  use to decode the bytes into characters.</p>
+  particular <i>character encoding</i>, which the user agent must use
+  to decode the bytes into characters.</p>
 
   <p class=note>For XML documents, the algorithm user agents must
   use to determine the character encoding is given by the XML
   specification. This section does not apply to XML documents. <a href=#refsXML>[XML]</a></p>
 
+  <p>The <a href=#encoding-sniffing-algorithm>encoding sniffing algorithm</a> defined below is
+  used to determine the character encoding.</p>
 
+  <p>Given an encoding, the bytes in the <a href=#the-input-byte-stream>input byte
+  stream</a> must be converted to Unicode code points for the
+  tokenizer's <a href=#input-stream>input stream</a>, as described by the rules for
+  that encoding, except that the leading U+FEFF BYTE ORDER MARK
+  character, if any, must not be stripped by the encoding layer (it is
+  stripped by the rule below).</p> <!-- this is to prevent two leading
+  BOMs from being both stripped, once by the decoder, and once by the
+  parser -->
+
+  <p>Bytes or sequences of bytes in the original byte stream that
+  could not be converted to Unicode code points must be converted to
+  U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
+  UTF-8, the bytes must be <a href=#decoded-as-utf-8,-with-error-handling title="decoded as UTF-8, with error
+  handling">decoded with the error handling</a> defined in this
+  specification.</p>
+
+  <p class=note>Bytes or sequences of bytes in the original byte
+  stream that did not conform to the encoding specification (e.g.
+  invalid UTF-8 byte sequences in a UTF-8 input byte stream) are
+  errors that conformance checkers are expected to report.</p>
+
+  <p>Any byte or sequence of bytes in the original byte stream that is
+  <a href=#misinterpreted-for-compatibility>misinterpreted for compatibility</a> is a <a href=#parse-error>parse
+  error</a>.</p>
+
+
   <h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</h5>
 
   <p>In some cases, it might be impractical to unambiguously determine
@@ -81428,7 +81457,7 @@
   </ol><p>The <a href="#document's-character-encoding">document's character encoding</a> must immediately
   be set to the value returned from this algorithm, at the same time
   as the user agent uses the returned value to select the decoder to
-  use for the input stream.</p>
+  use for the input byte stream.</p>
 
   <hr><p>When an algorithm requires a user agent to <dfn id=prescan-a-byte-stream-to-determine-its-encoding>prescan a byte
   stream to determine its encoding</dfn>, given some defined <var title="">end condition</var>, then it must run the following steps.
@@ -81438,7 +81467,7 @@
   <ol><li>
 
     <p>Let <var title="">position</var> be a pointer to a byte in the
-    input stream, initially pointing at the first byte. If at any
+    input byte stream, initially pointing at the first byte. If at any
     point during these steps the user agent either runs out of bytes
     or reaches its <var title="">end condition</var>, then abort the
     <a href=#prescan-a-byte-stream-to-determine-its-encoding>prescan a byte stream to determine its encoding</a>
@@ -81575,8 +81604,8 @@
     </dl></li>
 
    <li><i>Next byte</i>: Move <var title="">position</var> so it
-   points at the next byte in the input stream, and return to the step
-   above labeld <i>loop</i>.</li>
+   points at the next byte in the input byte stream, and return to the
+   step above labeld <i>loop</i>.</li>
 
   </ol><p>When the <a href=#prescan-a-byte-stream-to-determine-its-encoding>prescan a byte stream to determine its
   encoding</a> algorithm says to <dfn id=concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an attribute</dfn>,
@@ -81851,32 +81880,12 @@
 
   <h5 id=preprocessing-the-input-stream><span class=secno>12.2.2.3 </span>Preprocessing the input stream</h5>
 
-  <p>Given an encoding, the bytes in the input stream must be
-  converted to Unicode code points for the tokenizer, as described by
-  the rules for that encoding, except that the leading U+FEFF BYTE
-  ORDER MARK character, if any, must not be stripped by the encoding
-  layer (it is stripped by the rule below).</p> <!-- this is to
-  prevent two leading BOMs from being both stripped, once by the
-  decoder, and once by the parser -->
+  <p>The <dfn id=input-stream>input stream</dfn> consists of the characters pushed
+  into it as the <a href=#the-input-byte-stream>input byte stream</a> is decoded or from the
+  various APIs that directly manipulate the input stream.</p>
 
-  <p>Bytes or sequences of bytes in the original byte stream that
-  could not be converted to Unicode code points must be converted to
-  U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
-  UTF-8, the bytes must be <a href=#decoded-as-utf-8,-with-error-handling title="decoded as UTF-8, with error
-  handling">decoded with the error handling</a> defined in this
-  specification.</p>
-
-  <p class=note>Bytes or sequences of bytes in the original byte
-  stream that did not conform to the encoding specification
-  (e.g. invalid UTF-8 byte sequences in a UTF-8 input stream) are
-  errors that conformance checkers are expected to report.</p>
-
-  <p>Any byte or sequence of bytes in the original byte stream that is
-  <a href=#misinterpreted-for-compatibility>misinterpreted for compatibility</a> is a <a href=#parse-error>parse
-  error</a>.</p>
-
   <p>One leading U+FEFF BYTE ORDER MARK character must be ignored if
-  any are present.</p>
+  any are present in the <a href=#input-stream>input stream</a>.</p>
 
   <p class=note>The requirement to strip a U+FEFF BYTE ORDER MARK
   character regardless of whether that character was used to determine
@@ -81898,18 +81907,18 @@
   undefined Unicode characters (noncharacters).</p>
 
   <p>U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
-  characters are treated specially. Any CR characters that are
-  followed by LF characters must be removed, and any CR characters not
-  followed by LF characters must be converted to LF characters. Thus,
-  newlines in HTML DOMs are represented by LF characters, and there
-  are never any CR characters in the input to the
-  <a href=#tokenization>tokenization</a> stage.</p>
+  characters are treated specially. All CR characters must be
+  converted to LF characters, and any LF characters that immediately
+  follow a CR character must be ignored. Thus, newlines in HTML DOMs
+  are represented by LF characters, and there are never any CR
+  characters in the input to the <a href=#tokenization>tokenization</a> stage.</p>
 
   <p>The <dfn id=next-input-character>next input character</dfn> is the first character in the
-  input stream that has not yet been <dfn id=consumed>consumed</dfn>. Initially,
-  the <i><a href=#next-input-character>next input character</a></i> is the first character in the
-  input. The <dfn id=current-input-character>current input character</dfn> is the last character
-  to have been <i><a href=#consumed>consumed</a></i>.</p>
+  <a href=#input-stream>input stream</a> that has not yet been <dfn id=consumed>consumed</dfn>
+  or explicitly ignored by the requirements in this section. Initially,
+  the <i><a href=#next-input-character>next input character</a></i> is the first character in the input.
+  The <dfn id=current-input-character>current input character</dfn> is the last character to have
+  been <i><a href=#consumed>consumed</a></i>.</p>
 
   <p>The <dfn id=insertion-point>insertion point</dfn> is the position (just before a
   character or just before the end of the input stream) where content
@@ -81920,9 +81929,9 @@
   undefined.</p>
 
   <p>The "EOF" character in the tables below is a conceptual character
-  representing the end of the <a href=#the-input-stream>input stream</a>. If the parser
+  representing the end of the <a href=#input-stream>input stream</a>. If the parser
   is a <a href=#script-created-parser>script-created parser</a>, then the end of the
-  <a href=#the-input-stream>input stream</a> is reached when an <dfn id=explicit-eof-character>explicit "EOF"
+  <a href=#input-stream>input stream</a> is reached when an <dfn id=explicit-eof-character>explicit "EOF"
   character</dfn> (inserted by the <code title=dom-document-close><a href=#dom-document-close>document.close()</a></code> method) is
   consumed. Otherwise, the "EOF" character is not a real character in
   the stream, but rather the lack of any further characters.</p>
@@ -88477,7 +88486,7 @@
   </ol><p>When the user agent is to <dfn id=abort-a-parser>abort a parser</dfn>, it must run
   the following steps:</p>
 
-  <ol><li><p>Throw away any pending content in the <a href=#the-input-stream>input
+  <ol><li><p>Throw away any pending content in the <a href=#input-stream>input
    stream</a>, and discard any future content that would have been
    added to it.</li>
 
@@ -89291,7 +89300,7 @@
 
    <li>
 
-    <p>Place into the <a href=#the-input-stream>input stream</a> for the <a href=#html-parser>HTML
+    <p>Place into the <a href=#input-stream>input stream</a> for the <a href=#html-parser>HTML
     parser</a> just created the <var title="">input</var>. The
     encoding <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is
     <i>irrelevant</i>.</p>

Modified: source
===================================================================
--- source	2012-02-13 21:06:58 UTC (rev 6990)
+++ source	2012-02-13 22:48:10 UTC (rev 6991)
@@ -75132,12 +75132,12 @@
   title="concept-task">task</span> that the <span>networking task
   source</span> places on the <span>task queue</span> while the <span
   title="fetch">fetching algorithm</span> runs must then fill the
-  parser's <span>input stream</span> with the fetched bytes and cause
-  the <span>HTML parser</span> to perform the appropriate processing
-  of the input stream.</p>
+  parser's <span>input byte stream</span> with the fetched bytes and
+  cause the <span>HTML parser</span> to perform the appropriate
+  processing of the input stream.</p>
 
-  <p class="note">The <span>input stream</span> converts bytes into
-  characters for use in the <span
+  <p class="note">The <span>input byte stream</span> converts bytes
+  into characters for use in the <span
   title="tokenization">tokenizer</span>. This process relies, in part,
   on character encoding information found in the real <span
   title="Content-Type">Content-Type metadata</span> of the resource;
@@ -75249,9 +75249,9 @@
   state</span>. Each <span title="concept-task">task</span> that the
   <span>networking task source</span> places on the <span>task
   queue</span> while the <span title="fetch">fetching algorithm</span>
-  runs must then fill the parser's <span>input stream</span> with the
-  fetched bytes and cause the <span>HTML parser</span> to perform the
-  appropriate processing of the input stream.</p>
+  runs must then fill the parser's <span>input byte stream</span> with
+  the fetched bytes and cause the <span>HTML parser</span> to perform
+  the appropriate processing of the input stream.</p>
 
   <p>The rules for how to convert the bytes of the plain text document
   into actual characters, and the rules for actually rendering the
@@ -94069,13 +94069,13 @@
 
   <h4>Overview of the parsing model</h4>
 
-  <p class="overview"><object data="images/parsing-model-overview.svg" width="345" height="450"><img src="images/parsing-model-overview.png" width="345" height="450" alt=""></object></p>
+  <p class="overview"><object data="images/parsing-model-overview.svg" width="345" height="535"><img src="images/parsing-model-overview.png" width="345" height="450" alt=""></object></p>
 
   <p>The input to the HTML parsing process consists of a stream of
-  Unicode code points, which is passed through a
-  <span>tokenization</span> stage followed by a <span>tree
-  construction</span> stage. The output is a <code>Document</code>
-  object.</p>
+  <span title="Unicode code point">Unicode code points</span>, which
+  is passed through a <span>tokenization</span> stage followed by a
+  <span>tree construction</span> stage. The output is a
+  <code>Document</code> object.</p>
 
   <p class="note">Implementations that <a href="#non-scripted">do not
   support scripting</a> do not have to actually create a DOM
@@ -94116,23 +94116,52 @@
   </div>
 
 
+
   <div class="impl">
 
-  <h4>The <dfn>input stream</dfn></h4>
+  <h4>The <dfn>input byte stream</dfn></h4>
 
   <p>The stream of Unicode code points that comprises the input to the
   tokenization stage will be initially seen by the user agent as a
   stream of bytes (typically coming over the network or from the local
   file system). The bytes encode the actual characters according to a
-  particular <em>character encoding</em>, which the user agent must
-  use to decode the bytes into characters.</p>
+  particular <i>character encoding</i>, which the user agent must use
+  to decode the bytes into characters.</p>
 
   <p class="note">For XML documents, the algorithm user agents must
   use to determine the character encoding is given by the XML
   specification. This section does not apply to XML documents. <a
   href="#refsXML">[XML]</a></p>
 
+  <p>The <span>encoding sniffing algorithm</span> defined below is
+  used to determine the character encoding.</p>
 
+  <p>Given an encoding, the bytes in the <span>input byte
+  stream</span> must be converted to Unicode code points for the
+  tokenizer's <span>input stream</span>, as described by the rules for
+  that encoding, except that the leading U+FEFF BYTE ORDER MARK
+  character, if any, must not be stripped by the encoding layer (it is
+  stripped by the rule below).</p> <!-- this is to prevent two leading
+  BOMs from being both stripped, once by the decoder, and once by the
+  parser -->
+
+  <p>Bytes or sequences of bytes in the original byte stream that
+  could not be converted to Unicode code points must be converted to
+  U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
+  UTF-8, the bytes must be <span title="decoded as UTF-8, with error
+  handling">decoded with the error handling</span> defined in this
+  specification.</p>
+
+  <p class="note">Bytes or sequences of bytes in the original byte
+  stream that did not conform to the encoding specification (e.g.
+  invalid UTF-8 byte sequences in a UTF-8 input byte stream) are
+  errors that conformance checkers are expected to report.</p>
+
+  <p>Any byte or sequence of bytes in the original byte stream that is
+  <span>misinterpreted for compatibility</span> is a <span>parse
+  error</span>.</p>
+
+
   <h5>Determining the character encoding</h5>
 
   <p>In some cases, it might be impractical to unambiguously determine
@@ -94451,7 +94480,7 @@
   <p>The <span>document's character encoding</span> must immediately
   be set to the value returned from this algorithm, at the same time
   as the user agent uses the returned value to select the decoder to
-  use for the input stream.</p>
+  use for the input byte stream.</p>
 
   <hr>
 
@@ -94466,7 +94495,7 @@
    <li>
 
     <p>Let <var title="">position</var> be a pointer to a byte in the
-    input stream, initially pointing at the first byte. If at any
+    input byte stream, initially pointing at the first byte. If at any
     point during these steps the user agent either runs out of bytes
     or reaches its <var title="">end condition</var>, then abort the
     <span>prescan a byte stream to determine its encoding</span>
@@ -94630,8 +94659,8 @@
    </li>
 
    <li><i>Next byte</i>: Move <var title="">position</var> so it
-   points at the next byte in the input stream, and return to the step
-   above labeld <i>loop</i>.</li>
+   points at the next byte in the input byte stream, and return to the
+   step above labeled <i>loop</i>.</li>
 
   </ol>
 
@@ -94970,32 +94999,12 @@
 
   <h5>Preprocessing the input stream</h5>
 
-  <p>Given an encoding, the bytes in the input stream must be
-  converted to Unicode code points for the tokenizer, as described by
-  the rules for that encoding, except that the leading U+FEFF BYTE
-  ORDER MARK character, if any, must not be stripped by the encoding
-  layer (it is stripped by the rule below).</p> <!-- this is to
-  prevent two leading BOMs from being both stripped, once by the
-  decoder, and once by the parser -->
+  <p>The <dfn>input stream</dfn> consists of the characters pushed
+  into it as the <span>input byte stream</span> is decoded or from the
+  various APIs that directly manipulate the input stream.</p>
 
-  <p>Bytes or sequences of bytes in the original byte stream that
-  could not be converted to Unicode code points must be converted to
-  U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
-  UTF-8, the bytes must be <span title="decoded as UTF-8, with error
-  handling">decoded with the error handling</span> defined in this
-  specification.</p>
-
-  <p class="note">Bytes or sequences of bytes in the original byte
-  stream that did not conform to the encoding specification
-  (e.g. invalid UTF-8 byte sequences in a UTF-8 input stream) are
-  errors that conformance checkers are expected to report.</p>
-
-  <p>Any byte or sequence of bytes in the original byte stream that is
-  <span>misinterpreted for compatibility</span> is a <span>parse
-  error</span>.</p>
-
   <p>One leading U+FEFF BYTE ORDER MARK character must be ignored if
-  any are present.</p>
+  any are present in the <span>input stream</span>.</p>
 
   <p class="note">The requirement to strip a U+FEFF BYTE ORDER MARK
   character regardless of whether that character was used to determine
@@ -95017,18 +95026,18 @@
   undefined Unicode characters (noncharacters).</p>
 
   <p>U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
-  characters are treated specially. Any CR characters that are
-  followed by LF characters must be removed, and any CR characters not
-  followed by LF characters must be converted to LF characters. Thus,
-  newlines in HTML DOMs are represented by LF characters, and there
-  are never any CR characters in the input to the
-  <span>tokenization</span> stage.</p>
+  characters are treated specially. All CR characters must be
+  converted to LF characters, and any LF characters that immediately
+  follow a CR character must be ignored. Thus, newlines in HTML DOMs
+  are represented by LF characters, and there are never any CR
+  characters in the input to the <span>tokenization</span> stage.</p>
 
   <p>The <dfn>next input character</dfn> is the first character in the
-  input stream that has not yet been <dfn>consumed</dfn>. Initially,
-  the <i>next input character</i> is the first character in the
-  input. The <dfn>current input character</dfn> is the last character
-  to have been <i>consumed</i>.</p>
+  <span>input stream</span> that has not yet been <dfn>consumed</dfn>
+  or explicitly ignored by the requirements in this section. Initially,
+  the <i>next input character</i> is the first character in the input.
+  The <dfn>current input character</dfn> is the last character to have
+  been <i>consumed</i>.</p>
 
   <p>The <dfn>insertion point</dfn> is the position (just before a
   character or just before the end of the input stream) where content
