[html5] r6990 - [e] (0) Factor out the prescan algorithm for reuse in other specs. Fixing https: [...]
whatwg at whatwg.org
Mon Feb 13 13:07:00 PST 2012
Author: ianh
Date: 2012-02-13 13:06:58 -0800 (Mon, 13 Feb 2012)
New Revision: 6990
Modified:
complete.html
index
source
Log:
[e] (0) Factor out the prescan algorithm for reuse in other specs.
Fixing https://www.w3.org/Bugs/Public/show_bug.cgi?id=14284
Affected topics: HTML Syntax and Parsing
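
For readers skimming the diff below, here is a minimal Python sketch of the algorithm this revision factors out ("prescan a byte stream to determine its encoding", plus its "get an attribute" helper). It is an illustration under stated assumptions, not the spec's normative text. The function names, the `limit` parameter (standing in for the end condition), and the use of Python's codec registry to decide what counts as a supported encoding are all hypothetical. `CONTENT_CHARSET` is a crude substitute for the "algorithm for extracting an encoding from a meta element", which the diff references but does not define.

```python
import codecs
import re

WS = b"\t\n\x0c\r "
META = re.compile(rb"<meta[\t\n\x0c\r /]", re.I)  # '<meta' + space or slash
TAG = re.compile(rb"</?[A-Za-z]")                 # start or end tag
# Assumption: crude stand-in for "extracting an encoding from a meta element".
CONTENT_CHARSET = re.compile(r"charset\s*=\s*[\"']?([^\"'\s;]+)")

def _lower(b):  # lowercase ASCII A-Z only, as the prescan text does
    return chr(b + 0x20) if 0x41 <= b <= 0x5A else chr(b)

def _codec_name(label):  # assumption: Python's registry defines "supported"
    try:
        return codecs.lookup(label).name
    except (LookupError, ValueError):
        return None

def get_attribute(data, pos):
    """Return ((name, value), pos), or (None, pos) when there is no attribute."""
    while data[pos] in b"\t\n\x0c\r /":
        pos += 1
    if data[pos] == 0x3E:                      # '>'
        return None, pos
    name = value = ""
    while True:                                # "Attribute name"
        b = data[pos]
        if b == 0x3D and name:                 # '='
            pos += 1
            break
        if b in WS:                            # "Spaces"
            while data[pos] in WS:
                pos += 1
            if data[pos] != 0x3D:
                return (name, ""), pos
            pos += 1
            break
        if b in b"/>":
            return (name, ""), pos
        name += _lower(b)
        pos += 1
    while data[pos] in WS:                     # "Value"
        pos += 1
    b = data[pos]
    if b in b"\"'":
        while True:                            # "Quote loop"
            pos += 1
            if data[pos] == b:
                return (name, value), pos + 1
            value += _lower(data[pos])
    if b == 0x3E:
        return (name, ""), pos
    while data[pos] not in b"\t\n\x0c\r >":    # unquoted value
        value += _lower(data[pos])
        pos += 1
    return (name, value), pos

def prescan(data, limit=1024):
    """Return an encoding label, or None if the prescan aborts unsuccessfully."""
    data, pos = data[:limit], 0                # end condition: first `limit` bytes
    try:
        while pos < len(data):                 # "Loop"
            if data.startswith(b"<!--", pos):
                pos = data.index(b"-->", pos + 2) + 2
            elif META.match(data, pos):
                pos += 5
                seen, got_pragma, need_pragma, charset = set(), False, None, None
                while True:                    # "Attributes"
                    attr, pos = get_attribute(data, pos)
                    if attr is None:
                        break
                    name, value = attr
                    if name in seen:
                        continue
                    seen.add(name)
                    if name == "http-equiv" and value == "content-type":
                        got_pragma = True
                    elif name == "content" and charset is None and (m := CONTENT_CHARSET.search(value)):
                        charset, need_pragma = m.group(1), True
                    elif name == "charset":
                        charset, need_pragma = value, False
                if need_pragma is False or (need_pragma and got_pragma):  # "Processing"
                    if (codec := _codec_name(charset)) is not None:
                        return "utf-8" if codec.startswith("utf-16") else charset
            elif TAG.match(data, pos):
                while data[pos] not in b"\t\n\x0c\r >":
                    pos += 1
                attr = True
                while attr is not None:        # skip this tag's attributes
                    attr, pos = get_attribute(data, pos)
            elif data[pos:pos + 2] in (b"<!", b"</", b"<?"):
                pos = data.index(b">", pos)
            pos += 1                           # "Next byte"
    except (IndexError, ValueError):
        pass                                   # ran out of bytes
    return None
```

A caller in the position of the encoding sniffing algorithm might use `prescan(first_bytes)` and treat a non-`None` result as the encoding with confidence tentative. Running out of bytes surfaces as an `IndexError` or `ValueError`, which the sketch maps to the "abort unsuccessfully" outcome.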
Modified: complete.html
===================================================================
--- complete.html 2012-02-11 18:45:11 UTC (rev 6989)
+++ complete.html 2012-02-13 21:06:58 UTC (rev 6990)
@@ -240,7 +240,7 @@
<header class=head id=head><p><a class=logo href=http://www.whatwg.org/><img alt=WHATWG height=101 src=/images/logo width=101></a></p>
<hgroup><h1 class=allcaps>HTML</h1>
- <h2 class="no-num no-toc">Living Standard — Last Updated 11 February 2012</h2>
+ <h2 class="no-num no-toc">Living Standard — Last Updated 13 February 2012</h2>
</hgroup><dl><dt><strong>Web developer edition:</strong></dt>
<dd><strong><a href=http://developers.whatwg.org/>http://developers.whatwg.org/</a></strong></dd>
<dt>Multiple-page version:</dt>
@@ -81188,10 +81188,10 @@
parse of the document with the real encoding.</p>
<p id=documentEncoding>User agents must use the following
- algorithm (the <dfn id=encoding-sniffing-algorithm>encoding sniffing algorithm</dfn>) to determine
- the character encoding to use when decoding a document in the first
- pass. This algorithm takes as input any out-of-band metadata
- available to the user agent (e.g. the <a href=#content-type title=Content-Type>Content-Type metadata</a> of the document)
+ algorithm, called the <dfn id=encoding-sniffing-algorithm>encoding sniffing algorithm</dfn>, to
+ determine the character encoding to use when decoding a document in
+ the first pass. This algorithm takes as input any out-of-band
+ metadata available to the user agent (e.g. the <a href=#content-type title=Content-Type>Content-Type metadata</a> of the document)
and all the bytes available so far, and returns an encoding and a
<dfn id=concept-encoding-confidence title=concept-encoding-confidence>confidence</dfn>. The
confidence is either <i>tentative</i>, <i>certain</i>, or
@@ -81227,9 +81227,9 @@
<p class=note>The authoring conformance requirements for
character encoding declarations limit them to only appearing <a href=#charset1024>in the first 1024 bytes</a>. User agents are
- therefore encouraged to use the preparse algorithm below (part of
- these steps) on the first 1024 bytes, but not to stall beyond
- that.</p>
+ therefore encouraged to use the prescan algorithm below (as
+ invoked by these steps) on the first 1024 bytes, but not to stall
+ beyond that.</p>
</li>
@@ -81265,317 +81265,28 @@
</table><p class=note>This step looks for Unicode Byte Order Marks
(BOMs).</li>
- <li><p>Otherwise, the user agent will have to search for explicit
- character encoding information in the file itself. This should
- proceed as follows:
+ <li>
- <p>Let <var title="">position</var> be a pointer to a byte in the
- input stream, initially pointing at the first byte. If at any
- point during these substeps the user agent either runs out of
- bytes or decides that scanning further bytes would not be
- efficient, then skip to the next step of the overall character
- encoding detection algorithm. User agents may decide that scanning
- <em>any</em> bytes is not efficient, in which case these substeps
- are entirely skipped.</p>
+ <p>Otherwise, optionally <a href=#prescan-a-byte-stream-to-determine-its-encoding title="prescan a byte stream to
+ determine its encoding">prescan the byte stream to determine its
+ encoding</a>. The <var title="">end condition</var> is that the
+ user agent decides that scanning further bytes would not be
+ efficient. User agents are encouraged to only prescan the first
+ 1024 bytes. User agents may decide that scanning <em>any</em>
+ bytes is not efficient, in which case these substeps are entirely
+ skipped.</p>
- <p>Now, repeat the following "two" steps until the algorithm
- aborts (either because user agent aborts, as described above, or
- because a character encoding is found):</p>
+ <p>The aforementioned algorithm either aborts unsuccessfully or
+ returns a character encoding. If it returns a character encoding,
+ then this algorithm must be aborted, returning the same encoding,
+ with <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a>
+ <i>tentative</i>.</p>
- <ol><li><p>If <var title="">position</var> points to:</p>
-
- <dl class=switch><dt>A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (ASCII '<!--')</dt>
- <dd>
-
- <p>Advance the <var title="">position</var> pointer so that it
- points at the first 0x3E byte which is preceded by two 0x2D
- bytes (i.e. at the end of an ASCII '-->' sequence) and comes
- after the 0x3C byte that was found. (The two 0x2D bytes can be
- the same as the those in the '<!--' sequence.)</p>
-
- </dd>
-
- <dt>A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII '<meta' followed by a space or slash)</dt>
- <dd>
-
- <ol><li><p>Advance the <var title="">position</var> pointer so
- that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or
- 0x2F byte (the one in sequence of characters matched
- above).</li>
-
- <li><p>Let <var title="">attribute list</var> be an empty
- list of strings.</li> <!-- so long as we only care about
- http-equiv, content, and charset, this can be a 3-bit
- bitfield -->
-
- <li><p>Let <var title="">got pragma</var> be false.</li>
-
- <li><p>Let <var title="">need pragma</var> be null.</li>
-
- <li><p>Let <var title="">charset</var> be the null value
- (which, for the purposes of this algorithm, is distinct from
- an unrecognised encoding or the empty string).</li>
-
- <li><p><i>Attributes</i>: <a href=#concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>Get an
- attribute</a> and its value. If no attribute was sniffed,
- then jump to the <i>processing</i> step below.</li>
-
- <li><p>If the attribute's name is already in <var title="">attribute list</var>, then return to the step
- labeled <i>attributes</i>.</p>
-
- <li><p>Add the attribute's name to <var title="">attribute
- list</var>.</p>
-
- <li>
-
- <p>Run the appropriate step from the following list, if one
- applies:</p>
-
- <dl class=switch><dt>If the attribute's name is "<code title="">http-equiv</code>"</dt>
-
- <dd><p>If the attribute's value is "<code title="">content-type</code>", then set <var title="">got
- pragma</var> to true.</dd>
-
- <dt>If the attribute's name is "<code title="">content</code>"</dt>
-
- <dd><p>Apply the <a href=#algorithm-for-extracting-an-encoding-from-a-meta-element>algorithm for extracting an encoding
- from a <code>meta</code> element</a>, giving the
- attribute's value as the string to parse. If an encoding is
- returned, and if <var title="">charset</var> is still set
- to null, let <var title="">charset</var> be the encoding
- returned, and set <var title="">need pragma</var> to
- true.</dd>
-
- <dt>If the attribute's name is "<code title="">charset</code>"</dt>
-
- <dd><p>Let <var title="">charset</var> be the encoding
- corresponding to the attribute's value, and set <var title="">need pragma</var> to false.</dd>
-
- </dl></li>
-
- <li><p>Return to the step labeled <i>attributes</i>.</li>
-
- <li><p><i>Processing</i>: If <var title="">need pragma</var>
- is null, then jump to the second step of the overall "two
- step" algorithm.</li>
-
- <li><p>If <var title="">need pragma</var> is true but <var title="">got pragma</var> is false, then jump to the second
- step of the overall "two step" algorithm.</li>
-
- <li><p>If <var title="">charset</var> is <a href=#a-utf-16-encoding>a UTF-16
- encoding</a>, change the value of <var title="">charset</var> to UTF-8.</li>
-
- <li><p>If <var title="">charset</var> is not a supported
- character encoding, then jump to the second step of the
- overall "two step" algorithm.</li>
-
- <li><p>Return the encoding given by <var title="">charset</var>, with <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a>
- <i>tentative</i>, and abort all these steps.</li>
-
- </ol></dd>
-
- <dt>A sequence of bytes starting with a 0x3C byte (ASCII <), optionally a 0x2F byte (ASCII /), and finally a byte in the range 0x41-0x5A or 0x61-0x7A (an ASCII letter)</dt>
- <dd>
-
- <ol><li><p>Advance the <var title="">position</var> pointer so
- that it points at the next 0x09 (ASCII TAB), 0x0A (ASCII LF),
- 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E
- (ASCII >) byte.</li>
-
- <li><p>Repeatedly <a href=#concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an
- attribute</a> until no further attributes can be found,
- then jump to the second step in the overall "two step"
- algorithm.</li>
-
- </ol></dd>
-
- <dt>A sequence of bytes starting with: 0x3C 0x21 (ASCII '<!')</dt>
- <dt>A sequence of bytes starting with: 0x3C 0x2F (ASCII '</')</dt>
- <dt>A sequence of bytes starting with: 0x3C 0x3F (ASCII '<?')</dt>
- <dd>
-
- <p>Advance the <var title="">position</var> pointer so that it
- points at the first 0x3E byte (ASCII >) that comes after the
- 0x3C byte that was found.</p>
-
- </dd>
-
- <dt>Any other byte</dt>
- <dd>
-
- <p>Do nothing with that byte.</p>
-
- </dd>
-
- </dl></li>
-
- <li>Move <var title="">position</var> so it points at the next
- byte in the input stream, and return to the first step of this
- "two step" algorithm.</li>
-
- </ol><p>When the above "two step" algorithm says to <dfn id=concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an
- attribute</dfn>, it means doing this:</p>
-
- <ol><li><p>If the byte at <var title="">position</var> is one of 0x09
- (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR),
- 0x20 (ASCII space), or 0x2F (ASCII /) then advance <var title="">position</var> to the next byte and redo this
- substep.</li>
-
- <li><p>If the byte at <var title="">position</var> is 0x3E (ASCII
- >), then abort the "get an attribute" algorithm. There isn't
- one.</li>
-
- <li><p>Otherwise, the byte at <var title="">position</var> is the
- start of the attribute name. Let <var title="">attribute
- name</var> and <var title="">attribute value</var> be the empty
- string.</li>
-
- <li><p><i>Attribute name</i>: Process the byte at <var title="">position</var> as follows:</p>
-
- <dl class=switch><dt>If it is 0x3D (ASCII =), and the <var title="">attribute
- name</var> is longer than the empty string</dt>
-
- <dd>Advance <var title="">position</var> to the next byte and
- jump to the step below labeled <i>value</i>.</dd>
-
- <dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
- FF), 0x0D (ASCII CR), or 0x20 (ASCII space)</dt>
-
- <dd>Jump to the step below labeled <i>spaces</i>.</dd>
-
- <dt>If it is 0x2F (ASCII /) or 0x3E (ASCII >)</dt>
-
- <dd>Abort the "get an attribute" algorithm. The attribute's
- name is the value of <var title="">attribute name</var>, its
- value is the empty string.</dd>
-
- <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
- Z)</dt>
-
- <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute name</var> (where <var title="">b</var> is
- the value of the byte at <var title="">position</var>). (This
- converts the input to lowercase.)</dd>
-
- <dt>Anything else</dt>
-
- <dd>Append the Unicode character with the same code point as the
- value of the byte at <var title="">position</var>) to <var title="">attribute name</var>. (It doesn't actually matter how
- bytes outside the ASCII range are handled here, since only
- ASCII characters can contribute to the detection of a character
- encoding.)</dd>
-
- </dl></li>
-
- <li><p>Advance <var title="">position</var> to the next byte and
- return to the previous step.</li>
-
- <li><p><i>Spaces</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
- LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
- advance <var title="">position</var> to the next byte, then,
- repeat this step.</li>
-
- <li><p>If the byte at <var title="">position</var> is
- <em>not</em> 0x3D (ASCII =), abort the "get an attribute"
- algorithm. The attribute's name is the value of <var title="">attribute name</var>, its value is the empty
- string.</li>
-
- <li><p>Advance <var title="">position</var> past the 0x3D (ASCII
- =) byte.</li>
-
- <li><p><i>Value</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
- LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
- advance <var title="">position</var> to the next byte, then,
- repeat this step.</li>
-
- <li><p>Process the byte at <var title="">position</var> as
- follows:</p>
-
- <dl class=switch><dt>If it is 0x22 (ASCII ") or 0x27 (ASCII ')</dt>
-
- <dd>
-
- <ol><li>Let <var title="">b</var> be the value of the byte at
- <var title="">position</var>.</li>
-
- <li>Advance <var title="">position</var> to the next
- byte.</li>
-
- <li>If the value of the byte at <var title="">position</var>
- is the value of <var title="">b</var>, then advance <var title="">position</var> to the next byte and abort the "get
- an attribute" algorithm. The attribute's name is the value of
- <var title="">attribute name</var>, and its value is the
- value of <var title="">attribute value</var>.</li>
-
- <li>Otherwise, if the value of the byte at <var title="">position</var> is in the range 0x41 (ASCII A) to
- 0x5A (ASCII Z), then append a Unicode character to <var title="">attribute value</var> whose code point is 0x20 more
- than the value of the byte at <var title="">position</var>.</li>
-
- <li>Otherwise, append a Unicode character to <var title="">attribute value</var> whose code point is the same as
- the value of the byte at <var title="">position</var>.</li>
-
- <li>Return to the second step in these substeps.</li>
-
- </ol></dd>
-
- <dt>If it is 0x3E (ASCII >)</dt>
-
- <dd>Abort the "get an attribute" algorithm. The attribute's
- name is the value of <var title="">attribute name</var>, its
- value is the empty string.</dd>
-
-
- <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
- Z)</dt>
-
- <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute
- value</var> (where <var title="">b</var> is the value of the
- byte at <var title="">position</var>). Advance <var title="">position</var> to the next byte.</dd>
-
- <dt>Anything else</dt>
-
- <dd>Append the Unicode character with the same code point as the
- value of the byte at <var title="">position</var>) to <var title="">attribute value</var>. Advance <var title="">position</var> to the next byte.</dd>
-
- </dl></li>
-
- <li><p>Process the byte at <var title="">position</var> as
- follows:</p>
-
- <dl class=switch><dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
- FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E (ASCII
- >)</dt>
-
- <dd>Abort the "get an attribute" algorithm. The attribute's
- name is the value of <var title="">attribute name</var> and its
- value is the value of <var title="">attribute value</var>.</dd>
-
- <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
- Z)</dt>
-
- <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute
- value</var> (where <var title="">b</var> is the value of the
- byte at <var title="">position</var>).</dd>
-
- <dt>Anything else</dt>
-
- <dd>Append the Unicode character with the same code point as the
- value of the byte at <var title="">position</var>) to <var title="">attribute value</var>.</dd>
-
- </dl></li>
-
- <li><p>Advance <var title="">position</var> to the next byte and
- return to the previous step.</li>
-
- </ol><p>For the sake of interoperability, user agents should not use a
- pre-scan algorithm that returns different results than the one
- described above. (But, if you do, please at least let us know, so
- that we can improve this algorithm and benefit everyone...)</p>
-
</li>
- <li><p>If the user agent has information on the likely encoding for
- this page, e.g. based on the encoding of the page when it was last
- visited, then return that encoding, with the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a>
+ <li><p>Otherwise, if the user agent has information on the likely
+ encoding for this page, e.g. based on the encoding of the page when
+ it was last visited, then return that encoding, with the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a>
<i>tentative</i>, and abort these steps.</li>
<li>
@@ -81719,18 +81430,328 @@
as the user agent uses the returned value to select the decoder to
use for the input stream.</p>
+ <hr><p>When an algorithm requires a user agent to <dfn id=prescan-a-byte-stream-to-determine-its-encoding>prescan a byte
+ stream to determine its encoding</dfn>, given some defined <var title="">end condition</var>, then it must run the following steps.
+ These steps either abort unsuccessfully or return a character
+ encoding.</p>
+
+ <ol><li>
+
+ <p>Let <var title="">position</var> be a pointer to a byte in the
+ input stream, initially pointing at the first byte. If at any
+ point during these steps the user agent either runs out of bytes
+ or reaches its <var title="">end condition</var>, then abort the
+ <a href=#prescan-a-byte-stream-to-determine-its-encoding>prescan a byte stream to determine its encoding</a>
+ algorithm unsuccessfully.</p>
+
+ </li>
+
+ <li>
+
+ <p><i>Loop</i>: If <var title="">position</var> points to:</p>
+
+ <dl class=switch><dt>A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (ASCII '<!--')</dt>
+ <dd>
+
+ <p>Advance the <var title="">position</var> pointer so that it
+ points at the first 0x3E byte which is preceded by two 0x2D
+ bytes (i.e. at the end of an ASCII '-->' sequence) and comes
+ after the 0x3C byte that was found. (The two 0x2D bytes can be
+ the same as those in the '<!--' sequence.)</p>
+
+ </dd>
+
+ <dt>A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII '<meta' followed by a space or slash)</dt>
+ <dd>
+
+ <ol><li><p>Advance the <var title="">position</var> pointer so
+ that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or
+ 0x2F byte (the one in sequence of characters matched
+ above).</li>
+
+ <li><p>Let <var title="">attribute list</var> be an empty
+ list of strings.</li> <!-- so long as we only care about
+ http-equiv, content, and charset, this can be a 3-bit
+ bitfield -->
+
+ <li><p>Let <var title="">got pragma</var> be false.</li>
+
+ <li><p>Let <var title="">need pragma</var> be null.</li>
+
+ <li><p>Let <var title="">charset</var> be the null value
+ (which, for the purposes of this algorithm, is distinct from
+ an unrecognised encoding or the empty string).</li>
+
+ <li><p><i>Attributes</i>: <a href=#concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>Get an
+ attribute</a> and its value. If no attribute was sniffed,
+ then jump to the <i>processing</i> step below.</li>
+
+ <li><p>If the attribute's name is already in <var title="">attribute list</var>, then return to the step
+ labeled <i>attributes</i>.</p>
+
+ <li><p>Add the attribute's name to <var title="">attribute
+ list</var>.</p>
+
+ <li>
+
+ <p>Run the appropriate step from the following list, if one
+ applies:</p>
+
+ <dl class=switch><dt>If the attribute's name is "<code title="">http-equiv</code>"</dt>
+
+ <dd><p>If the attribute's value is "<code title="">content-type</code>", then set <var title="">got
+ pragma</var> to true.</dd>
+
+ <dt>If the attribute's name is "<code title="">content</code>"</dt>
+
+ <dd><p>Apply the <a href=#algorithm-for-extracting-an-encoding-from-a-meta-element>algorithm for extracting an encoding
+ from a <code>meta</code> element</a>, giving the
+ attribute's value as the string to parse. If an encoding is
+ returned, and if <var title="">charset</var> is still set
+ to null, let <var title="">charset</var> be the encoding
+ returned, and set <var title="">need pragma</var> to
+ true.</dd>
+
+ <dt>If the attribute's name is "<code title="">charset</code>"</dt>
+
+ <dd><p>Let <var title="">charset</var> be the encoding
+ corresponding to the attribute's value, and set <var title="">need pragma</var> to false.</dd>
+
+ </dl></li>
+
+ <li><p>Return to the step labeled <i>attributes</i>.</li>
+
+ <li><p><i>Processing</i>: If <var title="">need pragma</var> is
+ null, then jump to the step below labeled <i>next
+ byte</i>.</li>
+
+ <li><p>If <var title="">need pragma</var> is true but <var title="">got pragma</var> is false, then jump to the step below
+ labeled <i>next byte</i>.</li>
+
+ <li><p>If <var title="">charset</var> is <a href=#a-utf-16-encoding>a UTF-16
+ encoding</a>, change the value of <var title="">charset</var> to UTF-8.</li>
+
+ <li><p>If <var title="">charset</var> is not a supported
+ character encoding, then jump to the step below labeled <i>next
+ byte</i>.</li>
+
+ <li><p>Abort the <a href=#prescan-a-byte-stream-to-determine-its-encoding>prescan a byte stream to determine its
+ encoding</a> algorithm, returning the encoding given by <var title="">charset</var>.</li>
+
+ </ol></dd>
+
+ <dt>A sequence of bytes starting with a 0x3C byte (ASCII <), optionally a 0x2F byte (ASCII /), and finally a byte in the range 0x41-0x5A or 0x61-0x7A (an ASCII letter)</dt>
+ <dd>
+
+ <ol><li><p>Advance the <var title="">position</var> pointer so
+ that it points at the next 0x09 (ASCII TAB), 0x0A (ASCII LF),
+ 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E
+ (ASCII >) byte.</li>
+
+ <li><p>Repeatedly <a href=#concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an
+ attribute</a> until no further attributes can be found, then
+ jump to the step below labeled <i>next byte</i>.</li>
+
+ </ol></dd>
+
+ <dt>A sequence of bytes starting with: 0x3C 0x21 (ASCII '<!')</dt>
+ <dt>A sequence of bytes starting with: 0x3C 0x2F (ASCII '</')</dt>
+ <dt>A sequence of bytes starting with: 0x3C 0x3F (ASCII '<?')</dt>
+ <dd>
+
+ <p>Advance the <var title="">position</var> pointer so that it
+ points at the first 0x3E byte (ASCII >) that comes after the
+ 0x3C byte that was found.</p>
+
+ </dd>
+
+ <dt>Any other byte</dt>
+ <dd>
+
+ <p>Do nothing with that byte.</p>
+
+ </dd>
+
+ </dl></li>
+
+ <li><i>Next byte</i>: Move <var title="">position</var> so it
+ points at the next byte in the input stream, and return to the step
+ above labeled <i>loop</i>.</li>
+
+ </ol><p>When the <a href=#prescan-a-byte-stream-to-determine-its-encoding>prescan a byte stream to determine its
+ encoding</a> algorithm says to <dfn id=concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an attribute</dfn>,
+ it means doing this:</p>
+
+ <ol><li><p>If the byte at <var title="">position</var> is one of 0x09
+ (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR),
+ 0x20 (ASCII space), or 0x2F (ASCII /) then advance <var title="">position</var> to the next byte and redo this
+ step.</li>
+
+ <li><p>If the byte at <var title="">position</var> is 0x3E (ASCII
+ >), then abort the <a href=#concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an
+ attribute</a> algorithm. There isn't one.</li>
+
+ <li><p>Otherwise, the byte at <var title="">position</var> is the
+ start of the attribute name. Let <var title="">attribute name</var>
+ and <var title="">attribute value</var> be the empty
+ string.</li>
+
+ <li><p><i>Attribute name</i>: Process the byte at <var title="">position</var> as follows:</p>
+
+ <dl class=switch><dt>If it is 0x3D (ASCII =), and the <var title="">attribute
+ name</var> is longer than the empty string</dt>
+
+ <dd>Advance <var title="">position</var> to the next byte and
+ jump to the step below labeled <i>value</i>.</dd>
+
+ <dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
+ FF), 0x0D (ASCII CR), or 0x20 (ASCII space)</dt>
+
+ <dd>Jump to the step below labeled <i>spaces</i>.</dd>
+
+ <dt>If it is 0x2F (ASCII /) or 0x3E (ASCII >)</dt>
+
+ <dd>Abort the <a href=#concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an
+ attribute</a> algorithm. The attribute's name is the value of
+ <var title="">attribute name</var>, its value is the empty
+ string.</dd>
+
+ <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
+ Z)</dt>
+
+ <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute name</var> (where <var title="">b</var> is
+ the value of the byte at <var title="">position</var>). (This
+ converts the input to lowercase.)</dd>
+
+ <dt>Anything else</dt>
+
+ <dd>Append the Unicode character with the same code point as the
+ value of the byte at <var title="">position</var> to <var title="">attribute name</var>. (It doesn't actually matter how
+ bytes outside the ASCII range are handled here, since only
+ ASCII characters can contribute to the detection of a character
+ encoding.)</dd>
+
+ </dl></li>
+
+ <li><p>Advance <var title="">position</var> to the next byte and
+ return to the previous step.</li>
+
+ <li><p><i>Spaces</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
+ LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
+ advance <var title="">position</var> to the next byte, then,
+ repeat this step.</li>
+
+ <li><p>If the byte at <var title="">position</var> is <em>not</em>
+ 0x3D (ASCII =), abort the <a href=#concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an
+ attribute</a> algorithm. The attribute's name is the value of
+ <var title="">attribute name</var>, its value is the empty
+ string.</li>
+
+ <li><p>Advance <var title="">position</var> past the 0x3D (ASCII
+ =) byte.</li>
+
+ <li><p><i>Value</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
+ LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
+ advance <var title="">position</var> to the next byte, then,
+ repeat this step.</li>
+
+ <li><p>Process the byte at <var title="">position</var> as
+ follows:</p>
+
+ <dl class=switch><dt>If it is 0x22 (ASCII ") or 0x27 (ASCII ')</dt>
+
+ <dd>
+
+ <ol><li>Let <var title="">b</var> be the value of the byte at
+ <var title="">position</var>.</li>
+
+ <li><i>Quote loop</i>: Advance <var title="">position</var> to
+ the next byte.</li>
+
+ <li>If the value of the byte at <var title="">position</var> is
+ the value of <var title="">b</var>, then advance <var title="">position</var> to the next byte and abort the "get an
+ attribute" algorithm. The attribute's name is the value of <var title="">attribute name</var>, and its value is the value of
+ <var title="">attribute value</var>.</li>
+
+ <li>Otherwise, if the value of the byte at <var title="">position</var> is in the range 0x41 (ASCII A) to 0x5A
+ (ASCII Z), then append a Unicode character to <var title="">attribute value</var> whose code point is 0x20 more
+ than the value of the byte at <var title="">position</var>.</li>
+
+ <li>Otherwise, append a Unicode character to <var title="">attribute value</var> whose code point is the same as
+ the value of the byte at <var title="">position</var>.</li>
+
+ <li>Return to the step above labeled <i>quote loop</i>.</li>
+
+ </ol></dd>
+
+ <dt>If it is 0x3E (ASCII >)</dt>
+
+ <dd>Abort the <a href=#concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an
+ attribute</a> algorithm. The attribute's name is the value of
+ <var title="">attribute name</var>, its value is the empty
+ string.</dd>
+
+
+ <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
+ Z)</dt>
+
+ <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute value</var> (where <var title="">b</var> is
+ the value of the byte at <var title="">position</var>). Advance
+ <var title="">position</var> to the next byte.</dd>
+
+ <dt>Anything else</dt>
+
+ <dd>Append the Unicode character with the same code point as the
+ value of the byte at <var title="">position</var> to <var title="">attribute value</var>. Advance <var title="">position</var> to the next byte.</dd>
+
+ </dl></li>
+
+ <li><p>Process the byte at <var title="">position</var> as
+ follows:</p>
+
+ <dl class=switch><dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
+ FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E (ASCII
+ >)</dt>
+
+ <dd>Abort the <a href=#concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an
+ attribute</a> algorithm. The attribute's name is the value of
+ <var title="">attribute name</var> and its value is the value of
+ <var title="">attribute value</var>.</dd>
+
+ <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII Z)</dt>
+
+ <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute value</var> (where <var title="">b</var> is
+ the value of the byte at <var title="">position</var>).</dd>
+
+ <dt>Anything else</dt>
+
+ <dd>Append the Unicode character with the same code point as the
+ value of the byte at <var title="">position</var> to <var title="">attribute value</var>.</dd>
+
+ </dl></li>
+
+ <li><p>Advance <var title="">position</var> to the next byte and
+ return to the previous step.</li>
+
+ </ol><p>For the sake of interoperability, user agents should not use a
+ pre-scan algorithm that returns different results than the one
+ described above. (But, if you do, please at least let us know, so
+ that we can improve this algorithm and benefit everyone...)</p>
+
<!--(removed this since the specs are being changed)
- <p class="note">This algorithm is a <span>willful violation</span>
- of the HTTP specification, which requires that the encoding be
- assumed to be ISO-8859-1 in the absence of a <span>character
- encoding declaration</span> to the contrary, and of RFC 2046,
- which requires that the encoding be assumed to be US-ASCII in the
- absence of a <span>character encoding declaration</span> to the
- contrary. This specification's third approach is motivated by a
+ <p class="note">These algorithms are a <span>willful
+ violation</span> of the HTTP specification, which requires that the
+ encoding be assumed to be ISO-8859-1 in the absence of a
+ <span>character encoding declaration</span> to the contrary, and of
+ RFC 2046, which requires that the encoding be assumed to be US-ASCII
+ in the absence of a <span>character encoding declaration</span> to
+ the contrary. This specification's third approach is motivated by a
desire to be maximally compatible with legacy content. <a
href="#refsHTTP">[HTTP]</a> <a href="#refsRFC2046">[RFC2046]</a></p>
-->
+
+
<h5 id=character-encodings-0><span class=secno>12.2.2.2 </span>Character encodings</h5>
<p>User agents must at a minimum support the UTF-8 and Windows-1252
Modified: index
===================================================================
--- index 2012-02-11 18:45:11 UTC (rev 6989)
+++ index 2012-02-13 21:06:58 UTC (rev 6990)
@@ -240,7 +240,7 @@
<header class=head id=head><p><a class=logo href=http://www.whatwg.org/><img alt=WHATWG height=101 src=/images/logo width=101></a></p>
<hgroup><h1 class=allcaps>HTML</h1>
- <h2 class="no-num no-toc">Living Standard — Last Updated 11 February 2012</h2>
+ <h2 class="no-num no-toc">Living Standard — Last Updated 13 February 2012</h2>
</hgroup><dl><dt><strong>Web developer edition:</strong></dt>
<dd><strong><a href=http://developers.whatwg.org/>http://developers.whatwg.org/</a></strong></dd>
<dt>Multiple-page version:</dt>
@@ -81188,10 +81188,10 @@
parse of the document with the real encoding.</p>
<p id=documentEncoding>User agents must use the following
- algorithm (the <dfn id=encoding-sniffing-algorithm>encoding sniffing algorithm</dfn>) to determine
- the character encoding to use when decoding a document in the first
- pass. This algorithm takes as input any out-of-band metadata
- available to the user agent (e.g. the <a href=#content-type title=Content-Type>Content-Type metadata</a> of the document)
+ algorithm, called the <dfn id=encoding-sniffing-algorithm>encoding sniffing algorithm</dfn>, to
+ determine the character encoding to use when decoding a document in
+ the first pass. This algorithm takes as input any out-of-band
+ metadata available to the user agent (e.g. the <a href=#content-type title=Content-Type>Content-Type metadata</a> of the document)
and all the bytes available so far, and returns an encoding and a
<dfn id=concept-encoding-confidence title=concept-encoding-confidence>confidence</dfn>. The
confidence is either <i>tentative</i>, <i>certain</i>, or
@@ -81227,9 +81227,9 @@
<p class=note>The authoring conformance requirements for
character encoding declarations limit them to only appearing <a href=#charset1024>in the first 1024 bytes</a>. User agents are
- therefore encouraged to use the preparse algorithm below (part of
- these steps) on the first 1024 bytes, but not to stall beyond
- that.</p>
+ therefore encouraged to use the prescan algorithm below (as
+ invoked by these steps) on the first 1024 bytes, but not to stall
+ beyond that.</p>
</li>
@@ -81265,317 +81265,28 @@
</table><p class=note>This step looks for Unicode Byte Order Marks
(BOMs).</li>
- <li><p>Otherwise, the user agent will have to search for explicit
- character encoding information in the file itself. This should
- proceed as follows:
+ <li>
- <p>Let <var title="">position</var> be a pointer to a byte in the
- input stream, initially pointing at the first byte. If at any
- point during these substeps the user agent either runs out of
- bytes or decides that scanning further bytes would not be
- efficient, then skip to the next step of the overall character
- encoding detection algorithm. User agents may decide that scanning
- <em>any</em> bytes is not efficient, in which case these substeps
- are entirely skipped.</p>
+ <p>Otherwise, optionally <a href=#prescan-a-byte-stream-to-determine-its-encoding title="prescan a byte stream to
+ determine its encoding">prescan the byte stream to determine its
+ encoding</a>. The <var title="">end condition</var> is that the
+ user agent decides that scanning further bytes would not be
+ efficient. User agents are encouraged to only prescan the first
+ 1024 bytes. User agents may decide that scanning <em>any</em>
+ bytes is not efficient, in which case these substeps are entirely
+ skipped.</p>
- <p>Now, repeat the following "two" steps until the algorithm
- aborts (either because user agent aborts, as described above, or
- because a character encoding is found):</p>
+ <p>The aforementioned algorithm either aborts unsuccessfully or
+ returns a character encoding. If it returns a character encoding,
+ then this algorithm must be aborted, returning the same encoding,
+ with <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a>
+ <i>tentative</i>.</p>
- <ol><li><p>If <var title="">position</var> points to:</p>
-
- <dl class=switch><dt>A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (ASCII '<!--')</dt>
- <dd>
-
- <p>Advance the <var title="">position</var> pointer so that it
- points at the first 0x3E byte which is preceded by two 0x2D
- bytes (i.e. at the end of an ASCII '-->' sequence) and comes
- after the 0x3C byte that was found. (The two 0x2D bytes can be
- the same as the those in the '<!--' sequence.)</p>
-
- </dd>
-
- <dt>A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII '<meta' followed by a space or slash)</dt>
- <dd>
-
- <ol><li><p>Advance the <var title="">position</var> pointer so
- that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or
- 0x2F byte (the one in sequence of characters matched
- above).</li>
-
- <li><p>Let <var title="">attribute list</var> be an empty
- list of strings.</li> <!-- so long as we only care about
- http-equiv, content, and charset, this can be a 3-bit
- bitfield -->
-
- <li><p>Let <var title="">got pragma</var> be false.</li>
-
- <li><p>Let <var title="">need pragma</var> be null.</li>
-
- <li><p>Let <var title="">charset</var> be the null value
- (which, for the purposes of this algorithm, is distinct from
- an unrecognised encoding or the empty string).</li>
-
- <li><p><i>Attributes</i>: <a href=#concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>Get an
- attribute</a> and its value. If no attribute was sniffed,
- then jump to the <i>processing</i> step below.</li>
-
- <li><p>If the attribute's name is already in <var title="">attribute list</var>, then return to the step
- labeled <i>attributes</i>.</p>
-
- <li><p>Add the attribute's name to <var title="">attribute
- list</var>.</p>
-
- <li>
-
- <p>Run the appropriate step from the following list, if one
- applies:</p>
-
- <dl class=switch><dt>If the attribute's name is "<code title="">http-equiv</code>"</dt>
-
- <dd><p>If the attribute's value is "<code title="">content-type</code>", then set <var title="">got
- pragma</var> to true.</dd>
-
- <dt>If the attribute's name is "<code title="">content</code>"</dt>
-
- <dd><p>Apply the <a href=#algorithm-for-extracting-an-encoding-from-a-meta-element>algorithm for extracting an encoding
- from a <code>meta</code> element</a>, giving the
- attribute's value as the string to parse. If an encoding is
- returned, and if <var title="">charset</var> is still set
- to null, let <var title="">charset</var> be the encoding
- returned, and set <var title="">need pragma</var> to
- true.</dd>
-
- <dt>If the attribute's name is "<code title="">charset</code>"</dt>
-
- <dd><p>Let <var title="">charset</var> be the encoding
- corresponding to the attribute's value, and set <var title="">need pragma</var> to false.</dd>
-
- </dl></li>
-
- <li><p>Return to the step labeled <i>attributes</i>.</li>
-
- <li><p><i>Processing</i>: If <var title="">need pragma</var>
- is null, then jump to the second step of the overall "two
- step" algorithm.</li>
-
- <li><p>If <var title="">need pragma</var> is true but <var title="">got pragma</var> is false, then jump to the second
- step of the overall "two step" algorithm.</li>
-
- <li><p>If <var title="">charset</var> is <a href=#a-utf-16-encoding>a UTF-16
- encoding</a>, change the value of <var title="">charset</var> to UTF-8.</li>
-
- <li><p>If <var title="">charset</var> is not a supported
- character encoding, then jump to the second step of the
- overall "two step" algorithm.</li>
-
- <li><p>Return the encoding given by <var title="">charset</var>, with <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a>
- <i>tentative</i>, and abort all these steps.</li>
-
- </ol></dd>
-
- <dt>A sequence of bytes starting with a 0x3C byte (ASCII <), optionally a 0x2F byte (ASCII /), and finally a byte in the range 0x41-0x5A or 0x61-0x7A (an ASCII letter)</dt>
- <dd>
-
- <ol><li><p>Advance the <var title="">position</var> pointer so
- that it points at the next 0x09 (ASCII TAB), 0x0A (ASCII LF),
- 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E
- (ASCII >) byte.</li>
-
- <li><p>Repeatedly <a href=#concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an
- attribute</a> until no further attributes can be found,
- then jump to the second step in the overall "two step"
- algorithm.</li>
-
- </ol></dd>
-
- <dt>A sequence of bytes starting with: 0x3C 0x21 (ASCII '<!')</dt>
- <dt>A sequence of bytes starting with: 0x3C 0x2F (ASCII '</')</dt>
- <dt>A sequence of bytes starting with: 0x3C 0x3F (ASCII '<?')</dt>
- <dd>
-
- <p>Advance the <var title="">position</var> pointer so that it
- points at the first 0x3E byte (ASCII >) that comes after the
- 0x3C byte that was found.</p>
-
- </dd>
-
- <dt>Any other byte</dt>
- <dd>
-
- <p>Do nothing with that byte.</p>
-
- </dd>
-
- </dl></li>
-
- <li>Move <var title="">position</var> so it points at the next
- byte in the input stream, and return to the first step of this
- "two step" algorithm.</li>
-
- </ol><p>When the above "two step" algorithm says to <dfn id=concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an
- attribute</dfn>, it means doing this:</p>
-
- <ol><li><p>If the byte at <var title="">position</var> is one of 0x09
- (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR),
- 0x20 (ASCII space), or 0x2F (ASCII /) then advance <var title="">position</var> to the next byte and redo this
- substep.</li>
-
- <li><p>If the byte at <var title="">position</var> is 0x3E (ASCII
- >), then abort the "get an attribute" algorithm. There isn't
- one.</li>
-
- <li><p>Otherwise, the byte at <var title="">position</var> is the
- start of the attribute name. Let <var title="">attribute
- name</var> and <var title="">attribute value</var> be the empty
- string.</li>
-
- <li><p><i>Attribute name</i>: Process the byte at <var title="">position</var> as follows:</p>
-
- <dl class=switch><dt>If it is 0x3D (ASCII =), and the <var title="">attribute
- name</var> is longer than the empty string</dt>
-
- <dd>Advance <var title="">position</var> to the next byte and
- jump to the step below labeled <i>value</i>.</dd>
-
- <dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
- FF), 0x0D (ASCII CR), or 0x20 (ASCII space)</dt>
-
- <dd>Jump to the step below labeled <i>spaces</i>.</dd>
-
- <dt>If it is 0x2F (ASCII /) or 0x3E (ASCII >)</dt>
-
- <dd>Abort the "get an attribute" algorithm. The attribute's
- name is the value of <var title="">attribute name</var>, its
- value is the empty string.</dd>
-
- <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
- Z)</dt>
-
- <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute name</var> (where <var title="">b</var> is
- the value of the byte at <var title="">position</var>). (This
- converts the input to lowercase.)</dd>
-
- <dt>Anything else</dt>
-
- <dd>Append the Unicode character with the same code point as the
- value of the byte at <var title="">position</var>) to <var title="">attribute name</var>. (It doesn't actually matter how
- bytes outside the ASCII range are handled here, since only
- ASCII characters can contribute to the detection of a character
- encoding.)</dd>
-
- </dl></li>
-
- <li><p>Advance <var title="">position</var> to the next byte and
- return to the previous step.</li>
-
- <li><p><i>Spaces</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
- LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
- advance <var title="">position</var> to the next byte, then,
- repeat this step.</li>
-
- <li><p>If the byte at <var title="">position</var> is
- <em>not</em> 0x3D (ASCII =), abort the "get an attribute"
- algorithm. The attribute's name is the value of <var title="">attribute name</var>, its value is the empty
- string.</li>
-
- <li><p>Advance <var title="">position</var> past the 0x3D (ASCII
- =) byte.</li>
-
- <li><p><i>Value</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
- LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
- advance <var title="">position</var> to the next byte, then,
- repeat this step.</li>
-
- <li><p>Process the byte at <var title="">position</var> as
- follows:</p>
-
- <dl class=switch><dt>If it is 0x22 (ASCII ") or 0x27 (ASCII ')</dt>
-
- <dd>
-
- <ol><li>Let <var title="">b</var> be the value of the byte at
- <var title="">position</var>.</li>
-
- <li>Advance <var title="">position</var> to the next
- byte.</li>
-
- <li>If the value of the byte at <var title="">position</var>
- is the value of <var title="">b</var>, then advance <var title="">position</var> to the next byte and abort the "get
- an attribute" algorithm. The attribute's name is the value of
- <var title="">attribute name</var>, and its value is the
- value of <var title="">attribute value</var>.</li>
-
- <li>Otherwise, if the value of the byte at <var title="">position</var> is in the range 0x41 (ASCII A) to
- 0x5A (ASCII Z), then append a Unicode character to <var title="">attribute value</var> whose code point is 0x20 more
- than the value of the byte at <var title="">position</var>.</li>
-
- <li>Otherwise, append a Unicode character to <var title="">attribute value</var> whose code point is the same as
- the value of the byte at <var title="">position</var>.</li>
-
- <li>Return to the second step in these substeps.</li>
-
- </ol></dd>
-
- <dt>If it is 0x3E (ASCII >)</dt>
-
- <dd>Abort the "get an attribute" algorithm. The attribute's
- name is the value of <var title="">attribute name</var>, its
- value is the empty string.</dd>
-
-
- <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
- Z)</dt>
-
- <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute
- value</var> (where <var title="">b</var> is the value of the
- byte at <var title="">position</var>). Advance <var title="">position</var> to the next byte.</dd>
-
- <dt>Anything else</dt>
-
- <dd>Append the Unicode character with the same code point as the
- value of the byte at <var title="">position</var>) to <var title="">attribute value</var>. Advance <var title="">position</var> to the next byte.</dd>
-
- </dl></li>
-
- <li><p>Process the byte at <var title="">position</var> as
- follows:</p>
-
- <dl class=switch><dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
- FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E (ASCII
- >)</dt>
-
- <dd>Abort the "get an attribute" algorithm. The attribute's
- name is the value of <var title="">attribute name</var> and its
- value is the value of <var title="">attribute value</var>.</dd>
-
- <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
- Z)</dt>
-
- <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute
- value</var> (where <var title="">b</var> is the value of the
- byte at <var title="">position</var>).</dd>
-
- <dt>Anything else</dt>
-
- <dd>Append the Unicode character with the same code point as the
- value of the byte at <var title="">position</var>) to <var title="">attribute value</var>.</dd>
-
- </dl></li>
-
- <li><p>Advance <var title="">position</var> to the next byte and
- return to the previous step.</li>
-
- </ol><p>For the sake of interoperability, user agents should not use a
- pre-scan algorithm that returns different results than the one
- described above. (But, if you do, please at least let us know, so
- that we can improve this algorithm and benefit everyone...)</p>
-
</li>
- <li><p>If the user agent has information on the likely encoding for
- this page, e.g. based on the encoding of the page when it was last
- visited, then return that encoding, with the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a>
+ <li><p>Otherwise, if the user agent has information on the likely
+ encoding for this page, e.g. based on the encoding of the page when
+ it was last visited, then return that encoding, with the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a>
<i>tentative</i>, and abort these steps.</li>
<li>
@@ -81719,18 +81430,328 @@
as the user agent uses the returned value to select the decoder to
use for the input stream.</p>
+ <hr><p>When an algorithm requires a user agent to <dfn id=prescan-a-byte-stream-to-determine-its-encoding>prescan a byte
+ stream to determine its encoding</dfn>, given some defined <var title="">end condition</var>, then it must run the following steps.
+ These steps either abort unsuccessfully or return a character
+ encoding.</p>
+
+ <ol><li>
+
+ <p>Let <var title="">position</var> be a pointer to a byte in the
+ input stream, initially pointing at the first byte. If at any
+ point during these steps the user agent either runs out of bytes
+ or reaches its <var title="">end condition</var>, then abort the
+ <a href=#prescan-a-byte-stream-to-determine-its-encoding>prescan a byte stream to determine its encoding</a>
+ algorithm unsuccessfully.</p>
+
+ </li>
+
+ <li>
+
+ <p><i>Loop</i>: If <var title="">position</var> points to:</p>
+
+ <dl class=switch><dt>A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (ASCII '<!--')</dt>
+ <dd>
+
+ <p>Advance the <var title="">position</var> pointer so that it
+ points at the first 0x3E byte which is preceded by two 0x2D
+ bytes (i.e. at the end of an ASCII '-->' sequence) and comes
+ after the 0x3C byte that was found. (The two 0x2D bytes can be
+ the same as those in the '<!--' sequence.)</p>
+
+ </dd>
+
+ <dt>A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII '<meta' followed by a space or slash)</dt>
+ <dd>
+
+ <ol><li><p>Advance the <var title="">position</var> pointer so
+ that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or
+ 0x2F byte (the one in the sequence of characters matched
+ above).</li>
+
+ <li><p>Let <var title="">attribute list</var> be an empty
+ list of strings.</li> <!-- so long as we only care about
+ http-equiv, content, and charset, this can be a 3-bit
+ bitfield -->
+
+ <li><p>Let <var title="">got pragma</var> be false.</li>
+
+ <li><p>Let <var title="">need pragma</var> be null.</li>
+
+ <li><p>Let <var title="">charset</var> be the null value
+ (which, for the purposes of this algorithm, is distinct from
+ an unrecognised encoding or the empty string).</li>
+
+ <li><p><i>Attributes</i>: <a href=#concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>Get an
+ attribute</a> and its value. If no attribute was sniffed,
+ then jump to the <i>processing</i> step below.</li>
+
+ <li><p>If the attribute's name is already in <var title="">attribute list</var>, then return to the step
+ labeled <i>attributes</i>.</p>
+
+ <li><p>Add the attribute's name to <var title="">attribute
+ list</var>.</p>
+
+ <li>
+
+ <p>Run the appropriate step from the following list, if one
+ applies:</p>
+
+ <dl class=switch><dt>If the attribute's name is "<code title="">http-equiv</code>"</dt>
+
+ <dd><p>If the attribute's value is "<code title="">content-type</code>", then set <var title="">got
+ pragma</var> to true.</dd>
+
+ <dt>If the attribute's name is "<code title="">content</code>"</dt>
+
+ <dd><p>Apply the <a href=#algorithm-for-extracting-an-encoding-from-a-meta-element>algorithm for extracting an encoding
+ from a <code>meta</code> element</a>, giving the
+ attribute's value as the string to parse. If an encoding is
+ returned, and if <var title="">charset</var> is still set
+ to null, let <var title="">charset</var> be the encoding
+ returned, and set <var title="">need pragma</var> to
+ true.</dd>
+
+ <dt>If the attribute's name is "<code title="">charset</code>"</dt>
+
+ <dd><p>Let <var title="">charset</var> be the encoding
+ corresponding to the attribute's value, and set <var title="">need pragma</var> to false.</dd>
+
+ </dl></li>
+
+ <li><p>Return to the step labeled <i>attributes</i>.</li>
+
+ <li><p><i>Processing</i>: If <var title="">need pragma</var> is
+ null, then jump to the step below labeled <i>next
+ byte</i>.</li>
+
+ <li><p>If <var title="">need pragma</var> is true but <var title="">got pragma</var> is false, then jump to the step below
+ labeled <i>next byte</i>.</li>
+
+ <li><p>If <var title="">charset</var> is <a href=#a-utf-16-encoding>a UTF-16
+ encoding</a>, change the value of <var title="">charset</var> to UTF-8.</li>
+
+ <li><p>If <var title="">charset</var> is not a supported
+ character encoding, then jump to the step below labeled <i>next
+ byte</i>.</li>
+
+ <li><p>Abort the <a href=#prescan-a-byte-stream-to-determine-its-encoding>prescan a byte stream to determine its
+ encoding</a> algorithm, returning the encoding given by <var title="">charset</var>.</li>
+
+ </ol></dd>
+
+ <dt>A sequence of bytes starting with a 0x3C byte (ASCII <), optionally a 0x2F byte (ASCII /), and finally a byte in the range 0x41-0x5A or 0x61-0x7A (an ASCII letter)</dt>
+ <dd>
+
+ <ol><li><p>Advance the <var title="">position</var> pointer so
+ that it points at the next 0x09 (ASCII TAB), 0x0A (ASCII LF),
+ 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E
+ (ASCII >) byte.</li>
+
+ <li><p>Repeatedly <a href=#concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an
+ attribute</a> until no further attributes can be found, then
+ jump to the step below labeled <i>next byte</i>.</li>
+
+ </ol></dd>
+
+ <dt>A sequence of bytes starting with: 0x3C 0x21 (ASCII '<!')</dt>
+ <dt>A sequence of bytes starting with: 0x3C 0x2F (ASCII '</')</dt>
+ <dt>A sequence of bytes starting with: 0x3C 0x3F (ASCII '<?')</dt>
+ <dd>
+
+ <p>Advance the <var title="">position</var> pointer so that it
+ points at the first 0x3E byte (ASCII >) that comes after the
+ 0x3C byte that was found.</p>
+
+ </dd>
+
+ <dt>Any other byte</dt>
+ <dd>
+
+ <p>Do nothing with that byte.</p>
+
+ </dd>
+
+ </dl></li>
+
+ <li><i>Next byte</i>: Move <var title="">position</var> so it
+ points at the next byte in the input stream, and return to the step
+ above labeled <i>loop</i>.</li>
+
+ </ol><p>When the <a href=#prescan-a-byte-stream-to-determine-its-encoding>prescan a byte stream to determine its
+ encoding</a> algorithm says to <dfn id=concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an attribute</dfn>,
+ it means doing this:</p>
+
+ <ol><li><p>If the byte at <var title="">position</var> is one of 0x09
+ (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR),
+ 0x20 (ASCII space), or 0x2F (ASCII /) then advance <var title="">position</var> to the next byte and redo this
+ step.</li>
+
+ <li><p>If the byte at <var title="">position</var> is 0x3E (ASCII
+ >), then abort the <a href=#concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an
+ attribute</a> algorithm. There isn't one.</li>
+
+ <li><p>Otherwise, the byte at <var title="">position</var> is the
+ start of the attribute name. Let <var title="">attribute name</var>
+ and <var title="">attribute value</var> be the empty
+ string.</li>
+
+ <li><p><i>Attribute name</i>: Process the byte at <var title="">position</var> as follows:</p>
+
+ <dl class=switch><dt>If it is 0x3D (ASCII =), and the <var title="">attribute
+ name</var> is longer than the empty string</dt>
+
+ <dd>Advance <var title="">position</var> to the next byte and
+ jump to the step below labeled <i>value</i>.</dd>
+
+ <dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
+ FF), 0x0D (ASCII CR), or 0x20 (ASCII space)</dt>
+
+ <dd>Jump to the step below labeled <i>spaces</i>.</dd>
+
+ <dt>If it is 0x2F (ASCII /) or 0x3E (ASCII >)</dt>
+
+ <dd>Abort the <a href=#concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an
+ attribute</a> algorithm. The attribute's name is the value of
+ <var title="">attribute name</var>, its value is the empty
+ string.</dd>
+
+ <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
+ Z)</dt>
+
+ <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute name</var> (where <var title="">b</var> is
+ the value of the byte at <var title="">position</var>). (This
+ converts the input to lowercase.)</dd>
+
+ <dt>Anything else</dt>
+
+ <dd>Append the Unicode character with the same code point as the
+ value of the byte at <var title="">position</var> to <var title="">attribute name</var>. (It doesn't actually matter how
+ bytes outside the ASCII range are handled here, since only
+ ASCII characters can contribute to the detection of a character
+ encoding.)</dd>
+
+ </dl></li>
+
+ <li><p>Advance <var title="">position</var> to the next byte and
+ return to the previous step.</li>
+
+ <li><p><i>Spaces</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
+ LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
+ advance <var title="">position</var> to the next byte, then,
+ repeat this step.</li>
+
+ <li><p>If the byte at <var title="">position</var> is <em>not</em>
+ 0x3D (ASCII =), abort the <a href=#concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an
+ attribute</a> algorithm. The attribute's name is the value of
+ <var title="">attribute name</var>, its value is the empty
+ string.</li>
+
+ <li><p>Advance <var title="">position</var> past the 0x3D (ASCII
+ =) byte.</li>
+
+ <li><p><i>Value</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
+ LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
+ advance <var title="">position</var> to the next byte, then,
+ repeat this step.</li>
+
+ <li><p>Process the byte at <var title="">position</var> as
+ follows:</p>
+
+ <dl class=switch><dt>If it is 0x22 (ASCII ") or 0x27 (ASCII ')</dt>
+
+ <dd>
+
+ <ol><li>Let <var title="">b</var> be the value of the byte at
+ <var title="">position</var>.</li>
+
+ <li><i>Quote loop</i>: Advance <var title="">position</var> to
+ the next byte.</li>
+
+ <li>If the value of the byte at <var title="">position</var> is
+ the value of <var title="">b</var>, then advance <var title="">position</var> to the next byte and abort the "get an
+ attribute" algorithm. The attribute's name is the value of <var title="">attribute name</var>, and its value is the value of
+ <var title="">attribute value</var>.</li>
+
+ <li>Otherwise, if the value of the byte at <var title="">position</var> is in the range 0x41 (ASCII A) to 0x5A
+ (ASCII Z), then append a Unicode character to <var title="">attribute value</var> whose code point is 0x20 more
+ than the value of the byte at <var title="">position</var>.</li>
+
+ <li>Otherwise, append a Unicode character to <var title="">attribute value</var> whose code point is the same as
+ the value of the byte at <var title="">position</var>.</li>
+
+ <li>Return to the step above labeled <i>quote loop</i>.</li>
+
+ </ol></dd>
+
+ <dt>If it is 0x3E (ASCII >)</dt>
+
+ <dd>Abort the <a href=#concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an
+ attribute</a> algorithm. The attribute's name is the value of
+ <var title="">attribute name</var>, its value is the empty
+ string.</dd>
+
+
+ <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
+ Z)</dt>
+
+ <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute value</var> (where <var title="">b</var> is
+ the value of the byte at <var title="">position</var>). Advance
+ <var title="">position</var> to the next byte.</dd>
+
+ <dt>Anything else</dt>
+
+ <dd>Append the Unicode character with the same code point as the
+ value of the byte at <var title="">position</var> to <var title="">attribute value</var>. Advance <var title="">position</var> to the next byte.</dd>
+
+ </dl></li>
+
+ <li><p>Process the byte at <var title="">position</var> as
+ follows:</p>
+
+ <dl class=switch><dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
+ FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E (ASCII
+ >)</dt>
+
+ <dd>Abort the <a href=#concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an
+ attribute</a> algorithm. The attribute's name is the value of
+ <var title="">attribute name</var> and its value is the value of
+ <var title="">attribute value</var>.</dd>
+
+ <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII Z)</dt>
+
+ <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute value</var> (where <var title="">b</var> is
+ the value of the byte at <var title="">position</var>).</dd>
+
+ <dt>Anything else</dt>
+
+ <dd>Append the Unicode character with the same code point as the
+ value of the byte at <var title="">position</var> to <var title="">attribute value</var>.</dd>
+
+ </dl></li>
+
+ <li><p>Advance <var title="">position</var> to the next byte and
+ return to the previous step.</li>
+
+ </ol><p>For the sake of interoperability, user agents should not use a
+ pre-scan algorithm that returns different results than the one
+ described above. (But, if you do, please at least let us know, so
+ that we can improve this algorithm and benefit everyone...)</p>
+
<!--(removed this since the specs are being changed)
- <p class="note">This algorithm is a <span>willful violation</span>
- of the HTTP specification, which requires that the encoding be
- assumed to be ISO-8859-1 in the absence of a <span>character
- encoding declaration</span> to the contrary, and of RFC 2046,
- which requires that the encoding be assumed to be US-ASCII in the
- absence of a <span>character encoding declaration</span> to the
- contrary. This specification's third approach is motivated by a
+ <p class="note">These algorithms are a <span>willful
+ violation</span> of the HTTP specification, which requires that the
+ encoding be assumed to be ISO-8859-1 in the absence of a
+ <span>character encoding declaration</span> to the contrary, and of
+ RFC 2046, which requires that the encoding be assumed to be US-ASCII
+ in the absence of a <span>character encoding declaration</span> to
+ the contrary. This specification's third approach is motivated by a
desire to be maximally compatible with legacy content. <a
href="#refsHTTP">[HTTP]</a> <a href="#refsRFC2046">[RFC2046]</a></p>
-->
+
+
<h5 id=character-encodings-0><span class=secno>12.2.2.2 </span>Character encodings</h5>
<p>User agents must at a minimum support the UTF-8 and Windows-1252
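Editorial sketch, not part of the commit: the factored-out "prescan a byte stream to determine its encoding" and "get an attribute" steps in the hunk above can be read roughly as the Python below. Several things here are assumptions made for illustration only: the function names, the 1024-byte `limit` standing in for the end condition, Python's codec registry standing in for the spec's encoding labels, and the `_CHARSET` regular expression, which is only a simplified stand-in for the separate "algorithm for extracting an encoding from a meta element".

```python
import codecs
import re

WS = b"\t\n\x0c\r "  # TAB, LF, FF, CR, space
# Assumption: simplified stand-in for the spec's meta "content" extraction steps.
_CHARSET = re.compile(r"charset[\t\n\x0c\r ]*=[\t\n\x0c\r ]*[\"']?([^\"'\t\n\x0c\r ;]+)")


def _lower(b: int) -> int:
    return b + 0x20 if 0x41 <= b <= 0x5A else b  # ASCII A-Z to a-z


def _label_to_encoding(label: str):
    # Assumption: Python codec names stand in for the spec's encodings.
    try:
        return codecs.lookup(label).name
    except (LookupError, ValueError):
        return None


def get_attribute(data: bytes, pos: int):
    """Return (name, value, pos); name is None when there is no attribute.

    Raises IndexError when the bytes run out.
    """
    while data[pos] in WS + b"/":
        pos += 1
    if data[pos] == 0x3E:  # '>'
        return None, None, pos
    name, value = bytearray(), bytearray()
    while True:  # "Attribute name"
        b = data[pos]
        if b == 0x3D and name:  # '=' after a non-empty name
            pos += 1
            break
        if b in WS:  # "Spaces"
            while data[pos] in WS:
                pos += 1
            if data[pos] != 0x3D:
                return name.decode("latin-1"), "", pos
            pos += 1
            break
        if b in b"/>":
            return name.decode("latin-1"), "", pos
        name.append(_lower(b))
        pos += 1
    while data[pos] in WS:  # "Value"
        pos += 1
    b = data[pos]
    if b in b"\"'":  # quoted value: read up to the matching quote
        pos += 1
        while data[pos] != b:
            value.append(_lower(data[pos]))
            pos += 1
        pos += 1
    elif b != 0x3E:  # unquoted value: read up to whitespace or '>'
        while data[pos] not in WS + b">":
            value.append(_lower(data[pos]))
            pos += 1
    return name.decode("latin-1"), value.decode("latin-1"), pos


def prescan(data: bytes, limit: int = 1024):
    """Return an encoding name, or None if the prescan aborts unsuccessfully."""
    data, pos = data[:limit], 0
    try:
        while pos < len(data):  # "Loop"
            if data.startswith(b"<!--", pos):
                pos = data.index(b"-->", pos + 2) + 2  # land on the closing '>'
            elif data[pos:pos + 5].lower() == b"<meta" and data[pos + 5] in WS + b"/":
                pos += 5
                seen, got_pragma, need_pragma, charset = set(), False, None, None
                while True:  # "Attributes"
                    name, value, pos = get_attribute(data, pos)
                    if name is None:
                        break
                    if name in seen:
                        continue
                    seen.add(name)
                    if name == "http-equiv" and value == "content-type":
                        got_pragma = True
                    elif name == "content" and charset is None:
                        m = _CHARSET.search(value)
                        if m:
                            charset, need_pragma = m.group(1), True
                    elif name == "charset":
                        charset, need_pragma = value, False
                # "Processing"
                if need_pragma is False or (need_pragma and got_pragma):
                    enc = _label_to_encoding(charset)
                    if enc and enc.startswith("utf-16"):
                        enc = "utf-8"
                    if enc:
                        return enc
            elif re.match(rb"</?[A-Za-z]", data[pos:pos + 3]):
                while data[pos] not in WS + b">":
                    pos += 1
                while get_attribute(data, pos)[0] is not None:
                    pos = get_attribute(data, pos)[2]
            elif data[pos:pos + 2] in (b"<!", b"</", b"<?"):
                pos = data.index(b">", pos)
            pos += 1  # "Next byte"
    except (IndexError, ValueError):
        pass  # ran out of bytes before a decision was reached
    return None
```

A caller corresponding to the rewritten sniffing step would treat a non-None result as the encoding with confidence tentative, and fall through to the next step on None. Running out of bytes is modelled here as an exception so that every read in both functions aborts the prescan in the same way.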
Modified: source
===================================================================
--- source 2012-02-11 18:45:11 UTC (rev 6989)
+++ source 2012-02-13 21:06:58 UTC (rev 6990)
@@ -94148,10 +94148,10 @@
parse of the document with the real encoding.</p>
<p id="documentEncoding">User agents must use the following
- algorithm (the <dfn>encoding sniffing algorithm</dfn>) to determine
- the character encoding to use when decoding a document in the first
- pass. This algorithm takes as input any out-of-band metadata
- available to the user agent (e.g. the <span
+ algorithm, called the <dfn>encoding sniffing algorithm</dfn>, to
+ determine the character encoding to use when decoding a document in
+ the first pass. This algorithm takes as input any out-of-band
+ metadata available to the user agent (e.g. the <span
title="Content-Type">Content-Type metadata</span> of the document)
and all the bytes available so far, and returns an encoding and a
<dfn title="concept-encoding-confidence">confidence</dfn>. The
@@ -94194,9 +94194,9 @@
<p class="note">The authoring conformance requirements for
character encoding declarations limit them to only appearing <a
href="#charset1024">in the first 1024 bytes</a>. User agents are
- therefore encouraged to use the preparse algorithm below (part of
- these steps) on the first 1024 bytes, but not to stall beyond
- that.</p>
+ therefore encouraged to use the prescan algorithm below (as
+ invoked by these steps) on the first 1024 bytes, but not to stall
+ beyond that.</p>
</li>
@@ -94243,389 +94243,28 @@
<p class="note">This step looks for Unicode Byte Order Marks
(BOMs).</p></li>
- <li><p>Otherwise, the user agent will have to search for explicit
- character encoding information in the file itself. This should
- proceed as follows:
+ <li>
- <p>Let <var title="">position</var> be a pointer to a byte in the
- input stream, initially pointing at the first byte. If at any
- point during these substeps the user agent either runs out of
- bytes or decides that scanning further bytes would not be
- efficient, then skip to the next step of the overall character
- encoding detection algorithm. User agents may decide that scanning
- <em>any</em> bytes is not efficient, in which case these substeps
- are entirely skipped.</p>
+ <p>Otherwise, optionally <span title="prescan a byte stream to
+ determine its encoding">prescan the byte stream to determine its
+ encoding</span>. The <var title="">end condition</var> is that the
+ user agent decides that scanning further bytes would not be
+ efficient. User agents are encouraged to only prescan the first
+ 1024 bytes. User agents may decide that scanning <em>any</em>
+ bytes is not efficient, in which case these substeps are entirely
+ skipped.</p>
- <p>Now, repeat the following "two" steps until the algorithm
- aborts (either because user agent aborts, as described above, or
- because a character encoding is found):</p>
+ <p>The aforementioned algorithm either aborts unsuccessfully or
+ returns a character encoding. If it returns a character encoding,
+ then this algorithm must be aborted, returning the same encoding,
+ with <span title="concept-encoding-confidence">confidence</span>
+ <i>tentative</i>.</p>
- <ol>
-
- <li><p>If <var title="">position</var> points to:</p>
-
- <dl class="switch">
-
- <dt>A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (ASCII '<!--')</dt>
- <dd>
-
- <p>Advance the <var title="">position</var> pointer so that it
- points at the first 0x3E byte which is preceded by two 0x2D
- bytes (i.e. at the end of an ASCII '-->' sequence) and comes
- after the 0x3C byte that was found. (The two 0x2D bytes can be
- the same as the those in the '<!--' sequence.)</p>
-
- </dd>
-
- <dt>A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII '<meta' followed by a space or slash)</dt>
- <dd>
-
- <ol>
-
- <li><p>Advance the <var title="">position</var> pointer so
- that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or
- 0x2F byte (the one in sequence of characters matched
- above).</p></li>
-
- <li><p>Let <var title="">attribute list</var> be an empty
- list of strings.</p></li> <!-- so long as we only care about
- http-equiv, content, and charset, this can be a 3-bit
- bitfield -->
-
- <li><p>Let <var title="">got pragma</var> be false.</p></li>
-
- <li><p>Let <var title="">need pragma</var> be null.</p></li>
-
- <li><p>Let <var title="">charset</var> be the null value
- (which, for the purposes of this algorithm, is distinct from
- an unrecognised encoding or the empty string).</p></li>
-
- <li><p><i>Attributes</i>: <span
- title="concept-get-attributes-when-sniffing">Get an
- attribute</span> and its value. If no attribute was sniffed,
- then jump to the <i>processing</i> step below.</p></li>
-
- <li><p>If the attribute's name is already in <var
- title="">attribute list</var>, then return to the step
- labeled <i>attributes</i>.</p>
-
- <li><p>Add the attribute's name to <var title="">attribute
- list</var>.</p>
-
- <li>
-
- <p>Run the appropriate step from the following list, if one
- applies:</p>
-
- <dl class="switch">
-
- <dt>If the attribute's name is "<code
- title="">http-equiv</code>"</dt>
-
- <dd><p>If the attribute's value is "<code
- title="">content-type</code>", then set <var title="">got
- pragma</var> to true.</p></dd>
-
- <dt>If the attribute's name is "<code
- title="">content</code>"</dt>
-
- <dd><p>Apply the <span>algorithm for extracting an encoding
- from a <code>meta</code> element</span>, giving the
- attribute's value as the string to parse. If an encoding is
- returned, and if <var title="">charset</var> is still set
- to null, let <var title="">charset</var> be the encoding
- returned, and set <var title="">need pragma</var> to
- true.</p></dd>
-
- <dt>If the attribute's name is "<code
- title="">charset</code>"</dt>
-
- <dd><p>Let <var title="">charset</var> be the encoding
- corresponding to the attribute's value, and set <var
- title="">need pragma</var> to false.</p></dd>
-
- </dl>
-
- </li>
-
- <li><p>Return to the step labeled <i>attributes</i>.</p></li>
-
- <li><p><i>Processing</i>: If <var title="">need pragma</var>
- is null, then jump to the second step of the overall "two
- step" algorithm.</p></li>
-
- <li><p>If <var title="">need pragma</var> is true but <var
- title="">got pragma</var> is false, then jump to the second
- step of the overall "two step" algorithm.</p></li>
-
- <li><p>If <var title="">charset</var> is <span>a UTF-16
- encoding</span>, change the value of <var
- title="">charset</var> to UTF-8.</p></li>
-
- <li><p>If <var title="">charset</var> is not a supported
- character encoding, then jump to the second step of the
- overall "two step" algorithm.</p></li>
-
- <li><p>Return the encoding given by <var
- title="">charset</var>, with <span
- title="concept-encoding-confidence">confidence</span>
- <i>tentative</i>, and abort all these steps.</p></li>
-
- </ol>
-
- </dd>
-
- <dt>A sequence of bytes starting with a 0x3C byte (ASCII <), optionally a 0x2F byte (ASCII /), and finally a byte in the range 0x41-0x5A or 0x61-0x7A (an ASCII letter)</dt>
- <dd>
-
- <ol>
-
- <li><p>Advance the <var title="">position</var> pointer so
- that it points at the next 0x09 (ASCII TAB), 0x0A (ASCII LF),
- 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E
- (ASCII >) byte.</p></li>
-
- <li><p>Repeatedly <span
- title="concept-get-attributes-when-sniffing">get an
- attribute</span> until no further attributes can be found,
- then jump to the second step in the overall "two step"
- algorithm.</p></li>
-
- </ol>
-
- </dd>
-
- <dt>A sequence of bytes starting with: 0x3C 0x21 (ASCII '<!')</dt>
- <dt>A sequence of bytes starting with: 0x3C 0x2F (ASCII '</')</dt>
- <dt>A sequence of bytes starting with: 0x3C 0x3F (ASCII '<?')</dt>
- <dd>
-
- <p>Advance the <var title="">position</var> pointer so that it
- points at the first 0x3E byte (ASCII >) that comes after the
- 0x3C byte that was found.</p>
-
- </dd>
-
- <dt>Any other byte</dt>
- <dd>
-
- <p>Do nothing with that byte.</p>
-
- </dd>
-
- </dl>
-
- </li>
-
- <li>Move <var title="">position</var> so it points at the next
- byte in the input stream, and return to the first step of this
- "two step" algorithm.</li>
-
- </ol>
-
- <p>When the above "two step" algorithm says to <dfn
- title="concept-get-attributes-when-sniffing">get an
- attribute</dfn>, it means doing this:</p>
-
- <ol>
-
- <li><p>If the byte at <var title="">position</var> is one of 0x09
- (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR),
- 0x20 (ASCII space), or 0x2F (ASCII /) then advance <var
- title="">position</var> to the next byte and redo this
- substep.</p></li>
-
- <li><p>If the byte at <var title="">position</var> is 0x3E (ASCII
- >), then abort the "get an attribute" algorithm. There isn't
- one.</p></li>
-
- <li><p>Otherwise, the byte at <var title="">position</var> is the
- start of the attribute name. Let <var title="">attribute
- name</var> and <var title="">attribute value</var> be the empty
- string.</p></li>
-
- <li><p><i>Attribute name</i>: Process the byte at <var
- title="">position</var> as follows:</p>
-
- <dl class="switch">
-
- <dt>If it is 0x3D (ASCII =), and the <var title="">attribute
- name</var> is longer than the empty string</dt>
-
- <dd>Advance <var title="">position</var> to the next byte and
- jump to the step below labeled <i>value</i>.</dd>
-
- <dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
- FF), 0x0D (ASCII CR), or 0x20 (ASCII space)</dt>
-
- <dd>Jump to the step below labeled <i>spaces</i>.</dd>
-
- <dt>If it is 0x2F (ASCII /) or 0x3E (ASCII >)</dt>
-
- <dd>Abort the "get an attribute" algorithm. The attribute's
- name is the value of <var title="">attribute name</var>, its
- value is the empty string.</dd>
-
- <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
- Z)</dt>
-
- <dd>Append the Unicode character with code point <span
- title=""><var title="">b</var>+0x20</span> to <var
- title="">attribute name</var> (where <var title="">b</var> is
- the value of the byte at <var title="">position</var>). (This
- converts the input to lowercase.)</dd>
-
- <dt>Anything else</dt>
-
- <dd>Append the Unicode character with the same code point as the
- value of the byte at <var title="">position</var>) to <var
- title="">attribute name</var>. (It doesn't actually matter how
- bytes outside the ASCII range are handled here, since only
- ASCII characters can contribute to the detection of a character
- encoding.)</dd>
-
- </dl>
-
- </li>
-
- <li><p>Advance <var title="">position</var> to the next byte and
- return to the previous step.</p></li>
-
- <li><p><i>Spaces</i>: If the byte at <var
- title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
- LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
- advance <var title="">position</var> to the next byte, then,
- repeat this step.</p></li>
-
- <li><p>If the byte at <var title="">position</var> is
- <em>not</em> 0x3D (ASCII =), abort the "get an attribute"
- algorithm. The attribute's name is the value of <var
- title="">attribute name</var>, its value is the empty
- string.</p></li>
-
- <li><p>Advance <var title="">position</var> past the 0x3D (ASCII
- =) byte.</p></li>
-
- <li><p><i>Value</i>: If the byte at <var
- title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
- LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
- advance <var title="">position</var> to the next byte, then,
- repeat this step.</p></li>
-
- <li><p>Process the byte at <var title="">position</var> as
- follows:</p>
-
- <dl class="switch">
-
- <dt>If it is 0x22 (ASCII ") or 0x27 (ASCII ')</dt>
-
- <dd>
-
- <ol>
-
- <li>Let <var title="">b</var> be the value of the byte at
- <var title="">position</var>.</li>
-
- <li>Advance <var title="">position</var> to the next
- byte.</li>
-
- <li>If the value of the byte at <var title="">position</var>
- is the value of <var title="">b</var>, then advance <var
- title="">position</var> to the next byte and abort the "get
- an attribute" algorithm. The attribute's name is the value of
- <var title="">attribute name</var>, and its value is the
- value of <var title="">attribute value</var>.</li>
-
- <li>Otherwise, if the value of the byte at <var
- title="">position</var> is in the range 0x41 (ASCII A) to
- 0x5A (ASCII Z), then append a Unicode character to <var
- title="">attribute value</var> whose code point is 0x20 more
- than the value of the byte at <var
- title="">position</var>.</li>
-
- <li>Otherwise, append a Unicode character to <var
- title="">attribute value</var> whose code point is the same as
- the value of the byte at <var title="">position</var>.</li>
-
- <li>Return to the second step in these substeps.</li>
-
- </ol>
-
- </dd>
-
- <dt>If it is 0x3E (ASCII >)</dt>
-
- <dd>Abort the "get an attribute" algorithm. The attribute's
- name is the value of <var title="">attribute name</var>, its
- value is the empty string.</dd>
-
-
- <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
- Z)</dt>
-
- <dd>Append the Unicode character with code point <span title=""><var
- title="">b</var>+0x20</span> to <var title="">attribute
- value</var> (where <var title="">b</var> is the value of the
- byte at <var title="">position</var>). Advance <var
- title="">position</var> to the next byte.</dd>
-
- <dt>Anything else</dt>
-
- <dd>Append the Unicode character with the same code point as the
- value of the byte at <var title="">position</var>) to <var
- title="">attribute value</var>. Advance <var
- title="">position</var> to the next byte.</dd>
-
- </dl>
-
- </li>
-
- <li><p>Process the byte at <var title="">position</var> as
- follows:</p>
-
- <dl class="switch">
-
- <dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
- FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E (ASCII
- >)</dt>
-
- <dd>Abort the "get an attribute" algorithm. The attribute's
- name is the value of <var title="">attribute name</var> and its
- value is the value of <var title="">attribute value</var>.</dd>
-
- <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
- Z)</dt>
-
- <dd>Append the Unicode character with code point <span title=""><var
- title="">b</var>+0x20</span> to <var title="">attribute
- value</var> (where <var title="">b</var> is the value of the
- byte at <var title="">position</var>).</dd>
-
- <dt>Anything else</dt>
-
- <dd>Append the Unicode character with the same code point as the
- value of the byte at <var title="">position</var>) to <var
- title="">attribute value</var>.</dd>
-
- </dl>
-
- </li>
-
- <li><p>Advance <var title="">position</var> to the next byte and
- return to the previous step.</p></li>
-
- </ol>
-
- <p>For the sake of interoperability, user agents should not use a
- pre-scan algorithm that returns different results than the one
- described above. (But, if you do, please at least let us know, so
- that we can improve this algorithm and benefit everyone...)</p>
-
</li>
- <li><p>If the user agent has information on the likely encoding for
- this page, e.g. based on the encoding of the page when it was last
- visited, then return that encoding, with the <span
+ <li><p>Otherwise, if the user agent has information on the likely
+ encoding for this page, e.g. based on the encoding of the page when
+ it was last visited, then return that encoding, with the <span
title="concept-encoding-confidence">confidence</span>
<i>tentative</i>, and abort these steps.</p></li>
@@ -94814,18 +94453,408 @@
as the user agent uses the returned value to select the decoder to
use for the input stream.</p>
+ <hr>
+
+ <p>When an algorithm requires a user agent to <dfn>prescan a byte
+ stream to determine its encoding</dfn>, given some defined <var
+ title="">end condition</var>, then it must run the following steps.
+ These steps either abort unsuccessfully or return a character
+ encoding.</p>
+
+ <ol>
+
+ <li>
+
+ <p>Let <var title="">position</var> be a pointer to a byte in the
+ input stream, initially pointing at the first byte. If at any
+ point during these steps the user agent either runs out of bytes
+ or reaches its <var title="">end condition</var>, then abort the
+ <span>prescan a byte stream to determine its encoding</span>
+ algorithm unsuccessfully.</p>
+
+ </li>
+
+ <li>
+
+ <p><i>Loop</i>: If <var title="">position</var> points to:</p>
+
+ <dl class="switch">
+
+ <dt>A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (ASCII '<!--')</dt>
+ <dd>
+
+ <p>Advance the <var title="">position</var> pointer so that it
+ points at the first 0x3E byte which is preceded by two 0x2D
+ bytes (i.e. at the end of an ASCII '-->' sequence) and comes
+ after the 0x3C byte that was found. (The two 0x2D bytes can be
+ the same as those in the '<!--' sequence.)</p>
+
+ </dd>
+
+ <dt>A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII '<meta' followed by a space or slash)</dt>
+ <dd>
+
+ <ol>
+
+ <li><p>Advance the <var title="">position</var> pointer so
+ that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or
+ 0x2F byte (the one in the sequence of characters matched
+ above).</p></li>
+
+ <li><p>Let <var title="">attribute list</var> be an empty
+ list of strings.</p></li> <!-- so long as we only care about
+ http-equiv, content, and charset, this can be a 3-bit
+ bitfield -->
+
+ <li><p>Let <var title="">got pragma</var> be false.</p></li>
+
+ <li><p>Let <var title="">need pragma</var> be null.</p></li>
+
+ <li><p>Let <var title="">charset</var> be the null value
+ (which, for the purposes of this algorithm, is distinct from
+ an unrecognised encoding or the empty string).</p></li>
+
+ <li><p><i>Attributes</i>: <span
+ title="concept-get-attributes-when-sniffing">Get an
+ attribute</span> and its value. If no attribute was sniffed,
+ then jump to the <i>processing</i> step below.</p></li>
+
+ <li><p>If the attribute's name is already in <var
+ title="">attribute list</var>, then return to the step
+ labeled <i>attributes</i>.</p>
+
+ <li><p>Add the attribute's name to <var title="">attribute
+ list</var>.</p>
+
+ <li>
+
+ <p>Run the appropriate step from the following list, if one
+ applies:</p>
+
+ <dl class="switch">
+
+ <dt>If the attribute's name is "<code
+ title="">http-equiv</code>"</dt>
+
+ <dd><p>If the attribute's value is "<code
+ title="">content-type</code>", then set <var title="">got
+ pragma</var> to true.</p></dd>
+
+ <dt>If the attribute's name is "<code
+ title="">content</code>"</dt>
+
+ <dd><p>Apply the <span>algorithm for extracting an encoding
+ from a <code>meta</code> element</span>, giving the
+ attribute's value as the string to parse. If an encoding is
+ returned, and if <var title="">charset</var> is still set
+ to null, let <var title="">charset</var> be the encoding
+ returned, and set <var title="">need pragma</var> to
+ true.</p></dd>
+
+ <dt>If the attribute's name is "<code
+ title="">charset</code>"</dt>
+
+ <dd><p>Let <var title="">charset</var> be the encoding
+ corresponding to the attribute's value, and set <var
+ title="">need pragma</var> to false.</p></dd>
+
+ </dl>
+
+ </li>
+
+ <li><p>Return to the step labeled <i>attributes</i>.</p></li>
+
+ <li><p><i>Processing</i>: If <var title="">need pragma</var> is
+ null, then jump to the step below labeled <i>next
+ byte</i>.</p></li>
+
+ <li><p>If <var title="">need pragma</var> is true but <var
+ title="">got pragma</var> is false, then jump to the step below
+ labeled <i>next byte</i>.</p></li>
+
+ <li><p>If <var title="">charset</var> is <span>a UTF-16
+ encoding</span>, change the value of <var
+ title="">charset</var> to UTF-8.</p></li>
+
+ <li><p>If <var title="">charset</var> is not a supported
+ character encoding, then jump to the step below labeled <i>next
+ byte</i>.</p></li>
+
+ <li><p>Abort the <span>prescan a byte stream to determine its
+ encoding</span> algorithm, returning the encoding given by <var
+ title="">charset</var>.</p></li>
+
+ </ol>
+
+ </dd>
+
+ <dt>A sequence of bytes starting with a 0x3C byte (ASCII <), optionally a 0x2F byte (ASCII /), and finally a byte in the range 0x41-0x5A or 0x61-0x7A (an ASCII letter)</dt>
+ <dd>
+
+ <ol>
+
+ <li><p>Advance the <var title="">position</var> pointer so
+ that it points at the next 0x09 (ASCII TAB), 0x0A (ASCII LF),
+ 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E
+ (ASCII >) byte.</p></li>
+
+ <li><p>Repeatedly <span
+ title="concept-get-attributes-when-sniffing">get an
+ attribute</span> until no further attributes can be found, then
+ jump to the step below labeled <i>next byte</i>.</p></li>
+
+ </ol>
+
+ </dd>
+
+ <dt>A sequence of bytes starting with: 0x3C 0x21 (ASCII '<!')</dt>
+ <dt>A sequence of bytes starting with: 0x3C 0x2F (ASCII '</')</dt>
+ <dt>A sequence of bytes starting with: 0x3C 0x3F (ASCII '<?')</dt>
+ <dd>
+
+ <p>Advance the <var title="">position</var> pointer so that it
+ points at the first 0x3E byte (ASCII >) that comes after the
+ 0x3C byte that was found.</p>
+
+ </dd>
+
+ <dt>Any other byte</dt>
+ <dd>
+
+ <p>Do nothing with that byte.</p>
+
+ </dd>
+
+ </dl>
+
+ </li>
+
+ <li><i>Next byte</i>: Move <var title="">position</var> so it
+ points at the next byte in the input stream, and return to the step
+ above labeled <i>loop</i>.</li>
+
+ </ol>
+
+ <p>When the <span>prescan a byte stream to determine its
+ encoding</span> algorithm says to <dfn
+ title="concept-get-attributes-when-sniffing">get an attribute</dfn>,
+ it means doing this:</p>
+
+ <ol>
+
+ <li><p>If the byte at <var title="">position</var> is one of 0x09
+ (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR),
+ 0x20 (ASCII space), or 0x2F (ASCII /) then advance <var
+ title="">position</var> to the next byte and redo this
+ step.</p></li>
+
+ <li><p>If the byte at <var title="">position</var> is 0x3E (ASCII
+ >), then abort the <span
+ title="concept-get-attributes-when-sniffing">get an
+ attribute</span> algorithm. There isn't one.</p></li>
+
+ <li><p>Otherwise, the byte at <var title="">position</var> is the
+ start of the attribute name. Let <var title="">attribute name</var>
+ and <var title="">attribute value</var> be the empty
+ string.</p></li>
+
+ <li><p><i>Attribute name</i>: Process the byte at <var
+ title="">position</var> as follows:</p>
+
+ <dl class="switch">
+
+ <dt>If it is 0x3D (ASCII =), and the <var title="">attribute
+ name</var> is longer than the empty string</dt>
+
+ <dd>Advance <var title="">position</var> to the next byte and
+ jump to the step below labeled <i>value</i>.</dd>
+
+ <dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
+ FF), 0x0D (ASCII CR), or 0x20 (ASCII space)</dt>
+
+ <dd>Jump to the step below labeled <i>spaces</i>.</dd>
+
+ <dt>If it is 0x2F (ASCII /) or 0x3E (ASCII >)</dt>
+
+ <dd>Abort the <span
+ title="concept-get-attributes-when-sniffing">get an
+ attribute</span> algorithm. The attribute's name is the value of
+ <var title="">attribute name</var>, its value is the empty
+ string.</dd>
+
+ <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
+ Z)</dt>
+
+ <dd>Append the Unicode character with code point <span
+ title=""><var title="">b</var>+0x20</span> to <var
+ title="">attribute name</var> (where <var title="">b</var> is
+ the value of the byte at <var title="">position</var>). (This
+ converts the input to lowercase.)</dd>
+
+ <dt>Anything else</dt>
+
+ <dd>Append the Unicode character with the same code point as the
+ value of the byte at <var title="">position</var> to <var
+ title="">attribute name</var>. (It doesn't actually matter how
+ bytes outside the ASCII range are handled here, since only
+ ASCII characters can contribute to the detection of a character
+ encoding.)</dd>
+
+ </dl>
+
+ </li>
+
+ <li><p>Advance <var title="">position</var> to the next byte and
+ return to the previous step.</p></li>
+
+ <li><p><i>Spaces</i>: If the byte at <var
+ title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
+ LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
+ advance <var title="">position</var> to the next byte, then,
+ repeat this step.</p></li>
+
+ <li><p>If the byte at <var title="">position</var> is <em>not</em>
+ 0x3D (ASCII =), abort the <span
+ title="concept-get-attributes-when-sniffing">get an
+ attribute</span> algorithm. The attribute's name is the value of
+ <var title="">attribute name</var>, its value is the empty
+ string.</p></li>
+
+ <li><p>Advance <var title="">position</var> past the 0x3D (ASCII
+ =) byte.</p></li>
+
+ <li><p><i>Value</i>: If the byte at <var
+ title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
+ LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
+ advance <var title="">position</var> to the next byte, then,
+ repeat this step.</p></li>
+
+ <li><p>Process the byte at <var title="">position</var> as
+ follows:</p>
+
+ <dl class="switch">
+
+ <dt>If it is 0x22 (ASCII ") or 0x27 (ASCII ')</dt>
+
+ <dd>
+
+ <ol>
+
+ <li>Let <var title="">b</var> be the value of the byte at
+ <var title="">position</var>.</li>
+
+ <li><i>Quote loop</i>: Advance <var title="">position</var> to
+ the next byte.</li>
+
+ <li>If the value of the byte at <var title="">position</var> is
+ the value of <var title="">b</var>, then advance <var
+ title="">position</var> to the next byte and abort the "get an
+ attribute" algorithm. The attribute's name is the value of <var
+ title="">attribute name</var>, and its value is the value of
+ <var title="">attribute value</var>.</li>
+
+ <li>Otherwise, if the value of the byte at <var
+ title="">position</var> is in the range 0x41 (ASCII A) to 0x5A
+ (ASCII Z), then append a Unicode character to <var
+ title="">attribute value</var> whose code point is 0x20 more
+ than the value of the byte at <var
+ title="">position</var>.</li>
+
+ <li>Otherwise, append a Unicode character to <var
+ title="">attribute value</var> whose code point is the same as
+ the value of the byte at <var title="">position</var>.</li>
+
+ <li>Return to the step above labeled <i>quote loop</i>.</li>
+
+ </ol>
+
+ </dd>
+
+ <dt>If it is 0x3E (ASCII >)</dt>
+
+ <dd>Abort the <span
+ title="concept-get-attributes-when-sniffing">get an
+ attribute</span> algorithm. The attribute's name is the value of
+ <var title="">attribute name</var>, its value is the empty
+ string.</dd>
+
+
+ <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
+ Z)</dt>
+
+ <dd>Append the Unicode character with code point <span
+ title=""><var title="">b</var>+0x20</span> to <var
+ title="">attribute value</var> (where <var title="">b</var> is
+ the value of the byte at <var title="">position</var>). Advance
+ <var title="">position</var> to the next byte.</dd>
+
+ <dt>Anything else</dt>
+
+ <dd>Append the Unicode character with the same code point as the
+ value of the byte at <var title="">position</var> to <var
+ title="">attribute value</var>. Advance <var
+ title="">position</var> to the next byte.</dd>
+
+ </dl>
+
+ </li>
+
+ <li><p>Process the byte at <var title="">position</var> as
+ follows:</p>
+
+ <dl class="switch">
+
+ <dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
+ FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E (ASCII
+ >)</dt>
+
+ <dd>Abort the <span
+ title="concept-get-attributes-when-sniffing">get an
+ attribute</span> algorithm. The attribute's name is the value of
+ <var title="">attribute name</var> and its value is the value of
+ <var title="">attribute value</var>.</dd>
+
+ <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII Z)</dt>
+
+ <dd>Append the Unicode character with code point <span
+ title=""><var title="">b</var>+0x20</span> to <var
+ title="">attribute value</var> (where <var title="">b</var> is
+ the value of the byte at <var title="">position</var>).</dd>
+
+ <dt>Anything else</dt>
+
+ <dd>Append the Unicode character with the same code point as the
+ value of the byte at <var title="">position</var> to <var
+ title="">attribute value</var>.</dd>
+
+ </dl>
+
+ </li>
+
+ <li><p>Advance <var title="">position</var> to the next byte and
+ return to the previous step.</p></li>
+
+ </ol>
+
+ <p>For the sake of interoperability, user agents should not use a
+ pre-scan algorithm that returns different results than the one
+ described above. (But, if you do, please at least let us know, so
+ that we can improve this algorithm and benefit everyone...)</p>
+
<!--(removed this since the specs are being changed)
- <p class="note">This algorithm is a <span>willful violation</span>
- of the HTTP specification, which requires that the encoding be
- assumed to be ISO-8859-1 in the absence of a <span>character
- encoding declaration</span> to the contrary, and of RFC 2046,
- which requires that the encoding be assumed to be US-ASCII in the
- absence of a <span>character encoding declaration</span> to the
- contrary. This specification's third approach is motivated by a
+ <p class="note">These algorithms are a <span>willful
+ violation</span> of the HTTP specification, which requires that the
+ encoding be assumed to be ISO-8859-1 in the absence of a
+ <span>character encoding declaration</span> to the contrary, and of
+ RFC 2046, which requires that the encoding be assumed to be US-ASCII
+ in the absence of a <span>character encoding declaration</span> to
+ the contrary. This specification's third approach is motivated by a
desire to be maximally compatible with legacy content. <a
href="#refsHTTP">[HTTP]</a> <a href="#refsRFC2046">[RFC2046]</a></p>
-->
+
+
<h5>Character encodings</h5>
<p>User agents must at a minimum support the UTF-8 and Windows-1252
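
Editorial note: the factored-out "prescan a byte stream to determine its encoding" and "get an attribute" steps added by this revision can be sketched roughly as follows. This is a minimal, non-normative illustration, not the specification's or any user agent's implementation. The names prescan, get_attribute, lower, supported and the limit parameter are hypothetical; the end condition is assumed to be "first 1024 bytes"; the CHARSET regular expression is only a crude stand-in for the "algorithm for extracting an encoding from a meta element"; and "supported character encoding" is assumed to mean "known to Python's codec registry".

```python
import codecs
import re

WS = b"\t\n\x0c\r "                      # TAB, LF, FF, CR, space
TAG = re.compile(rb"</?[A-Za-z]")        # '<', optional '/', ASCII letter
# Assumption: crude stand-in for "extracting an encoding from a meta element".
CHARSET = re.compile(r"charset\s*=\s*[\"']?([^\"'\s;]+)")


def lower(b):
    """Lowercase bytes 0x41-0x5A by adding 0x20; return a one-character string."""
    return chr(b + 0x20) if 0x41 <= b <= 0x5A else chr(b)


def supported(label):
    """Assumption: 'supported' means Python's codec registry knows the label."""
    try:
        codecs.lookup(label)
        return True
    except (LookupError, ValueError):
        return False


def get_attribute(data, pos):
    """Return (name, value, new_pos); name is None when there is no attribute.

    Reading past the end raises IndexError, which models "runs out of bytes".
    """
    while data[pos] in WS + b"/":
        pos += 1
    if data[pos] == 0x3E:                      # '>': there isn't one
        return None, None, pos
    name, value = "", ""
    while True:                                # "attribute name" step
        b = data[pos]
        if b == 0x3D and name:                 # '=' after a non-empty name
            pos += 1
            break
        if b in WS:                            # "spaces" step
            while data[pos] in WS:
                pos += 1
            if data[pos] != 0x3D:
                return name, "", pos
            pos += 1
            break
        if b in (0x2F, 0x3E):                  # '/' or '>'
            return name, "", pos
        name += lower(b)
        pos += 1
    while data[pos] in WS:                     # "value" step
        pos += 1
    quote = data[pos]
    if quote in (0x22, 0x27):                  # quoted value: the "quote loop"
        while True:
            pos += 1
            if data[pos] == quote:
                return name, value, pos + 1
            value += lower(data[pos])
    while data[pos] not in WS + b">":          # unquoted value
        value += lower(data[pos])
        pos += 1
    return name, value, pos


def prescan(data, limit=1024):
    """Return a lowercased encoding label, or None if the prescan aborts."""
    data, pos = data[:limit], 0                # assumed end condition
    try:
        while pos < len(data):                 # "loop" step
            if data.startswith(b"<!--", pos):
                # The two '-' bytes may be the ones in '<!--' itself.
                pos = data.index(b"-->", pos + 2) + 2
            elif data[pos:pos + 5].lower() == b"<meta" and data[pos + 5] in WS + b"/":
                pos += 5
                seen, got_pragma, need_pragma, charset = set(), False, None, None
                while True:                    # "attributes" step
                    name, value, pos = get_attribute(data, pos)
                    if name is None:
                        break
                    if name in seen:
                        continue
                    seen.add(name)
                    if name == "http-equiv" and value == "content-type":
                        got_pragma = True
                    elif name == "content":
                        m = CHARSET.search(value)
                        if m and charset is None:
                            charset, need_pragma = m.group(1), True
                    elif name == "charset":
                        charset, need_pragma = value, False
                # "processing" step
                if need_pragma is False or (need_pragma and got_pragma):
                    if charset in ("utf-16", "utf-16le", "utf-16be"):
                        charset = "utf-8"
                    if supported(charset):
                        return charset
            elif TAG.match(data, pos):
                while data[pos] not in WS + b">":
                    pos += 1
                name = ""
                while name is not None:        # skip this tag's attributes
                    name, _, pos = get_attribute(data, pos)
            elif data[pos:pos + 2] in (b"<!", b"</", b"<?"):
                pos = data.index(b">", pos + 1)
            pos += 1                           # "next byte" step
    except (IndexError, ValueError):           # ran out of bytes
        pass
    return None
```

A caller implementing the encoding sniffing algorithm would treat a non-None result as the encoding to return with confidence tentative, and would fall through to its next step otherwise.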