[html5] r2842 - [] (0) Abstract out the encoding stuff from the parser to the infrastructure sec [...]
whatwg at whatwg.org
Thu Feb 19 03:04:55 PST 2009
Author: ianh
Date: 2009-02-19 03:04:54 -0800 (Thu, 19 Feb 2009)
New Revision: 2842
Modified:
index
source
Log:
[] (0) Abstract out the encoding stuff from the parser to the infrastructure section so that it also affects form submission
Modified: index
===================================================================
--- index 2009-02-19 10:20:04 UTC (rev 2841)
+++ index 2009-02-19 11:04:54 UTC (rev 2842)
@@ -277,20 +277,21 @@
<li><a href=#content-type-sniffing:-unknown-type><span class=secno>2.7.4 </span>Content-Type sniffing: unknown type</a></li>
<li><a href=#content-type-sniffing:-image><span class=secno>2.7.5 </span>Content-Type sniffing: image</a></li>
<li><a href=#content-type-sniffing:-feed-or-html><span class=secno>2.7.6 </span>Content-Type sniffing: feed or HTML</a></ol></li>
- <li><a href=#common-dom-interfaces><span class=secno>2.8 </span>Common DOM interfaces</a>
+ <li><a href=#character-encodings-0><span class=secno>2.8 </span>Character encodings</a></li>
+ <li><a href=#common-dom-interfaces><span class=secno>2.9 </span>Common DOM interfaces</a>
<ol>
- <li><a href=#reflecting-content-attributes-in-dom-attributes><span class=secno>2.8.1 </span>Reflecting content attributes in DOM attributes</a></li>
- <li><a href=#collections><span class=secno>2.8.2 </span>Collections</a>
+ <li><a href=#reflecting-content-attributes-in-dom-attributes><span class=secno>2.9.1 </span>Reflecting content attributes in DOM attributes</a></li>
+ <li><a href=#collections><span class=secno>2.9.2 </span>Collections</a>
<ol>
- <li><a href=#htmlcollection><span class=secno>2.8.2.1 </span>HTMLCollection</a></li>
- <li><a href=#htmlformcontrolscollection><span class=secno>2.8.2.2 </span>HTMLFormControlsCollection</a></li>
- <li><a href=#htmloptionscollection><span class=secno>2.8.2.3 </span>HTMLOptionsCollection</a></ol></li>
- <li><a href=#domtokenlist><span class=secno>2.8.3 </span>DOMTokenList</a></li>
- <li><a href=#safe-passing-of-structured-data><span class=secno>2.8.4 </span>Safe passing of structured data</a></li>
- <li><a href=#domstringmap><span class=secno>2.8.5 </span>DOMStringMap</a></li>
- <li><a href=#dom-feature-strings><span class=secno>2.8.6 </span>DOM feature strings</a></li>
- <li><a href=#exceptions><span class=secno>2.8.7 </span>Exceptions</a></li>
- <li><a href=#garbage-collection><span class=secno>2.8.8 </span>Garbage collection</a></ol></ol></li>
+ <li><a href=#htmlcollection><span class=secno>2.9.2.1 </span>HTMLCollection</a></li>
+ <li><a href=#htmlformcontrolscollection><span class=secno>2.9.2.2 </span>HTMLFormControlsCollection</a></li>
+ <li><a href=#htmloptionscollection><span class=secno>2.9.2.3 </span>HTMLOptionsCollection</a></ol></li>
+ <li><a href=#domtokenlist><span class=secno>2.9.3 </span>DOMTokenList</a></li>
+ <li><a href=#safe-passing-of-structured-data><span class=secno>2.9.4 </span>Safe passing of structured data</a></li>
+ <li><a href=#domstringmap><span class=secno>2.9.5 </span>DOMStringMap</a></li>
+ <li><a href=#dom-feature-strings><span class=secno>2.9.6 </span>DOM feature strings</a></li>
+ <li><a href=#exceptions><span class=secno>2.9.7 </span>Exceptions</a></li>
+ <li><a href=#garbage-collection><span class=secno>2.9.8 </span>Garbage collection</a></ol></ol></li>
<li><a href=#dom><span class=secno>3 </span>Semantics and structure of HTML documents</a>
<ol>
<li><a href=#semantics-intro><span class=secno>3.1 </span>Introduction</a></li>
@@ -940,9 +941,8 @@
<li><a href=#the-input-stream><span class=secno>8.2.2 </span>The input stream</a>
<ol>
<li><a href=#determining-the-character-encoding><span class=secno>8.2.2.1 </span>Determining the character encoding</a></li>
- <li><a href=#character-encoding-requirements><span class=secno>8.2.2.2 </span>Character encoding requirements</a></li>
- <li><a href=#preprocessing-the-input-stream><span class=secno>8.2.2.3 </span>Preprocessing the input stream</a></li>
- <li><a href=#changing-the-encoding-while-parsing><span class=secno>8.2.2.4 </span>Changing the encoding while parsing</a></ol></li>
+ <li><a href=#preprocessing-the-input-stream><span class=secno>8.2.2.2 </span>Preprocessing the input stream</a></li>
+ <li><a href=#changing-the-encoding-while-parsing><span class=secno>8.2.2.3 </span>Changing the encoding while parsing</a></ol></li>
<li><a href=#parse-state><span class=secno>8.2.3 </span>Parse state</a>
<ol>
<li><a href=#the-insertion-mode><span class=secno>8.2.3.1 </span>The insertion mode</a></li>
@@ -5513,10 +5513,90 @@
- <h3 id=common-dom-interfaces><span class=secno>2.8 </span>Common DOM interfaces</h3>
+ <h3 id=character-encodings-0><span class=secno>2.8 </span>Character encodings</h3>
- <h4 id=reflecting-content-attributes-in-dom-attributes><span class=secno>2.8.1 </span>Reflecting content attributes in DOM attributes</h4>
+ <p>User agents must at a minimum support the UTF-8 and Windows-1252
+ encodings, but may support more.</p>
+ <p class=note>It is not unusual for Web browsers to support dozens
+ if not upwards of a hundred distinct character encodings.</p>
+
+ <p>User agents must support the preferred MIME name of every
+ character encoding they support that has a preferred MIME name, and
+ should support all the IANA-registered aliases. <a href=#refsIANACHARSET>[IANACHARSET]</a></p>
+
+ <p>When comparing a string specifying a character encoding with the
+ name or alias of a character encoding to determine if they are
+ equal, user agents must use the Charset Alias Matching rules defined
+ in Unicode Technical Standard #22. <a href=#refsUTS22>[UTS22]</a></p> <!-- XXXrefs
+ http://unicode.org/reports/tr22/#Charset_Alias_Matching -->
+
+ <p class=example>For instance, "GB_2312-80" and "g.b.2312(80)" are
+ considered equivalent names.</p>
+
+ <hr><p>When a user agent would otherwise use an encoding given in the
+ first column of the following table to either convert content to
+ Unicode characters or convert Unicode characters to bytes, it must
+ instead use the encoding given in the cell in the second column of
+ the same row. When a byte or sequence of bytes is treated
+ differently due to this encoding aliasing, it is said to have been
+ <dfn id=misinterpreted-for-compatibility>misinterpreted for compatibility</dfn>.</p>
+
+ <table><caption>Character encoding overrides</caption>
+ <thead><tr><th> Input encoding <th> Replacement encoding <th> References
+ <tbody><!-- how about EUC-JP? --><tr><td> EUC-KR <td> Windows-949 <td>
+ <a href=#refsEUCKR>[EUCKR]</a> <!-- see reference for [EUC-KR] in RFC1557 -->
+ <a href=#refsWin949>[WIN949]</a><!-- http://www.microsoft.com/globaldev/reference/dbcs/949.mspx -->
+ <tr><td> GB2312 <td> GBK <td>
+ <a href=#refsGB2312>[GB2312]</a><!-- XXX ? -->
+ <a href=#refsGBK>[GBK]</a><!-- http://www.iana.org/assignments/charset-reg/GBK -->
+ <tr><td> GB_2312-80 <td> GBK <td>
+ <a href=#refsRFC1345>[RFC1345]</a><!-- XXX consider more direct reference? -->
+ <a href=#refsGBK>[GBK]</a><!-- http://www.iana.org/assignments/charset-reg/GBK -->
+ <tr><td> ISO-8859-1 <td> Windows-1252 <td>
+ <a href=#refsRFC1345>[RFC1345]</a><!-- XXX consider more direct reference? -->
+ <a href=#refsWin1252>[WIN1252]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/1252.htm -->
+ <tr><td> ISO-8859-9 <td> Windows-1254 <td>
+ <a href=#refsRFC1345>[RFC1345]</a><!-- XXX consider more direct reference? -->
+ <a href=#refsWin1254>[WIN1254]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/1254.htm -->
+ <tr><td> ISO-8859-11 <td> Windows-874 <td>
+ <a href=#refsISO885911>[ISO885911]</a><!-- get reference from http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=28263 -->
+ <a href=#refsWin874>[WIN874]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/874.mspx -->
+ <tr><td> KS_C_5601-1987 <td> Windows-949 <td>
+ <a href=#refsRFC1345>[RFC1345]</a><!-- XXX consider more direct reference? -->
+ <a href=#refsWin949>[WIN949]</a><!-- http://www.microsoft.com/globaldev/reference/dbcs/949.mspx -->
+ <tr><td> TIS-620 <td> Windows-874 <td>
+ <a href=#refsTIS620>[TIS620]</a> <!-- http://www.nectec.or.th/it-standards/std620/std620.htm -->
+ <a href=#refsWin874>[WIN874]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/874.mspx -->
+ <tr><td> US-ASCII <td> Windows-1252 <td>
+ <a href=#refsRFC1345>[RFC1345]</a><!-- XXX consider more direct reference? -->
+ <a href=#refsWin1252>[WIN1252]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/1252.htm -->
+ <tr><td> x-x-big5 <td> Big5 <td>
+ <a href=#refsBIG5>[BIG5]</a> <!-- XXX ? -->
+ </table><p class=note>The requirement to treat certain encodings as other
+ encodings according to the table above is a willful violation of the
+ W3C Character Model specification. <a href=#refsCHARMOD>[CHARMOD]</a></p>
+
+ <hr><p>User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU
+ encodings. <a href=#refsCESU8>[CESU8]</a> <a href=#refsUTF7>[UTF7]</a> <a href=#refsBOCU1>[BOCU1]</a> <a href=#refsSCSU>[SCSU]</a></p>
+
+ <p>Support for encodings based on EBCDIC is not recommended. This
+ encoding is rarely used for publicly-facing Web content.</p>
+
+ <p>Support for UTF-32 is not recommended. This encoding is rarely
+ used, and frequently misimplemented.</p>
+
+ <p class=note>This specification does not make any attempt to
+ support EBCDIC-based encodings and UTF-32 in its algorithms; support
+ and use of these encodings can thus lead to unexpected behavior in
+ implementations of this specification.</p>
+
+
+
+ <h3 id=common-dom-interfaces><span class=secno>2.9 </span>Common DOM interfaces</h3>
+
+ <h4 id=reflecting-content-attributes-in-dom-attributes><span class=secno>2.9.1 </span>Reflecting content attributes in DOM attributes</h4>
+
<p>Some <span title="DOM attribute">DOM attributes</span> are
defined to <dfn id=reflect>reflect</dfn> a particular <span>content
attribute</span>. This means that on getting, the DOM attribute
@@ -5673,7 +5753,7 @@
- <h4 id=collections><span class=secno>2.8.2 </span>Collections</h4>
+ <h4 id=collections><span class=secno>2.9.2 </span>Collections</h4>
<p>The <code><a href=#htmlcollection-0>HTMLCollection</a></code>,
<code><a href=#htmlformcontrolscollection-0>HTMLFormControlsCollection</a></code>, and
@@ -5704,7 +5784,7 @@
object every time it is retrieved.</p>
- <h5 id=htmlcollection><span class=secno>2.8.2.1 </span>HTMLCollection</h5>
+ <h5 id=htmlcollection><span class=secno>2.9.2.1 </span>HTMLCollection</h5>
<p>The <code><a href=#htmlcollection-0>HTMLCollection</a></code> interface represents a generic
<a href=#collections-0 title=collections>collection</a> of elements.</p>
@@ -5755,7 +5835,7 @@
- <h5 id=htmlformcontrolscollection><span class=secno>2.8.2.2 </span>HTMLFormControlsCollection</h5>
+ <h5 id=htmlformcontrolscollection><span class=secno>2.9.2.2 </span>HTMLFormControlsCollection</h5>
<p>The <code><a href=#htmlformcontrolscollection-0>HTMLFormControlsCollection</a></code> interface represents
a <a href=#collections-0 title=collections>collection</a> of <a href=#category-listed title=category-listed>listed</a> elements in <code><a href=#the-form-element>form</a></code>
@@ -5806,7 +5886,7 @@
</ol><!--
http://software.hixie.ch/utilities/js/live-dom-viewer/?%3C%21DOCTYPE%20html%3E...%0A%3Cform%20name%3D%22a%22%3E%3Cinput%20id%3D%22x%22%20name%3D%22y%22%3E%3Cinput%20name%3D%22x%22%20id%3D%22y%22%3E%3C/form%3E%0A%3Cscript%3E%0A%20%20var%20x%3B%0A%20%20w%28x%20%3D%20document.forms%5B%27a%27%5D%5B%27x%27%5D%29%3B%0A%20%20w%28x.length%29%3B%0A%20%20x%5B0%5D.parentNode.removeChild%28x%5B0%5D%29%3B%0A%20%20w%28x.length%29%3B%0A%20%20w%28x%20%3D%3D%20document.forms%5B%27a%27%5D%5B%27x%27%5D%29%3B%0A%3C/script%3E%0A
---><h5 id=htmloptionscollection><span class=secno>2.8.2.3 </span>HTMLOptionsCollection</h5>
+--><h5 id=htmloptionscollection><span class=secno>2.9.2.3 </span>HTMLOptionsCollection</h5>
<p>The <code><a href=#htmloptionscollection-0>HTMLOptionsCollection</a></code> interface represents a
list of <code><a href=#the-option-element>option</a></code> elements. It is always rooted on a
@@ -5924,7 +6004,7 @@
<li><p>Remove <var title="">element</var> from its parent
node.</li>
- </ol><!-- see also http://ln.hixie.ch/?start=1161042744&count=1 --><h4 id=domtokenlist><span class=secno>2.8.3 </span>DOMTokenList</h4>
+ </ol><!-- see also http://ln.hixie.ch/?start=1161042744&count=1 --><h4 id=domtokenlist><span class=secno>2.9.3 </span>DOMTokenList</h4>
<p>The <code><a href=#domtokenlist-0>DOMTokenList</a></code> interface represents an interface
to an underlying string that consists of an <a href=#unordered-set-of-unique-space-separated-tokens>unordered set of
@@ -6047,7 +6127,7 @@
underlying string representation.</p>
- <h4 id=safe-passing-of-structured-data><span class=secno>2.8.4 </span>Safe passing of structured data</h4>
+ <h4 id=safe-passing-of-structured-data><span class=secno>2.9.4 </span>Safe passing of structured data</h4>
<p>When a user agent is required to obtain a <dfn id=structured-clone>structured
clone</dfn> of an object, it must run the following algorithm, which
@@ -6130,7 +6210,7 @@
</ol></dd>
- </dl><h4 id=domstringmap><span class=secno>2.8.5 </span>DOMStringMap</h4>
+ </dl><h4 id=domstringmap><span class=secno>2.9.5 </span>DOMStringMap</h4>
<p>The <code><a href=#domstringmap-0>DOMStringMap</a></code> interface represents a set of
name-value pairs. When a <code><a href=#domstringmap-0>DOMStringMap</a></code> object is
@@ -6169,7 +6249,7 @@
implemented for those languages.</p>
- <h4 id=dom-feature-strings><span class=secno>2.8.6 </span>DOM feature strings</h4>
+ <h4 id=dom-feature-strings><span class=secno>2.9.6 </span>DOM feature strings</h4>
<p>DOM3 Core defines mechanisms for checking for interface support,
and for obtaining implementations of interfaces, using <a href=http://www.w3.org/TR/DOM-Level-3-Core/core.html#DOMFeatures>feature
@@ -6200,7 +6280,7 @@
not guaranteed that an implementation that supports "<code title="">HTML</code>" "<code>5.0</code>" also supports "<code title="">HTML</code>" "<code>2.0</code>".</p>
- <h4 id=exceptions><span class=secno>2.8.7 </span>Exceptions</h4>
+ <h4 id=exceptions><span class=secno>2.9.7 </span>Exceptions</h4>
<p>The following <code>DOMException</code> codes are defined in DOM
Core. <a href=#refsDOMCORE>[DOMCORE]</a></p>
@@ -6232,7 +6312,7 @@
<li value=23><dfn id=unavailable_script_err><code>UNAVAILABLE_SCRIPT_ERR</code></dfn></li> <!-- actually defined right here for now -->
<li value=81><dfn id=parse_err><code>PARSE_ERR</code></dfn></li> <!-- actually defined in dom3ls -->
<li value=82><dfn id=serialise_err><code>SERIALISE_ERR</code></dfn></li> <!-- actually defined in dom3ls -->
- </ol><h4 id=garbage-collection><span class=secno>2.8.8 </span>Garbage collection</h4>
+ </ol><h4 id=garbage-collection><span class=secno>2.9.8 </span>Garbage collection</h4>
<p>There is an <dfn id=implied-strong-reference>implied strong reference</dfn> from any DOM
attribute that returns a pre-existing object to that object.</p>
@@ -48679,92 +48759,9 @@
use for the input stream.</p>
- <h5 id=character-encoding-requirements><span class=secno>8.2.2.2 </span>Character encoding requirements</h5>
- <p>User agents must at a minimum support the UTF-8 and Windows-1252
- encodings, but may support more.</p>
+ <h5 id=preprocessing-the-input-stream><span class=secno>8.2.2.2 </span>Preprocessing the input stream</h5>
- <p class=note>It is not unusual for Web browsers to support dozens
- if not upwards of a hundred distinct character encodings.</p>
-
- <p>User agents must support the preferred MIME name of every
- character encoding they support that has a preferred MIME name, and
- should support all the IANA-registered aliases. <a href=#refsIANACHARSET>[IANACHARSET]</a></p>
-
- <!-- XXX should all this be abstracted out so it can be used for
- <script charset=""> and <form accept-charset="">? Maybe move this
- stuff and the 'character encodings' section of the terminology
- section into its own infrastructure subsection? -->
-
- <p>When comparing a string specifying a character encoding with the
- name or alias of a character encoding to determine if they are
- equal, user agents must use the Charset Alias Matching rules defined
- in Unicode Technical Standard #22. <a href=#refsUTS22>[UTS22]</a></p> <!-- XXXrefs
- http://unicode.org/reports/tr22/#Charset_Alias_Matching -->
-
- <p class=example>For instance, "GB_2312-80" and "g.b.2312(80)" are
- considered equivalent names.</p>
-
- <p>When a user agent would otherwise use an encoding given in the
- first column of the following table, it must instead use the
- encoding given in the cell in the second column of the same row. Any
- bytes that are treated differently due to this encoding aliasing
- must be considered <a href=#parse-error title="parse error">parse
- errors</a>.</p>
-
- <table><caption>Character encoding overrides</caption>
- <thead><tr><th> Input encoding <th> Replacement encoding <th> References
- <tbody><!-- how about EUC-JP? --><tr><td> EUC-KR <td> Windows-949 <td>
- <a href=#refsEUCKR>[EUCKR]</a> <!-- see reference for [EUC-KR] in RFC1557 -->
- <a href=#refsWin949>[WIN949]</a><!-- http://www.microsoft.com/globaldev/reference/dbcs/949.mspx -->
- <tr><td> GB2312 <td> GBK <td>
- <a href=#refsGB2312>[GB2312]</a><!-- XXX ? -->
- <a href=#refsGBK>[GBK]</a><!-- http://www.iana.org/assignments/charset-reg/GBK -->
- <tr><td> GB_2312-80 <td> GBK <td>
- <a href=#refsRFC1345>[RFC1345]</a><!-- XXX consider more direct reference? -->
- <a href=#refsGBK>[GBK]</a><!-- http://www.iana.org/assignments/charset-reg/GBK -->
- <tr><td> ISO-8859-1 <td> Windows-1252 <td>
- <a href=#refsRFC1345>[RFC1345]</a><!-- XXX consider more direct reference? -->
- <a href=#refsWin1252>[WIN1252]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/1252.htm -->
- <tr><td> ISO-8859-9 <td> Windows-1254 <td>
- <a href=#refsRFC1345>[RFC1345]</a><!-- XXX consider more direct reference? -->
- <a href=#refsWin1254>[WIN1254]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/1254.htm -->
- <tr><td> ISO-8859-11 <td> Windows-874 <td>
- <a href=#refsISO885911>[ISO885911]</a><!-- get reference from http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=28263 -->
- <a href=#refsWin874>[WIN874]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/874.mspx -->
- <tr><td> KS_C_5601-1987 <td> Windows-949 <td>
- <a href=#refsRFC1345>[RFC1345]</a><!-- XXX consider more direct reference? -->
- <a href=#refsWin949>[WIN949]</a><!-- http://www.microsoft.com/globaldev/reference/dbcs/949.mspx -->
- <tr><td> TIS-620 <td> Windows-874 <td>
- <a href=#refsTIS620>[TIS620]</a> <!-- http://www.nectec.or.th/it-standards/std620/std620.htm -->
- <a href=#refsWin874>[WIN874]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/874.mspx -->
- <tr><td> US-ASCII <td> Windows-1252 <td>
- <a href=#refsRFC1345>[RFC1345]</a><!-- XXX consider more direct reference? -->
- <a href=#refsWin1252>[WIN1252]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/1252.htm -->
- <tr><td> x-x-big5 <td> Big5 <td>
- <a href=#refsBIG5>[BIG5]</a> <!-- XXX ? -->
- </table><p class=note>The requirement to treat certain encodings as other
- encodings according to the table above is a willful violation of the
- W3C Character Model specification. <a href=#refsCHARMOD>[CHARMOD]</a></p>
-
- <p>User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU
- encodings. <a href=#refsCESU8>[CESU8]</a> <a href=#refsUTF7>[UTF7]</a> <a href=#refsBOCU1>[BOCU1]</a> <a href=#refsSCSU>[SCSU]</a></p>
-
- <p>Support for encodings based on EBCDIC is not recommended. This
- encoding is rarely used for publicly-facing Web content.</p>
-
- <p>Support for UTF-32 is not recommended. This encoding is rarely
- used, and frequently misimplemented.</p>
-
- <p class=note>This specification does not make any attempt to
- support EBCDIC-based encodings and UTF-32 in its algorithms; support
- and use of these encodings can thus lead to unexpected behavior in
- implementations of this specification.</p>
-
-
-
- <h5 id=preprocessing-the-input-stream><span class=secno>8.2.2.3 </span>Preprocessing the input stream</h5>
-
<p>Given an encoding, the bytes in the input stream must be
converted to Unicode characters for the tokeniser, as described by
the rules for that encoding, except that the leading U+FEFF BYTE
@@ -48782,6 +48779,10 @@
(e.g. invalid UTF-8 byte sequences in a UTF-8 input stream) are
errors that conformance checkers are expected to report.</p>
+ <p>Any byte or sequences of bytes in the original byte stream that
+ is <a href=#misinterpreted-for-compatibility>misinterpreted for compatibility</a> is a <a href=#parse-error>parse
+ error</a>.</p>
+
<p>One leading U+FEFF BYTE ORDER MARK character must be ignored if
any are present.</p>
@@ -48834,7 +48835,7 @@
the stream, but rather the lack of any further characters.</p>
- <h5 id=changing-the-encoding-while-parsing><span class=secno>8.2.2.4 </span>Changing the encoding while parsing</h5>
+ <h5 id=changing-the-encoding-while-parsing><span class=secno>8.2.2.3 </span>Changing the encoding while parsing</h5>
<p>When the parser requires the user agent to <dfn id=change-the-encoding>change the
encoding</dfn>, it must run the following steps. This might happen
Modified: source
===================================================================
--- source 2009-02-19 10:20:04 UTC (rev 2841)
+++ source 2009-02-19 11:04:54 UTC (rev 2842)
@@ -5335,6 +5335,102 @@
+ <h3>Character encodings</h3>
+
+ <p>User agents must at a minimum support the UTF-8 and Windows-1252
+ encodings, but may support more.</p>
+
+ <p class="note">It is not unusual for Web browsers to support dozens
+ if not upwards of a hundred distinct character encodings.</p>
+
+ <p>User agents must support the preferred MIME name of every
+ character encoding they support that has a preferred MIME name, and
+ should support all the IANA-registered aliases. <a
+ href="#refsIANACHARSET">[IANACHARSET]</a></p>
+
+ <p>When comparing a string specifying a character encoding with the
+ name or alias of a character encoding to determine if they are
+ equal, user agents must use the Charset Alias Matching rules defined
+ in Unicode Technical Standard #22. <a
+ href="#refsUTS22">[UTS22]</a></p> <!-- XXXrefs
+ http://unicode.org/reports/tr22/#Charset_Alias_Matching -->
+
+ <p class="example">For instance, "GB_2312-80" and "g.b.2312(80)" are
+ considered equivalent names.</p>
+
+ <hr>
+
+ <p>When a user agent would otherwise use an encoding given in the
+ first column of the following table to either convert content to
+ Unicode characters or convert Unicode characters to bytes, it must
+ instead use the encoding given in the cell in the second column of
+ the same row. When a byte or sequence of bytes is treated
+ differently due to this encoding aliasing, it is said to have been
+ <dfn>misinterpreted for compatibility</dfn>.</p>
+
+ <table>
+ <caption>Character encoding overrides</caption>
+ <thead>
+ <tr> <th> Input encoding <th> Replacement encoding <th> References
+ <tbody>
+ <!-- how about EUC-JP? -->
+ <tr> <td> EUC-KR <td> Windows-949 <td>
+ <a href="#refsEUCKR">[EUCKR]</a> <!-- see reference for [EUC-KR] in RFC1557 -->
+ <a href="#refsWin949">[WIN949]</a><!-- http://www.microsoft.com/globaldev/reference/dbcs/949.mspx -->
+ <tr> <td> GB2312 <td> GBK <td>
+ <a href="#refsGB2312">[GB2312]</a><!-- XXX ? -->
+ <a href="#refsGBK">[GBK]</a><!-- http://www.iana.org/assignments/charset-reg/GBK -->
+ <tr> <td> GB_2312-80 <td> GBK <td>
+ <a href="#refsRFC1345">[RFC1345]</a><!-- XXX consider more direct reference? -->
+ <a href="#refsGBK">[GBK]</a><!-- http://www.iana.org/assignments/charset-reg/GBK -->
+ <tr> <td> ISO-8859-1 <td> Windows-1252 <td>
+ <a href="#refsRFC1345">[RFC1345]</a><!-- XXX consider more direct reference? -->
+ <a href="#refsWin1252">[WIN1252]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/1252.htm -->
+ <tr> <td> ISO-8859-9 <td> Windows-1254 <td>
+ <a href="#refsRFC1345">[RFC1345]</a><!-- XXX consider more direct reference? -->
+ <a href="#refsWin1254">[WIN1254]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/1254.htm -->
+ <tr> <td> ISO-8859-11 <td> Windows-874 <td>
+ <a href="#refsISO885911">[ISO885911]</a><!-- get reference from http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=28263 -->
+ <a href="#refsWin874">[WIN874]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/874.mspx -->
+ <tr> <td> KS_C_5601-1987 <td> Windows-949 <td>
+ <a href="#refsRFC1345">[RFC1345]</a><!-- XXX consider more direct reference? -->
+ <a href="#refsWin949">[WIN949]</a><!-- http://www.microsoft.com/globaldev/reference/dbcs/949.mspx -->
+ <tr> <td> TIS-620 <td> Windows-874 <td>
+ <a href="#refsTIS620">[TIS620]</a> <!-- http://www.nectec.or.th/it-standards/std620/std620.htm -->
+ <a href="#refsWin874">[WIN874]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/874.mspx -->
+ <tr> <td> US-ASCII <td> Windows-1252 <td>
+ <a href="#refsRFC1345">[RFC1345]</a><!-- XXX consider more direct reference? -->
+ <a href="#refsWin1252">[WIN1252]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/1252.htm -->
+ <tr> <td> x-x-big5 <td> Big5 <td>
+ <a href="#refsBIG5">[BIG5]</a> <!-- XXX ? -->
+ </tbody>
+ </table>
+
+ <p class="note">The requirement to treat certain encodings as other
+ encodings according to the table above is a willful violation of the
+ W3C Character Model specification. <a
+ href="#refsCHARMOD">[CHARMOD]</a></p>
+
+ <hr>
+
+ <p>User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU
+ encodings. <a href="#refsCESU8">[CESU8]</a> <a
+ href="#refsUTF7">[UTF7]</a> <a href="#refsBOCU1">[BOCU1]</a> <a
+ href="#refsSCSU">[SCSU]</a></p>
+
+ <p>Support for encodings based on EBCDIC is not recommended. This
+ encoding is rarely used for publicly-facing Web content.</p>
+
+ <p>Support for UTF-32 is not recommended. This encoding is rarely
+ used, and frequently misimplemented.</p>
+
+ <p class="note">This specification does not make any attempt to
+ support EBCDIC-based encodings and UTF-32 in its algorithms; support
+ and use of these encodings can thus lead to unexpected behavior in
+ implementations of this specification.</p>
+
+
+
<h3>Common DOM interfaces</h3>
<h4>Reflecting content attributes in DOM attributes</h4>
@@ -55624,102 +55720,7 @@
use for the input stream.</p>
- <h5>Character encoding requirements</h5>
- <p>User agents must at a minimum support the UTF-8 and Windows-1252
- encodings, but may support more.</p>
-
- <p class="note">It is not unusual for Web browsers to support dozens
- if not upwards of a hundred distinct character encodings.</p>
-
- <p>User agents must support the preferred MIME name of every
- character encoding they support that has a preferred MIME name, and
- should support all the IANA-registered aliases. <a
- href="#refsIANACHARSET">[IANACHARSET]</a></p>
-
- <!-- XXX should all this be abstracted out so it can be used for
- <script charset=""> and <form accept-charset="">? Maybe move this
- stuff and the 'character encodings' section of the terminology
- section into its own infrastructure subsection? -->
-
- <p>When comparing a string specifying a character encoding with the
- name or alias of a character encoding to determine if they are
- equal, user agents must use the Charset Alias Matching rules defined
- in Unicode Technical Standard #22. <a
- href="#refsUTS22">[UTS22]</a></p> <!-- XXXrefs
- http://unicode.org/reports/tr22/#Charset_Alias_Matching -->
-
- <p class="example">For instance, "GB_2312-80" and "g.b.2312(80)" are
- considered equivalent names.</p>
-
- <p>When a user agent would otherwise use an encoding given in the
- first column of the following table, it must instead use the
- encoding given in the cell in the second column of the same row. Any
- bytes that are treated differently due to this encoding aliasing
- must be considered <span title="parse error">parse
- errors</span>.</p>
-
- <table>
- <caption>Character encoding overrides</caption>
- <thead>
- <tr> <th> Input encoding <th> Replacement encoding <th> References
- <tbody>
- <!-- how about EUC-JP? -->
- <tr> <td> EUC-KR <td> Windows-949 <td>
- <a href="#refsEUCKR">[EUCKR]</a> <!-- see reference for [EUC-KR] in RFC1557 -->
- <a href="#refsWin949">[WIN949]</a><!-- http://www.microsoft.com/globaldev/reference/dbcs/949.mspx -->
- <tr> <td> GB2312 <td> GBK <td>
- <a href="#refsGB2312">[GB2312]</a><!-- XXX ? -->
- <a href="#refsGBK">[GBK]</a><!-- http://www.iana.org/assignments/charset-reg/GBK -->
- <tr> <td> GB_2312-80 <td> GBK <td>
- <a href="#refsRFC1345">[RFC1345]</a><!-- XXX consider more direct reference? -->
- <a href="#refsGBK">[GBK]</a><!-- http://www.iana.org/assignments/charset-reg/GBK -->
- <tr> <td> ISO-8859-1 <td> Windows-1252 <td>
- <a href="#refsRFC1345">[RFC1345]</a><!-- XXX consider more direct reference? -->
- <a href="#refsWin1252">[WIN1252]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/1252.htm -->
- <tr> <td> ISO-8859-9 <td> Windows-1254 <td>
- <a href="#refsRFC1345">[RFC1345]</a><!-- XXX consider more direct reference? -->
- <a href="#refsWin1254">[WIN1254]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/1254.htm -->
- <tr> <td> ISO-8859-11 <td> Windows-874 <td>
- <a href="#refsISO885911">[ISO885911]</a><!-- get reference from http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=28263 -->
- <a href="#refsWin874">[WIN874]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/874.mspx -->
- <tr> <td> KS_C_5601-1987 <td> Windows-949 <td>
- <a href="#refsRFC1345">[RFC1345]</a><!-- XXX consider more direct reference? -->
- <a href="#refsWin949">[WIN949]</a><!-- http://www.microsoft.com/globaldev/reference/dbcs/949.mspx -->
- <tr> <td> TIS-620 <td> Windows-874 <td>
- <a href="#refsTIS620">[TIS620]</a> <!-- http://www.nectec.or.th/it-standards/std620/std620.htm -->
- <a href="#refsWin874">[WIN874]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/874.mspx -->
- <tr> <td> US-ASCII <td> Windows-1252 <td>
- <a href="#refsRFC1345">[RFC1345]</a><!-- XXX consider more direct reference? -->
- <a href="#refsWin1252">[WIN1252]</a><!-- http://www.microsoft.com/globaldev/reference/sbcs/1252.htm -->
- <tr> <td> x-x-big5 <td> Big5 <td>
- <a href="#refsBIG5">[BIG5]</a> <!-- XXX ? -->
- </tbody>
- </table>
-
- <p class="note">The requirement to treat certain encodings as other
- encodings according to the table above is a willful violation of the
- W3C Character Model specification. <a
- href="#refsCHARMOD">[CHARMOD]</a></p>
-
- <p>User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU
- encodings. <a href="#refsCESU8">[CESU8]</a> <a
- href="#refsUTF7">[UTF7]</a> <a href="#refsBOCU1">[BOCU1]</a> <a
- href="#refsSCSU">[SCSU]</a></p>
-
- <p>Support for encodings based on EBCDIC is not recommended. This
- encoding is rarely used for publicly-facing Web content.</p>
-
- <p>Support for UTF-32 is not recommended. This encoding is rarely
- used, and frequently misimplemented.</p>
-
- <p class="note">This specification does not make any attempt to
- support EBCDIC-based encodings and UTF-32 in its algorithms; support
- and use of these encodings can thus lead to unexpected behavior in
- implementations of this specification.</p>
-
-
-
<h5>Preprocessing the input stream</h5>
<p>Given an encoding, the bytes in the input stream must be
@@ -55739,6 +55740,10 @@
(e.g. invalid UTF-8 byte sequences in a UTF-8 input stream) are
errors that conformance checkers are expected to report.</p>
+ <p>Any byte or sequences of bytes in the original byte stream that
+ is <span>misinterpreted for compatibility</span> is a <span>parse
+ error</span>.</p>
+
<p>One leading U+FEFF BYTE ORDER MARK character must be ignored if
any are present.</p>
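
The label matching, override table, and "misinterpreted for compatibility" definition moved by this revision can be sketched roughly as follows. This is an illustrative sketch only, not code from the specification: the names normalize_label, effective_encoding, and misinterpreted_offsets are hypothetical, the matching covers only the names listed in the table (not the full set of IANA aliases), and Python's "latin-1" and "cp1252" codecs are assumed to stand in for ISO-8859-1 and Windows-1252. Python's cp1252 codec leaves a few bytes undefined, and the sketch counts those as treated differently.

```python
import re
from typing import List

# Rows of the "Character encoding overrides" table from this revision.
OVERRIDES = {
    "EUC-KR": "Windows-949",
    "GB2312": "GBK",
    "GB_2312-80": "GBK",
    "ISO-8859-1": "Windows-1252",
    "ISO-8859-9": "Windows-1254",
    "ISO-8859-11": "Windows-874",
    "KS_C_5601-1987": "Windows-949",
    "TIS-620": "Windows-874",
    "US-ASCII": "Windows-1252",
    "x-x-big5": "Big5",
}


def normalize_label(label: str) -> str:
    """Apply the UTS #22 Charset Alias Matching rules to one label.

    Keep only ASCII letters and digits, lowercase them, then, working
    left to right, drop each "0" that is not preceded by a digit.
    """
    kept = re.sub(r"[^A-Za-z0-9]", "", label).lower()
    out: List[str] = []
    for ch in kept:
        if ch == "0" and not (out and out[-1].isdigit()):
            continue  # leading zero of a number: dropped
        out.append(ch)
    return "".join(out)


# Index the table by normalized name so lookups ignore case and punctuation.
_OVERRIDES_BY_KEY = {normalize_label(k): v for k, v in OVERRIDES.items()}


def effective_encoding(label: str) -> str:
    """Return the replacement encoding for a label, or the label unchanged."""
    return _OVERRIDES_BY_KEY.get(normalize_label(label), label)


def misinterpreted_offsets(data: bytes) -> List[int]:
    """Offsets of bytes that Windows-1252 decodes differently from ISO-8859-1.

    Assumption: Python's "latin-1" and "cp1252" codecs model the two
    encodings; bytes cp1252 cannot decode are counted as differing.
    """
    offsets = []
    for i in range(len(data)):
        b = data[i:i + 1]
        if b.decode("latin-1") != b.decode("cp1252", errors="replace"):
            offsets.append(i)
    return offsets
```

Under these assumptions, effective_encoding("iso-8859-1") yields "Windows-1252", and the bytes reported by misinterpreted_offsets are the ones a parser would flag as parse errors under the new paragraph in "Preprocessing the input stream".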