[html5] r6498 - [e] (0) Clean up how we refer to UTF-16. Fixing http://www.w3.org/Bugs/Public/sh [...]

whatwg at whatwg.org
Wed Aug 17 15:28:04 PDT 2011


Author: ianh
Date: 2011-08-17 15:28:03 -0700 (Wed, 17 Aug 2011)
New Revision: 6498

Modified:
   complete.html
   index
   source
Log:
[e] (0) Clean up how we refer to UTF-16.
Fixing http://www.w3.org/Bugs/Public/show_bug.cgi?id=13396

Modified: complete.html
===================================================================
--- complete.html	2011-08-17 22:20:32 UTC (rev 6497)
+++ complete.html	2011-08-17 22:28:03 UTC (rev 6498)
@@ -3343,6 +3343,10 @@
    different <meta charset> elements applying in each case.
   -->
 
+  <p>The term <dfn id=a-utf-16-encoding>a UTF-16 encoding</dfn> refers to any variant of
+  UTF-16: self-describing UTF-16 with a BOM, ambiguous UTF-16 without
+  a BOM, raw UTF-16LE, and raw UTF-16BE. <a href=#refsRFC2781>[RFC2781]</a></p>
+
   <p>The term <dfn id=unicode-character>Unicode character</dfn> is used to mean a <i title="">Unicode scalar value</i> (i.e. any Unicode code point that
   is not a surrogate code point). <a href=#refsUNICODE>[UNICODE]</a></p>
 
@@ -6627,7 +6631,8 @@
    component contains no unescaped non-ASCII characters. <a href=#refsRFC3987>[RFC3987]</a></li>
 
    <li><p>The <a href=#url>URL</a> is a valid IRI reference and the <a href="#document's-character-encoding" title="document's character encoding">character encoding</a> of
-   the URL's <code><a href=#document>Document</a></code> is UTF-8 or UTF-16. <a href=#refsRFC3987>[RFC3987]</a></li>
+   the URL's <code><a href=#document>Document</a></code> is UTF-8 or <a href=#a-utf-16-encoding>a UTF-16
+   encoding</a>. <a href=#refsRFC3987>[RFC3987]</a></li>
 
   </ul><p>A string is a <dfn id=valid-non-empty-url>valid non-empty URL</dfn> if it is a
   <a href=#valid-url>valid URL</a> but it is not the empty string.</p>
@@ -6819,8 +6824,8 @@
 
     </dl></li>
 
-   <li><p>If <var title="">encoding</var> is a UTF-16 encoding, then
-   change the value of <var title="">encoding</var> to UTF-8.</li>
+   <li><p>If <var title="">encoding</var> is <a href=#a-utf-16-encoding>a UTF-16
+   encoding</a>, then change the value of <var title="">encoding</var> to UTF-8.</li>
 
    <li>
 
@@ -84216,9 +84221,8 @@
          <li><p>If <var title="">need pragma</var> is true but <var title="">got pragma</var> is false, then jump to the second
          step of the overall "two step" algorithm.</li>
 
-         <li><p>If <var title="">charset</var> is a UTF-16 encoding,
-         change the value of <var title="">charset</var> to
-         UTF-8.</li>
+         <li><p>If <var title="">charset</var> is <a href=#a-utf-16-encoding>a UTF-16
+         encoding</a>, change the value of <var title="">charset</var> to UTF-8.</li>
 
          <li><p>If <var title="">charset</var> is not a supported
          character encoding, then jump to the second step of the
@@ -84650,12 +84654,14 @@
   violation</a> of the W3C Character Model specification, motivated
   by a desire for compatibility with legacy content. <a href=#refsCHARMOD>[CHARMOD]</a></p>
 
-  <p>When a user agent is to use the UTF-16 encoding but no BOM has
-  been found, user agents must default to UTF-16LE.</p>
+  <p>When a user agent is to use the self-describing UTF-16 encoding
+  but no BOM has been found, user agents must default to little-endian
+  UTF-16.</p>
 
-  <p class=note>The requirement to default UTF-16 to LE rather than
-  BE is a <a href=#willful-violation>willful violation</a> of RFC 2781, motivated by a
-  desire for compatibility with legacy content. <a href=#refsRFC2781>[RFC2781]</a></p>
+  <p class=note>The requirement to default UTF-16 to little-endian
+  rather than big-endian is a <a href=#willful-violation>willful violation</a> of RFC
+  2781, motivated by a desire for compatibility with legacy content.
+  <a href=#refsRFC2781>[RFC2781]</a></p>
 
   <hr><p>User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU
   encodings. <a href=#refsCESU8>[CESU8]</a> <a href=#refsUTF7>[UTF7]</a> <a href=#refsBOCU1>[BOCU1]</a> <a href=#refsSCSU>[SCSU]</a></p>
@@ -84771,13 +84777,13 @@
    earlier section failed to find the right encoding.</li>
 
    <li>If the encoding that is already being used to interpret the
-   input stream is a UTF-16 encoding, then set the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> to
+   input stream is <a href=#a-utf-16-encoding>a UTF-16 encoding</a>, then set the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> to
    <i>certain</i> and abort these steps. The new encoding is ignored;
    if it was anything but the same encoding, then it would be clearly
    incorrect.</li>
 
-   <li>If the new encoding is a UTF-16 encoding, change it to
-   UTF-8.</li>
+   <li>If the new encoding is <a href=#a-utf-16-encoding>a UTF-16 encoding</a>, change
+   it to UTF-8.</li>
 
    <li>If all the bytes up to the last byte converted by the current
    decoder have the same Unicode interpretations in both the current
@@ -88176,7 +88182,7 @@
 
     <p id=meta-charset-during-parse>If the element has a <code title=attr-meta-charset><a href=#attr-meta-charset>charset</a></code> attribute, and its value
     is either a supported <a href=#ascii-compatible-character-encoding>ASCII-compatible character
-    encoding</a> or a UTF-16 encoding, and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is currently
+    encoding</a> or <a href=#a-utf-16-encoding>a UTF-16 encoding</a>, and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is currently
     <i>tentative</i>, then <a href=#change-the-encoding>change the encoding</a> to the
     encoding given by the value of the <code title=attr-meta-charset><a href=#attr-meta-charset>charset</a></code> attribute.</p>
 
@@ -88186,8 +88192,8 @@
     <code title=attr-meta-content><a href=#attr-meta-content>content</a></code> attribute, and
     applying the <a href=#algorithm-for-extracting-an-encoding-from-a-meta-element>algorithm for extracting an encoding from a
     <code>meta</code> element</a> to that attribute's value returns
-    a supported <a href=#ascii-compatible-character-encoding>ASCII-compatible character encoding</a> or a
-    UTF-16 encoding, and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is currently
+    a supported <a href=#ascii-compatible-character-encoding>ASCII-compatible character encoding</a> or
+    <a href=#a-utf-16-encoding>a UTF-16 encoding</a>, and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is currently
     <i>tentative</i>, then <a href=#change-the-encoding>change the encoding</a> to the
     extracted encoding.</p>
 

Modified: index
===================================================================
--- index	2011-08-17 22:20:32 UTC (rev 6497)
+++ index	2011-08-17 22:28:03 UTC (rev 6498)
@@ -3240,6 +3240,10 @@
    different <meta charset> elements applying in each case.
   -->
 
+  <p>The term <dfn id=a-utf-16-encoding>a UTF-16 encoding</dfn> refers to any variant of
+  UTF-16: self-describing UTF-16 with a BOM, ambiguous UTF-16 without
+  a BOM, raw UTF-16LE, and raw UTF-16BE. <a href=#refsRFC2781>[RFC2781]</a></p>
+
   <p>The term <dfn id=unicode-character>Unicode character</dfn> is used to mean a <i title="">Unicode scalar value</i> (i.e. any Unicode code point that
   is not a surrogate code point). <a href=#refsUNICODE>[UNICODE]</a></p>
 
@@ -6491,7 +6495,8 @@
    component contains no unescaped non-ASCII characters. <a href=#refsRFC3987>[RFC3987]</a></li>
 
    <li><p>The <a href=#url>URL</a> is a valid IRI reference and the <a href="#document's-character-encoding" title="document's character encoding">character encoding</a> of
-   the URL's <code><a href=#document>Document</a></code> is UTF-8 or UTF-16. <a href=#refsRFC3987>[RFC3987]</a></li>
+   the URL's <code><a href=#document>Document</a></code> is UTF-8 or <a href=#a-utf-16-encoding>a UTF-16
+   encoding</a>. <a href=#refsRFC3987>[RFC3987]</a></li>
 
   </ul><p>A string is a <dfn id=valid-non-empty-url>valid non-empty URL</dfn> if it is a
   <a href=#valid-url>valid URL</a> but it is not the empty string.</p>
@@ -6683,8 +6688,8 @@
 
     </dl></li>
 
-   <li><p>If <var title="">encoding</var> is a UTF-16 encoding, then
-   change the value of <var title="">encoding</var> to UTF-8.</li>
+   <li><p>If <var title="">encoding</var> is <a href=#a-utf-16-encoding>a UTF-16
+   encoding</a>, then change the value of <var title="">encoding</var> to UTF-8.</li>
 
    <li>
 
@@ -79663,9 +79668,8 @@
          <li><p>If <var title="">need pragma</var> is true but <var title="">got pragma</var> is false, then jump to the second
          step of the overall "two step" algorithm.</li>
 
-         <li><p>If <var title="">charset</var> is a UTF-16 encoding,
-         change the value of <var title="">charset</var> to
-         UTF-8.</li>
+         <li><p>If <var title="">charset</var> is <a href=#a-utf-16-encoding>a UTF-16
+         encoding</a>, change the value of <var title="">charset</var> to UTF-8.</li>
 
          <li><p>If <var title="">charset</var> is not a supported
          character encoding, then jump to the second step of the
@@ -80097,12 +80101,14 @@
   violation</a> of the W3C Character Model specification, motivated
   by a desire for compatibility with legacy content. <a href=#refsCHARMOD>[CHARMOD]</a></p>
 
-  <p>When a user agent is to use the UTF-16 encoding but no BOM has
-  been found, user agents must default to UTF-16LE.</p>
+  <p>When a user agent is to use the self-describing UTF-16 encoding
+  but no BOM has been found, user agents must default to little-endian
+  UTF-16.</p>
 
-  <p class=note>The requirement to default UTF-16 to LE rather than
-  BE is a <a href=#willful-violation>willful violation</a> of RFC 2781, motivated by a
-  desire for compatibility with legacy content. <a href=#refsRFC2781>[RFC2781]</a></p>
+  <p class=note>The requirement to default UTF-16 to little-endian
+  rather than big-endian is a <a href=#willful-violation>willful violation</a> of RFC
+  2781, motivated by a desire for compatibility with legacy content.
+  <a href=#refsRFC2781>[RFC2781]</a></p>
 
   <hr><p>User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU
   encodings. <a href=#refsCESU8>[CESU8]</a> <a href=#refsUTF7>[UTF7]</a> <a href=#refsBOCU1>[BOCU1]</a> <a href=#refsSCSU>[SCSU]</a></p>
@@ -80218,13 +80224,13 @@
    earlier section failed to find the right encoding.</li>
 
    <li>If the encoding that is already being used to interpret the
-   input stream is a UTF-16 encoding, then set the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> to
+   input stream is <a href=#a-utf-16-encoding>a UTF-16 encoding</a>, then set the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> to
    <i>certain</i> and abort these steps. The new encoding is ignored;
    if it was anything but the same encoding, then it would be clearly
    incorrect.</li>
 
-   <li>If the new encoding is a UTF-16 encoding, change it to
-   UTF-8.</li>
+   <li>If the new encoding is <a href=#a-utf-16-encoding>a UTF-16 encoding</a>, change
+   it to UTF-8.</li>
 
    <li>If all the bytes up to the last byte converted by the current
    decoder have the same Unicode interpretations in both the current
@@ -83623,7 +83629,7 @@
 
     <p id=meta-charset-during-parse>If the element has a <code title=attr-meta-charset><a href=#attr-meta-charset>charset</a></code> attribute, and its value
     is either a supported <a href=#ascii-compatible-character-encoding>ASCII-compatible character
-    encoding</a> or a UTF-16 encoding, and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is currently
+    encoding</a> or <a href=#a-utf-16-encoding>a UTF-16 encoding</a>, and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is currently
     <i>tentative</i>, then <a href=#change-the-encoding>change the encoding</a> to the
     encoding given by the value of the <code title=attr-meta-charset><a href=#attr-meta-charset>charset</a></code> attribute.</p>
 
@@ -83633,8 +83639,8 @@
     <code title=attr-meta-content><a href=#attr-meta-content>content</a></code> attribute, and
     applying the <a href=#algorithm-for-extracting-an-encoding-from-a-meta-element>algorithm for extracting an encoding from a
     <code>meta</code> element</a> to that attribute's value returns
-    a supported <a href=#ascii-compatible-character-encoding>ASCII-compatible character encoding</a> or a
-    UTF-16 encoding, and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is currently
+    a supported <a href=#ascii-compatible-character-encoding>ASCII-compatible character encoding</a> or
+    <a href=#a-utf-16-encoding>a UTF-16 encoding</a>, and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is currently
     <i>tentative</i>, then <a href=#change-the-encoding>change the encoding</a> to the
     extracted encoding.</p>
 

Modified: source
===================================================================
--- source	2011-08-17 22:20:32 UTC (rev 6497)
+++ source	2011-08-17 22:28:03 UTC (rev 6498)
@@ -2202,6 +2202,11 @@
    different <meta charset> elements applying in each case.
   -->
 
+  <p>The term <dfn>a UTF-16 encoding</dfn> refers to any variant of
+  UTF-16: self-describing UTF-16 with a BOM, ambiguous UTF-16 without
+  a BOM, raw UTF-16LE, and raw UTF-16BE. <a
+  href="#refsRFC2781">[RFC2781]</a></p>
+
   <p>The term <dfn>Unicode character</dfn> is used to mean a <i
   title="">Unicode scalar value</i> (i.e. any Unicode code point that
   is not a surrogate code point). <a
@@ -6212,8 +6217,8 @@
 
    <li><p>The <span>URL</span> is a valid IRI reference and the <span
    title="document's character encoding">character encoding</span> of
-   the URL's <code>Document</code> is UTF-8 or UTF-16. <a
-   href="#refsRFC3987">[RFC3987]</a></p></li>
+   the URL's <code>Document</code> is UTF-8 or <span>a UTF-16
+   encoding</span>. <a href="#refsRFC3987">[RFC3987]</a></p></li>
 
   </ul>
 
@@ -6435,8 +6440,9 @@
 
    </li>
 
-   <li><p>If <var title="">encoding</var> is a UTF-16 encoding, then
-   change the value of <var title="">encoding</var> to UTF-8.</p></li>
+   <li><p>If <var title="">encoding</var> is <span>a UTF-16
+   encoding</span>, then change the value of <var
+   title="">encoding</var> to UTF-8.</p></li>
 
    <li>
 
@@ -95332,9 +95338,9 @@
          title="">got pragma</var> is false, then jump to the second
          step of the overall "two step" algorithm.</p></li>
 
-         <li><p>If <var title="">charset</var> is a UTF-16 encoding,
-         change the value of <var title="">charset</var> to
-         UTF-8.</p></li>
+         <li><p>If <var title="">charset</var> is <span>a UTF-16
+         encoding</span>, change the value of <var
+         title="">charset</var> to UTF-8.</p></li>
 
          <li><p>If <var title="">charset</var> is not a supported
          character encoding, then jump to the second step of the
@@ -95876,13 +95882,14 @@
   by a desire for compatibility with legacy content. <a
   href="#refsCHARMOD">[CHARMOD]</a></p>
 
-  <p>When a user agent is to use the UTF-16 encoding but no BOM has
-  been found, user agents must default to UTF-16LE.</p>
+  <p>When a user agent is to use the self-describing UTF-16 encoding
+  but no BOM has been found, user agents must default to little-endian
+  UTF-16.</p>
 
-  <p class="note">The requirement to default UTF-16 to LE rather than
-  BE is a <span>willful violation</span> of RFC 2781, motivated by a
-  desire for compatibility with legacy content. <a
-  href="#refsRFC2781">[RFC2781]</a></p>
+  <p class="note">The requirement to default UTF-16 to little-endian
+  rather than big-endian is a <span>willful violation</span> of RFC
+  2781, motivated by a desire for compatibility with legacy content.
+  <a href="#refsRFC2781">[RFC2781]</a></p>
 
   <hr>
 
@@ -96006,14 +96013,14 @@
    earlier section failed to find the right encoding.</li>
 
    <li>If the encoding that is already being used to interpret the
-   input stream is a UTF-16 encoding, then set the <span
+   input stream is <span>a UTF-16 encoding</span>, then set the <span
    title="concept-encoding-confidence">confidence</span> to
    <i>certain</i> and abort these steps. The new encoding is ignored;
    if it was anything but the same encoding, then it would be clearly
    incorrect.</li>
 
-   <li>If the new encoding is a UTF-16 encoding, change it to
-   UTF-8.</li>
+   <li>If the new encoding is <span>a UTF-16 encoding</span>, change
+   it to UTF-8.</li>
 
    <li>If all the bytes up to the last byte converted by the current
    decoder have the same Unicode interpretations in both the current
@@ -99925,7 +99932,7 @@
     <p id="meta-charset-during-parse">If the element has a <code
     title="attr-meta-charset">charset</code> attribute, and its value
     is either a supported <span>ASCII-compatible character
-    encoding</span> or a UTF-16 encoding, and the <span
+    encoding</span> or <span>a UTF-16 encoding</span>, and the <span
     title="concept-encoding-confidence">confidence</span> is currently
     <i>tentative</i>, then <span>change the encoding</span> to the
     encoding given by the value of the <code
@@ -99938,8 +99945,8 @@
     <code title="attr-meta-content">content</code> attribute, and
     applying the <span>algorithm for extracting an encoding from a
     <code>meta</code> element</span> to that attribute's value returns
-    a supported <span>ASCII-compatible character encoding</span> or a
-    UTF-16 encoding, and the <span
+    a supported <span>ASCII-compatible character encoding</span> or
+    <span>a UTF-16 encoding</span>, and the <span
     title="concept-encoding-confidence">confidence</span> is currently
     <i>tentative</i>, then <span>change the encoding</span> to the
     extracted encoding.</p>
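
The spec wording touched by this revision can be illustrated with a small sketch. It is not taken from the commit or from any user agent: the function names and the label set below are hypothetical, and real implementations recognise many more encoding aliases. It shows three things the changed text describes: treating every UTF-16 variant as "a UTF-16 encoding", replacing such an encoding with UTF-8 when it comes from a charset declaration, and defaulting to little-endian when self-describing UTF-16 has no BOM.

```python
import codecs

# Hypothetical label set for "a UTF-16 encoding": the self-describing form
# plus the two raw byte orders. Real user agents accept more aliases.
UTF16_LABELS = {"utf-16", "utf-16le", "utf-16be"}


def is_utf16_encoding(label: str) -> bool:
    """Return True if the label names any UTF-16 variant."""
    return label.strip().lower() in UTF16_LABELS


def encoding_from_declaration(label: str) -> str:
    """Sketch of the step: if charset is a UTF-16 encoding, change it to UTF-8."""
    normalized = label.strip().lower()
    return "utf-8" if is_utf16_encoding(normalized) else normalized


def decode_self_describing_utf16(data: bytes) -> str:
    """Decode self-describing UTF-16, assuming little-endian when no BOM is found."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # No BOM: the spec text requires little-endian, a willful violation of
    # RFC 2781, which would otherwise imply big-endian.
    return data.decode("utf-16-le")
```

For instance, `encoding_from_declaration("UTF-16BE")` yields `"utf-8"` under these assumptions, while a BOM-less byte string is read as little-endian code units.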



