[html5] r6184 - [giow] (0) Try to clean up the stuff about Unicode characters. Fixing http://www [...]

Fri Jun 3 12:40:11 PDT 2011

Author: ianh
Date: 2011-06-03 12:40:10 -0700 (Fri, 03 Jun 2011)
New Revision: 6184

Modified:
   complete.html
   index
   source
Log:
[giow] (0) Try to clean up the stuff about Unicode characters.
Fixing http://www.w3.org/Bugs/Public/show_bug.cgi?id=12100

Modified: complete.html
===================================================================

--- complete.html	2011-06-03 01:21:42 UTC (rev 6183)
+++ complete.html	2011-06-03 19:40:10 UTC (rev 6184)
@@ -2944,9 +2944,8 @@
    different <meta charset> elements applying in each case.
   -->
 
-  <p>The term <dfn title="">Unicode character</dfn> is used to mean a
-  <i title="">Unicode scalar value</i> (i.e. any Unicode code point
-  that is not a surrogate code point). <a href=#refsUNICODE>[UNICODE]</a></p>
+  <p>The term <dfn id=unicode-character>Unicode character</dfn> is used to mean a <i title="">Unicode scalar value</i> (i.e. any Unicode code point that
+  is not a surrogate code point). <a href=#refsUNICODE>[UNICODE]</a></p>
 
 
 
@@ -3425,14 +3424,6 @@
     is passed an Infinity or Not-a-Number (NaN) value, a
     <code><a href=#not_supported_err>NOT_SUPPORTED_ERR</a></code> exception must be raised.</p>
 
-    <p>Except where otherwise specified, if a method has an argument
-    of type <code>DOMString</code>, or if an IDL attribute is assigned
-    a new value of type <code>DOMString</code>, the user agent must
-    <span title=dfn-obtain-unicode>convert the
-    <code>DOMString</code> to a sequence of Unicode characters</span>
-    to obtain the string on which the algorithms in this specification
-    are to operate. <a href=#refsWEBIDL>[WEBIDL]</a></p>
-
    </dd>
 
    <dt>JavaScript</dt>
@@ -6380,7 +6371,9 @@
     characters as defined by UTF-8.</p>
 
     <p>If any percent-encoded octets in that component are not valid
-    UTF-8 sequences, then return an error and abort these steps.</p>
+    UTF-8 sequences (e.g. sequences of percent-encoded octets that
+    expand to surrogate code points), then return an error and abort
+    these steps.</p>
 
     <p>Apply the IDNA ToASCII algorithm to the matching substring,
     with both the AllowUnassigned and UseSTD3ASCIIRules flags
@@ -16096,11 +16089,11 @@
 
          <dd>
 
-          <p>The contents of that file, interpreted as string of
-          Unicode characters, are the script source.</p>
+          <p>The contents of that file, interpreted as a Unicode
+          string, are the script source.</p>
 
-          <p>To obtain the string of Unicode characters, the user
-          agent run the following steps:</p>
+          <p>To obtain the Unicode string, the user agent run the
+          following steps:</p>
 
           <ol><li><p>If the resource's <a href=#content-type title=Content-Type>Content
            Type metadata</a>, if any, specifies a character
@@ -16471,11 +16464,11 @@
 star          = %x002A ; U+002A ASTERISK (*)
 slash         = %x002F ; U+002F SOLIDUS (/)
 not-newline   = %x0000-0009 / %x000B-10FFFF
-                ; a Unicode character other than U+000A LINE FEED (LF)
+                ; a <a href=#unicode-character>Unicode character</a> other than U+000A LINE FEED (LF)
 not-star      = %x0000-0029 / %x002B-10FFFF
-                ; a Unicode character other than U+002A ASTERISK (*)
+                ; a <a href=#unicode-character>Unicode character</a> other than U+002A ASTERISK (*)
 not-slash     = %x0000-002E / %x0030-10FFFF
-                ; a Unicode character other than U+002F SOLIDUS (/)</pre>
+                ; a <a href=#unicode-character>Unicode character</a> other than U+002F SOLIDUS (/)</pre>
 
   <p class=note>This corresponds to putting the contents of the
   element in JavaScript comments.</p>
@@ -32310,18 +32303,13 @@
   parsing the provided byte stream. If the stream lacks this WebVTT
   file signature, then the parser aborts.</p>
 
-  <p>When converting the bytes into Unicode characters, if the
-  encoding used is UTF-8, the bytes must be <a href=#decoded-as-utf-8,-with-error-handling title="decoded as
-  UTF-8, with error handling">decoded with the error handling</a>
-  defined in this specification, and all U+0000 NULL characters must
-  be replaced by U+FFFD REPLACEMENT CHARACTERs.</p>
-
   <p>The <dfn id=webvtt-parser-algorithm>WebVTT parser algorithm</dfn> is as follows:</p>
 
   <ol><li><p>Let <var title="">input</var> be the string being parsed,
-   after conversion to Unicode and after the replacement of U+0000
-   NULL characters described above.</li>
+   after conversion to Unicode.</li>
 
+   <li><p>Replace all U+0000 NULL characters in <var title="">input</var> by U+FFFD REPLACEMENT CHARACTERs.</li>
+
    <li><p>Let <var title="">position</var> be a pointer into <var title="">input</var>, initially pointing at the start of the
    string. In an <a href=#incremental-webvtt-parser>incremental WebVTT parser</a>, when this
    algorithm (or further algorithms that it uses) moves the <var title="">position</var> pointer, the user agent must wait until
@@ -64072,14 +64060,14 @@
    <li><p>Let <var title="">decoded fragid</var> be the result of
    expanding any sequences of percent-encoded octets in <var title="">fragid</var> that are valid UTF-8 sequences into Unicode
    characters as defined by UTF-8. If any percent-encoded octets in
-   that string are not valid UTF-8 sequences, then skip this step and
-   the next one.</p>
+   that string are not valid UTF-8 sequences (e.g. they expand to
+   surrogate code points), then skip this step and the next one.</p>
 
    <li><p>If this step was not skipped and there is an element in the
-   DOM that has an <a href=#concept-id title=concept-id>ID</a> exactly equal to <var title="">decoded
-   fragid</var>, then the first such element in tree order is
-   <a href=#the-indicated-part-of-the-document>the indicated part of the document</a>; stop the algorithm
-   here.</li>
+   DOM that has an <a href=#concept-id title=concept-id>ID</a> exactly equal to
+   <var title="">decoded fragid</var>, then the first such element in
+   tree order is <a href=#the-indicated-part-of-the-document>the indicated part of the document</a>; stop
+   the algorithm here.</li>
 
    <li><p>If there is an <code><a href=#the-a-element>a</a></code> element in the DOM that has a
    <code title=attr-a-name><a href=#attr-a-name>name</a></code> attribute whose value is
@@ -78565,9 +78553,9 @@
 colon         = %x003A ; U+003A COLON (:)
 bom           = %xFEFF ; U+FEFF BYTE ORDER MARK
 name-char     = %x0000-0009 / %x000B-000C / %x000E-0039 / %x003B-10FFFF
-                ; a Unicode character other than U+000A LINE FEED (LF), U+000D CARRIAGE RETURN (CR), or U+003A COLON (:)
+                ; a <a href=#unicode-character>Unicode character</a> other than U+000A LINE FEED (LF), U+000D CARRIAGE RETURN (CR), or U+003A COLON (:)
 any-char      = %x0000-0009 / %x000B-000C / %x000E-10FFFF
-                ; a Unicode character other than U+000A LINE FEED (LF) or U+000D CARRIAGE RETURN (CR)</pre>
+                ; a <a href=#unicode-character>Unicode character</a> other than U+000A LINE FEED (LF) or U+000D CARRIAGE RETURN (CR)</pre>
 
   <p>Event streams in this format must always be encoded as
   UTF-8. <a href=#refsRFC3629>[RFC3629]</a></p>
@@ -81820,12 +81808,13 @@
   <h4 id=text-1><span class=secno>13.1.3 </span>Text</h4>
 
   <p><dfn id=syntax-text title=syntax-text>Text</dfn> is allowed inside elements,
-  attribute values, and comments. Text must consist of Unicode
-  characters. Text must not contain U+0000 characters. Text must not
-  contain permanently undefined Unicode characters (noncharacters).
-  Text must not contain control characters other than <a href=#space-character title="space character">space characters</a>. Extra constraints
-  are placed on what is and what is not allowed in text based on where
-  the text is to be put, as described in the other sections.</p>
+  attribute values, and comments. Text must consist of <a href=#unicode-character title="Unicode character">Unicode characters</a>. Text must not
+  contain U+0000 characters. Text must not contain permanently
+  undefined Unicode characters (noncharacters). Text must not contain
+  control characters other than <a href=#space-character title="space character">space
+  characters</a>. Extra constraints are placed on what is and what
+  is not allowed in text based on where the text is to be put, as
+  described in the other sections.</p>
 
 
   <h5 id=newlines><span class=secno>13.1.3.1 </span>Newlines</h5>
@@ -82020,7 +82009,7 @@
   <h4 id=overview-of-the-parsing-model><span class=secno>13.2.1 </span>Overview of the parsing model</h4>
 
   <p>The input to the HTML parsing process consists of a stream of
-  Unicode characters, which is passed through a
+  Unicode code points, which is passed through a
   <a href=#tokenization>tokenization</a> stage followed by a <a href=#tree-construction>tree
   construction</a> stage. The output is a <code><a href=#document>Document</a></code>
   object.</p>
@@ -82069,7 +82058,7 @@
 
   <h4 id=the-input-stream><span class=secno>13.2.2 </span>The <dfn>input stream</dfn></h4>
 
-  <p>The stream of Unicode characters that comprises the input to the
+  <p>The stream of Unicode code points that comprises the input to the
   tokenization stage will be initially seen by the user agent as a
   stream of bytes (typically coming over the network or from the local
   file system). The bytes encode the actual characters according to a
@@ -82107,8 +82096,8 @@
   that encoding is <i>tentative</i> or <i>certain</i>, is <a href=#meta-charset-during-parse>used during the parsing</a> to
   determine whether to <a href=#change-the-encoding>change the encoding</a>. If no
   encoding is necessary, e.g. because the parser is operating on a
-  stream of Unicode characters and doesn't have to use an encoding at
-  all, then the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is
+  Unicode stream and doesn't have to use an encoding at all, then the
+  <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is
   <i>irrelevant</i>.</p>
 
   <ol><li><p>If the user has explicitly instructed the user agent to
@@ -82730,7 +82719,7 @@
   <h5 id=preprocessing-the-input-stream><span class=secno>13.2.2.3 </span>Preprocessing the input stream</h5>
 
   <p>Given an encoding, the bytes in the input stream must be
-  converted to Unicode characters for the tokenizer, as described by
+  converted to Unicode code points for the tokenizer, as described by
   the rules for that encoding, except that the leading U+FEFF BYTE
   ORDER MARK character, if any, must not be stripped by the encoding
   layer (it is stripped by the rule below).</p> <!-- this is to
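
Illustrative note on the hunks above (not part of the commit): the diff defines a "Unicode character" as a Unicode scalar value, and says that percent-encoded octets expanding to surrogate code points are not valid UTF-8. A minimal Python sketch of both ideas follows. The function names are hypothetical, and treating the whole component as one strict UTF-8 decode is a simplification of the spec's step-by-step wording.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def is_unicode_character(code_point: int) -> bool:
    """Return True if code_point is a Unicode scalar value.

    A scalar value is any code point in U+0000..U+10FFFF that is not a
    surrogate code point (U+D800..U+DFFF).
    """
    in_range = 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_range and not is_surrogate


def decode_percent_encoded(component: str) -> Optional[str]:
    """Expand percent-encoded octets as strict UTF-8.

    Returns None (standing in for "return an error and abort these
    steps") when the octets are not valid UTF-8, which includes octet
    sequences that would expand to surrogate code points.
    """
    raw = unquote_to_bytes(component)
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
```

For instance, `decode_percent_encoded("%ED%A0%80")` yields `None` here, because those three octets are the would-be UTF-8 form of the surrogate U+D800 and Python's strict decoder rejects them.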

Modified: index
===================================================================
--- index	2011-06-03 01:21:42 UTC (rev 6183)
+++ index	2011-06-03 19:40:10 UTC (rev 6184)
@@ -2961,9 +2961,8 @@
    different <meta charset> elements applying in each case.
   -->
 
-  <p>The term <dfn title="">Unicode character</dfn> is used to mean a
-  <i title="">Unicode scalar value</i> (i.e. any Unicode code point
-  that is not a surrogate code point). <a href=#refsUNICODE>[UNICODE]</a></p>
+  <p>The term <dfn id=unicode-character>Unicode character</dfn> is used to mean a <i title="">Unicode scalar value</i> (i.e. any Unicode code point that
+  is not a surrogate code point). <a href=#refsUNICODE>[UNICODE]</a></p>
 
 
 
@@ -3442,14 +3441,6 @@
     is passed an Infinity or Not-a-Number (NaN) value, a
     <code><a href=#not_supported_err>NOT_SUPPORTED_ERR</a></code> exception must be raised.</p>
 
-    <p>Except where otherwise specified, if a method has an argument
-    of type <code>DOMString</code>, or if an IDL attribute is assigned
-    a new value of type <code>DOMString</code>, the user agent must
-    <span title=dfn-obtain-unicode>convert the
-    <code>DOMString</code> to a sequence of Unicode characters</span>
-    to obtain the string on which the algorithms in this specification
-    are to operate. <a href=#refsWEBIDL>[WEBIDL]</a></p>
-
    </dd>
 
    <dt>JavaScript</dt>
@@ -6366,7 +6357,9 @@
     characters as defined by UTF-8.</p>
 
     <p>If any percent-encoded octets in that component are not valid
-    UTF-8 sequences, then return an error and abort these steps.</p>
+    UTF-8 sequences (e.g. sequences of percent-encoded octets that
+    expand to surrogate code points), then return an error and abort
+    these steps.</p>
 
     <p>Apply the IDNA ToASCII algorithm to the matching substring,
     with both the AllowUnassigned and UseSTD3ASCIIRules flags
@@ -16082,11 +16075,11 @@
 
          <dd>
 
-          <p>The contents of that file, interpreted as string of
-          Unicode characters, are the script source.</p>
+          <p>The contents of that file, interpreted as a Unicode
+          string, are the script source.</p>
 
-          <p>To obtain the string of Unicode characters, the user
-          agent run the following steps:</p>
+          <p>To obtain the Unicode string, the user agent run the
+          following steps:</p>
 
           <ol><li><p>If the resource's <a href=#content-type title=Content-Type>Content
            Type metadata</a>, if any, specifies a character
@@ -16457,11 +16450,11 @@
 star          = %x002A ; U+002A ASTERISK (*)
 slash         = %x002F ; U+002F SOLIDUS (/)
 not-newline   = %x0000-0009 / %x000B-10FFFF
-                ; a Unicode character other than U+000A LINE FEED (LF)
+                ; a <a href=#unicode-character>Unicode character</a> other than U+000A LINE FEED (LF)
 not-star      = %x0000-0029 / %x002B-10FFFF
-                ; a Unicode character other than U+002A ASTERISK (*)
+                ; a <a href=#unicode-character>Unicode character</a> other than U+002A ASTERISK (*)
 not-slash     = %x0000-002E / %x0030-10FFFF
-                ; a Unicode character other than U+002F SOLIDUS (/)</pre>
+                ; a <a href=#unicode-character>Unicode character</a> other than U+002F SOLIDUS (/)</pre>
 
   <p class=note>This corresponds to putting the contents of the
   element in JavaScript comments.</p>
@@ -32299,18 +32292,13 @@
   parsing the provided byte stream. If the stream lacks this WebVTT
   file signature, then the parser aborts.</p>
 
-  <p>When converting the bytes into Unicode characters, if the
-  encoding used is UTF-8, the bytes must be <a href=#decoded-as-utf-8,-with-error-handling title="decoded as
-  UTF-8, with error handling">decoded with the error handling</a>
-  defined in this specification, and all U+0000 NULL characters must
-  be replaced by U+FFFD REPLACEMENT CHARACTERs.</p>
-
   <p>The <dfn id=webvtt-parser-algorithm>WebVTT parser algorithm</dfn> is as follows:</p>
 
   <ol><li><p>Let <var title="">input</var> be the string being parsed,
-   after conversion to Unicode and after the replacement of U+0000
-   NULL characters described above.</li>
+   after conversion to Unicode.</li>
 
+   <li><p>Replace all U+0000 NULL characters in <var title="">input</var> by U+FFFD REPLACEMENT CHARACTERs.</li>
+
    <li><p>Let <var title="">position</var> be a pointer into <var title="">input</var>, initially pointing at the start of the
    string. In an <a href=#incremental-webvtt-parser>incremental WebVTT parser</a>, when this
    algorithm (or further algorithms that it uses) moves the <var title="">position</var> pointer, the user agent must wait until
@@ -64061,14 +64049,14 @@
    <li><p>Let <var title="">decoded fragid</var> be the result of
    expanding any sequences of percent-encoded octets in <var title="">fragid</var> that are valid UTF-8 sequences into Unicode
    characters as defined by UTF-8. If any percent-encoded octets in
-   that string are not valid UTF-8 sequences, then skip this step and
-   the next one.</p>
+   that string are not valid UTF-8 sequences (e.g. they expand to
+   surrogate code points), then skip this step and the next one.</p>
 
    <li><p>If this step was not skipped and there is an element in the
-   DOM that has an <a href=#concept-id title=concept-id>ID</a> exactly equal to <var title="">decoded
-   fragid</var>, then the first such element in tree order is
-   <a href=#the-indicated-part-of-the-document>the indicated part of the document</a>; stop the algorithm
-   here.</li>
+   DOM that has an <a href=#concept-id title=concept-id>ID</a> exactly equal to
+   <var title="">decoded fragid</var>, then the first such element in
+   tree order is <a href=#the-indicated-part-of-the-document>the indicated part of the document</a>; stop
+   the algorithm here.</li>
 
    <li><p>If there is an <code><a href=#the-a-element>a</a></code> element in the DOM that has a
    <code title=attr-a-name><a href=#attr-a-name>name</a></code> attribute whose value is
@@ -77566,12 +77554,13 @@
   <h4 id=text-1><span class=secno>11.1.3 </span>Text</h4>
 
   <p><dfn id=syntax-text title=syntax-text>Text</dfn> is allowed inside elements,
-  attribute values, and comments. Text must consist of Unicode
-  characters. Text must not contain U+0000 characters. Text must not
-  contain permanently undefined Unicode characters (noncharacters).
-  Text must not contain control characters other than <a href=#space-character title="space character">space characters</a>. Extra constraints
-  are placed on what is and what is not allowed in text based on where
-  the text is to be put, as described in the other sections.</p>
+  attribute values, and comments. Text must consist of <a href=#unicode-character title="Unicode character">Unicode characters</a>. Text must not
+  contain U+0000 characters. Text must not contain permanently
+  undefined Unicode characters (noncharacters). Text must not contain
+  control characters other than <a href=#space-character title="space character">space
+  characters</a>. Extra constraints are placed on what is and what
+  is not allowed in text based on where the text is to be put, as
+  described in the other sections.</p>
 
 
   <h5 id=newlines><span class=secno>11.1.3.1 </span>Newlines</h5>
@@ -77766,7 +77755,7 @@
   <h4 id=overview-of-the-parsing-model><span class=secno>11.2.1 </span>Overview of the parsing model</h4>
 
   <p>The input to the HTML parsing process consists of a stream of
-  Unicode characters, which is passed through a
+  Unicode code points, which is passed through a
   <a href=#tokenization>tokenization</a> stage followed by a <a href=#tree-construction>tree
   construction</a> stage. The output is a <code><a href=#document>Document</a></code>
   object.</p>
@@ -77815,7 +77804,7 @@
 
   <h4 id=the-input-stream><span class=secno>11.2.2 </span>The <dfn>input stream</dfn></h4>
 
-  <p>The stream of Unicode characters that comprises the input to the
+  <p>The stream of Unicode code points that comprises the input to the
   tokenization stage will be initially seen by the user agent as a
   stream of bytes (typically coming over the network or from the local
   file system). The bytes encode the actual characters according to a
@@ -77853,8 +77842,8 @@
   that encoding is <i>tentative</i> or <i>certain</i>, is <a href=#meta-charset-during-parse>used during the parsing</a> to
   determine whether to <a href=#change-the-encoding>change the encoding</a>. If no
   encoding is necessary, e.g. because the parser is operating on a
-  stream of Unicode characters and doesn't have to use an encoding at
-  all, then the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is
+  Unicode stream and doesn't have to use an encoding at all, then the
+  <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is
   <i>irrelevant</i>.</p>
 
   <ol><li><p>If the user has explicitly instructed the user agent to
@@ -78476,7 +78465,7 @@
   <h5 id=preprocessing-the-input-stream><span class=secno>11.2.2.3 </span>Preprocessing the input stream</h5>
 
   <p>Given an encoding, the bytes in the input stream must be
-  converted to Unicode characters for the tokenizer, as described by
+  converted to Unicode code points for the tokenizer, as described by
   the rules for that encoding, except that the leading U+FEFF BYTE
   ORDER MARK character, if any, must not be stripped by the encoding
   layer (it is stripped by the rule below).</p> <!-- this is to
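
Illustrative note on the WebVTT hunk above (not part of the commit): the change moves the U+0000 replacement into the parser algorithm as its own step, after conversion to Unicode. A rough Python sketch of those first two steps is below. The function name is hypothetical, and decoding as UTF-8 with malformed sequences replaced by U+FFFD is an assumption; the revised text only says "after conversion to Unicode".

```python
def prepare_webvtt_input(data: bytes) -> str:
    """Sketch of the first two steps of the WebVTT parser algorithm."""
    # Step 1: conversion to Unicode. Assumption: UTF-8, with malformed
    # byte sequences replaced by U+FFFD rather than raising an error.
    text = data.decode("utf-8", errors="replace")
    # Step 2: replace all U+0000 NULL characters by U+FFFD
    # REPLACEMENT CHARACTERs.
    return text.replace("\u0000", "\ufffd")
```

Keeping the replacement as a separate step means later steps of the algorithm never see a NULL character in the input string.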

Modified: source
===================================================================
--- source	2011-06-03 01:21:42 UTC (rev 6183)
+++ source	2011-06-03 19:40:10 UTC (rev 6184)
@@ -1908,9 +1908,9 @@
    different <meta charset> elements applying in each case.
   -->
 
-  <p>The term <dfn title="">Unicode character</dfn> is used to mean a
-  <i title="">Unicode scalar value</i> (i.e. any Unicode code point
-  that is not a surrogate code point). <a
+  <p>The term <dfn>Unicode character</dfn> is used to mean a <i
+  title="">Unicode scalar value</i> (i.e. any Unicode code point that
+  is not a surrogate code point). <a
   href="#refsUNICODE">[UNICODE]</a></p>
 
 
@@ -2448,14 +2448,6 @@
     is passed an Infinity or Not-a-Number (NaN) value, a
     <code>NOT_SUPPORTED_ERR</code> exception must be raised.</p>
 
-    <p>Except where otherwise specified, if a method has an argument
-    of type <code>DOMString</code>, or if an IDL attribute is assigned
-    a new value of type <code>DOMString</code>, the user agent must
-    <span title="dfn-obtain-unicode">convert the
-    <code>DOMString</code> to a sequence of Unicode characters</span>
-    to obtain the string on which the algorithms in this specification
-    are to operate. <a href="#refsWEBIDL">[WEBIDL]</a></p>
-
    </dd>
 
    <dt>JavaScript</dt>
@@ -6100,7 +6092,9 @@
     characters as defined by UTF-8.</p>
 
     <p>If any percent-encoded octets in that component are not valid
-    UTF-8 sequences, then return an error and abort these steps.</p>
+    UTF-8 sequences (e.g. sequences of percent-encoded octets that
+    expand to surrogate code points), then return an error and abort
+    these steps.</p>
 
     <p>Apply the IDNA ToASCII algorithm to the matching substring,
     with both the AllowUnassigned and UseSTD3ASCIIRules flags
@@ -17326,11 +17320,11 @@
 
          <dd>
 
-          <p>The contents of that file, interpreted as string of
-          Unicode characters, are the script source.</p>
+          <p>The contents of that file, interpreted as a Unicode
+          string, are the script source.</p>
 
-          <p>To obtain the string of Unicode characters, the user
-          agent run the following steps:</p>
+          <p>To obtain the Unicode string, the user agent run the
+          following steps:</p>
 
           <ol>
 
@@ -17747,11 +17741,11 @@
 star          = %x002A ; U+002A ASTERISK (*)
 slash         = %x002F ; U+002F SOLIDUS (/)
 not-newline   = %x0000-0009 / %x000B-10FFFF
-                ; a Unicode character other than U+000A LINE FEED (LF)
+                ; a <span>Unicode character</span> other than U+000A LINE FEED (LF)
 not-star      = %x0000-0029 / %x002B-10FFFF
-                ; a Unicode character other than U+002A ASTERISK (*)
+                ; a <span>Unicode character</span> other than U+002A ASTERISK (*)
 not-slash     = %x0000-002E / %x0030-10FFFF
-                ; a Unicode character other than U+002F SOLIDUS (/)</pre>
+                ; a <span>Unicode character</span> other than U+002F SOLIDUS (/)</pre>
 
   <p class="note">This corresponds to putting the contents of the
   element in JavaScript comments.</p>
@@ -35527,20 +35521,16 @@
   parsing the provided byte stream. If the stream lacks this WebVTT
   file signature, then the parser aborts.</p>
 
-  <p>When converting the bytes into Unicode characters, if the
-  encoding used is UTF-8, the bytes must be <span title="decoded as
-  UTF-8, with error handling">decoded with the error handling</span>
-  defined in this specification, and all U+0000 NULL characters must
-  be replaced by U+FFFD REPLACEMENT CHARACTERs.</p>
-
   <p>The <dfn>WebVTT parser algorithm</dfn> is as follows:</p>
 
   <ol>
 
    <li><p>Let <var title="">input</var> be the string being parsed,
-   after conversion to Unicode and after the replacement of U+0000
-   NULL characters described above.</p></li>
+   after conversion to Unicode.</p></li>
 
+   <li><p>Replace all U+0000 NULL characters in <var
+   title="">input</var> by U+FFFD REPLACEMENT CHARACTERs.</p></li>
+
    <li><p>Let <var title="">position</var> be a pointer into <var
    title="">input</var>, initially pointing at the start of the
    string. In an <span>incremental WebVTT parser</span>, when this
@@ -72991,14 +72981,14 @@
    expanding any sequences of percent-encoded octets in <var
    title="">fragid</var> that are valid UTF-8 sequences into Unicode
    characters as defined by UTF-8. If any percent-encoded octets in
-   that string are not valid UTF-8 sequences, then skip this step and
-   the next one.</p>
+   that string are not valid UTF-8 sequences (e.g. they expand to
+   surrogate code points), then skip this step and the next one.</p>
 
    <li><p>If this step was not skipped and there is an element in the
-   DOM that has an <span title="concept-id">ID</span> exactly equal to <var title="">decoded
-   fragid</var>, then the first such element in tree order is
-   <span>the indicated part of the document</span>; stop the algorithm
-   here.</p></li>
+   DOM that has an <span title="concept-id">ID</span> exactly equal to
+   <var title="">decoded fragid</var>, then the first such element in
+   tree order is <span>the indicated part of the document</span>; stop
+   the algorithm here.</p></li>
 
    <li><p>If there is an <code>a</code> element in the DOM that has a
    <code title="attr-a-name">name</code> attribute whose value is
@@ -89195,9 +89185,9 @@
 colon         = %x003A ; U+003A COLON (:)
 bom           = %xFEFF ; U+FEFF BYTE ORDER MARK
 name-char     = %x0000-0009 / %x000B-000C / %x000E-0039 / %x003B-10FFFF
-                ; a Unicode character other than U+000A LINE FEED (LF), U+000D CARRIAGE RETURN (CR), or U+003A COLON (:)
+                ; a <span>Unicode character</span> other than U+000A LINE FEED (LF), U+000D CARRIAGE RETURN (CR), or U+003A COLON (:)
 any-char      = %x0000-0009 / %x000B-000C / %x000E-10FFFF
-                ; a Unicode character other than U+000A LINE FEED (LF) or U+000D CARRIAGE RETURN (CR)</pre>
+                ; a <span>Unicode character</span> other than U+000A LINE FEED (LF) or U+000D CARRIAGE RETURN (CR)</pre>
 
   <p>Event streams in this format must always be encoded as
   UTF-8. <a href="#refsRFC3629">[RFC3629]</a></p>
@@ -92952,13 +92942,14 @@
   <h4>Text</h4>
 
   <p><dfn title="syntax-text">Text</dfn> is allowed inside elements,
-  attribute values, and comments. Text must consist of Unicode
-  characters. Text must not contain U+0000 characters. Text must not
-  contain permanently undefined Unicode characters (noncharacters).
-  Text must not contain control characters other than <span
-  title="space character">space characters</span>. Extra constraints
-  are placed on what is and what is not allowed in text based on where
-  the text is to be put, as described in the other sections.</p>
+  attribute values, and comments. Text must consist of <span
+  title="Unicode character">Unicode characters</span>. Text must not
+  contain U+0000 characters. Text must not contain permanently
+  undefined Unicode characters (noncharacters). Text must not contain
+  control characters other than <span title="space character">space
+  characters</span>. Extra constraints are placed on what is and what
+  is not allowed in text based on where the text is to be put, as
+  described in the other sections.</p>
 
 
   <h5>Newlines</h5>
@@ -93165,7 +93156,7 @@
   <h4>Overview of the parsing model</h4>
 
   <p>The input to the HTML parsing process consists of a stream of
-  Unicode characters, which is passed through a
+  Unicode code points, which is passed through a
   <span>tokenization</span> stage followed by a <span>tree
   construction</span> stage. The output is a <code>Document</code>
   object.</p>
@@ -93215,7 +93206,7 @@
 
   <h4>The <dfn>input stream</dfn></h4>
 
-  <p>The stream of Unicode characters that comprises the input to the
+  <p>The stream of Unicode code points that comprises the input to the
   tokenization stage will be initially seen by the user agent as a
   stream of bytes (typically coming over the network or from the local
   file system). The bytes encode the actual characters according to a
@@ -93256,9 +93247,8 @@
   href="#meta-charset-during-parse">used during the parsing</a> to
   determine whether to <span>change the encoding</span>. If no
   encoding is necessary, e.g. because the parser is operating on a
-  stream of Unicode characters and doesn't have to use an encoding at
-  all, then the <span
-  title="concept-encoding-confidence">confidence</span> is
+  Unicode stream and doesn't have to use an encoding at all, then the
+  <span title="concept-encoding-confidence">confidence</span> is
   <i>irrelevant</i>.</p>
 
   <ol>
@@ -94029,7 +94019,7 @@
   <h5>Preprocessing the input stream</h5>
 
   <p>Given an encoding, the bytes in the input stream must be
-  converted to Unicode characters for the tokenizer, as described by
+  converted to Unicode code points for the tokenizer, as described by
   the rules for that encoding, except that the leading U+FEFF BYTE
   ORDER MARK character, if any, must not be stripped by the encoding
   layer (it is stripped by the rule below).</p> <!-- this is to
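
Illustrative note on the last hunk above (not part of the commit): the text requires that a leading U+FEFF BYTE ORDER MARK survive the encoding layer and be stripped by a later rule. A small Python sketch of that separation follows. The function name is hypothetical, the later rule itself is not shown in this diff, and the sketch relies on Python's "utf-8" codec leaving a leading BOM in place; other codecs (for example "utf-16") consume the BOM themselves, so this is only meant for the UTF-8 case.

```python
def decode_for_tokenizer(data: bytes) -> str:
    """Decode UTF-8 bytes, then strip one leading BOM in a separate step."""
    # Encoding layer: Python's "utf-8" codec (unlike "utf-8-sig") does not
    # strip a leading U+FEFF, so the BOM is still present after this call.
    text = data.decode("utf-8", errors="replace")
    # Separate preprocessing step (assumed from the parenthetical in the
    # spec text): drop a single leading U+FEFF, if any.
    if text.startswith("\ufeff"):
        text = text[1:]
    return text
```

Only one leading U+FEFF is removed in this sketch; a second one, if present, stays in the stream.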