[html5] r3871 - [ct] (2) Make surrogates in UTF-8 and character references turn into U+FFFD to p [...]

Wed Sep 16 02:22:02 PDT 2009

Author: ianh
Date: 2009-09-16 02:22:01 -0700 (Wed, 16 Sep 2009)
New Revision: 3871

Modified:
   index
   source
Log:
[ct] (2) Make surrogates in UTF-8 and character references turn into U+FFFD to prevent UTF-16 environments having hard-to-handle bugs.
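
For readers who want the two rule changes in executable form, here is a minimal sketch in Python. It is not part of the commit. The function names are hypothetical, and the Windows-1252 remapping table and the separate list of parse-error-only characters from the surrounding spec text are deliberately left out.

```python
REPLACEMENT = "\uFFFD"  # U+FFFD REPLACEMENT CHARACTER


def is_surrogate(code_point):
    """True for code points in the surrogate range U+D800 to U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


def preprocess_input_stream(text):
    """Replace U+0000 and surrogates with U+FFFD.

    Returns the cleaned text and the number of replacements made
    (each replacement corresponds to one parse error in the spec text).
    Hypothetical helper: only the replacement rule is modelled here.
    """
    out = []
    replaced = 0
    for ch in text:
        if ch == "\x00" or is_surrogate(ord(ch)):
            out.append(REPLACEMENT)
            replaced += 1
        else:
            out.append(ch)
    return "".join(out), replaced


def numeric_reference_to_char(number):
    """Map the number from a numeric character reference to a character.

    Assumes the number was not matched by the Windows-1252 table, which
    this sketch omits. Surrogates and values above 0x10FFFF yield U+FFFD.
    """
    if is_surrogate(number) or number > 0x10FFFF:
        return REPLACEMENT
    return chr(number)
```

With this sketch, preprocess_input_stream("a\x00b\ud800c") gives ("a\ufffdb\ufffdc", 2), and numeric_reference_to_char(0xD800) gives U+FFFD, so no surrogate code point survives to be reinterpreted by a later UTF-16 stage.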

Modified: index
===================================================================

--- index	2009-09-16 08:13:04 UTC (rev 3870)
+++ index	2009-09-16 09:22:01 UTC (rev 3871)
@@ -62159,23 +62159,25 @@
   motivated by a desire to increase the resilience of user agents in
   the face of naïve transcoders.</p>
 
-  <p>All U+0000 NULL characters in the input must be replaced by
-  U+FFFD REPLACEMENT CHARACTERs. Any occurrences of such characters is
-  a <a href=#parse-error>parse error</a>.</p>
+  <p>All U+0000 NULL characters and characters in the range U+D800 to
+  U+DFFF<!-- surrogates not allowed e.g. in UTF-8, and we don't want
+  them to suddenly turn into codepoints when they go through a UTF-16
+  pipe --> in the input must be replaced by U+FFFD REPLACEMENT
+  CHARACTERs. Any occurrences of such characters is a <a href=#parse-error>parse
+  error</a>.</p>
 
   <p>Any occurrences of any characters in the ranges U+0001 to U+0008,
   <!-- HT, LF allowed --> <!-- U+000B is in the next list --> <!-- FF,
   CR allowed --> U+000E to U+001F, <!-- ASCII allowed --> U+007F
-  <!--to U+0084, (U+0085 NEL not allowed), U+0086--> to U+009F, U+D800
-  to U+DFFF<!-- surrogates not allowed -->, U+FDD0 to U+FDEF, and
-  characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE,
-  U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF,
-  U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE,
-  U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF,
-  U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and
-  U+10FFFF are <a href=#parse-error title="parse error">parse errors</a>. (These
-  are all control characters or permanently undefined Unicode
-  characters.)</p>
+  <!--to U+0084, (U+0085 NEL not allowed), U+0086--> to U+009F, U+FDD0
+  to U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF,
+  U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE,
+  U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF,
+  U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE,
+  U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF,
+  U+10FFFE, and U+10FFFF are <a href=#parse-error title="parse error">parse
+  errors</a>. (These are all control characters or permanently
+  undefined Unicode characters.)</p>
 
   <p>U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
   characters are treated specially. Any CR characters that are
@@ -64016,9 +64018,11 @@
       <tr><td>0x9D <td>U+009D <td><control>
       <tr><td>0x9E <td>U+017E <td>LATIN SMALL LETTER Z WITH CARON ('ž')
       <tr><td>0x9F <td>U+0178 <td>LATIN CAPITAL LETTER Y WITH DIAERESIS ('Ÿ')
-    </table><p>Otherwise, if the number is greater than 0x10FFFF, then this is
-    a <a href=#parse-error>parse error</a>. Return a U+FFFD REPLACEMENT
-    CHARACTER.</p>
+    </table><p>Otherwise, if the number is in the range 0xD800 to 0xDFFF<!--
+    surrogates not allowed; see the comment in the "preprocessing the
+    input stream" section for details --> or is greater than 0x10FFFF,
+    then this is a <a href=#parse-error>parse error</a>. Return a U+FFFD
+    REPLACEMENT CHARACTER.</p>
 
     <p>Otherwise, return a character token for the Unicode character
     whose code point is that number.
@@ -64028,14 +64032,14 @@
     If the number is in the range 0x0001 to 0x0008, <!-- HT, LF
     allowed --> <!-- U+000B is in the next list --> <!-- FF, CR
     allowed --> 0x000E to 0x001F, <!-- ASCII allowed --> 0x007F <!--to
-    0x0084, (0x0085 NEL not allowed), 0x0086--> to 0x009F, 0xD800 to
-    0xDFFF<!-- surrogates not allowed -->, 0xFDD0 to 0xFDEF, or is one
-    of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF,
-    0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE,
-    0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF,
-    0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
-    0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or
-    0x10FFFF, then this is a <a href=#parse-error>parse error</a>.</p>
+    0x0084, (0x0085 NEL not allowed), 0x0086--> to 0x009F, 0xFDD0 to
+    0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF,
+    0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
+    0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF,
+    0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE,
+    0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF,
+    0x10FFFE, or 0x10FFFF, then this is a <a href=#parse-error>parse
+    error</a>.</p>
 
    </dd>
 

Modified: source
===================================================================
--- source	2009-09-16 08:13:04 UTC (rev 3870)
+++ source	2009-09-16 09:22:01 UTC (rev 3871)
@@ -76737,23 +76737,25 @@
   motivated by a desire to increase the resilience of user agents in
   the face of naïve transcoders.</p>
 
-  <p>All U+0000 NULL characters in the input must be replaced by
-  U+FFFD REPLACEMENT CHARACTERs. Any occurrences of such characters is
-  a <span>parse error</span>.</p>
+  <p>All U+0000 NULL characters and characters in the range U+D800 to
+  U+DFFF<!-- surrogates not allowed e.g. in UTF-8, and we don't want
+  them to suddenly turn into codepoints when they go through a UTF-16
+  pipe --> in the input must be replaced by U+FFFD REPLACEMENT
+  CHARACTERs. Any occurrences of such characters is a <span>parse
+  error</span>.</p>
 
   <p>Any occurrences of any characters in the ranges U+0001 to U+0008,
   <!-- HT, LF allowed --> <!-- U+000B is in the next list --> <!-- FF,
   CR allowed --> U+000E to U+001F, <!-- ASCII allowed --> U+007F
-  <!--to U+0084, (U+0085 NEL not allowed), U+0086--> to U+009F, U+D800
-  to U+DFFF<!-- surrogates not allowed -->, U+FDD0 to U+FDEF, and
-  characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE,
-  U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF,
-  U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE,
-  U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF,
-  U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and
-  U+10FFFF are <span title="parse error">parse errors</span>. (These
-  are all control characters or permanently undefined Unicode
-  characters.)</p>
+  <!--to U+0084, (U+0085 NEL not allowed), U+0086--> to U+009F, U+FDD0
+  to U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF,
+  U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE,
+  U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF,
+  U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE,
+  U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF,
+  U+10FFFE, and U+10FFFF are <span title="parse error">parse
+  errors</span>. (These are all control characters or permanently
+  undefined Unicode characters.)</p>
 
   <p>U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
   characters are treated specially. Any CR characters that are
@@ -78857,9 +78859,11 @@
       <tr><td>0x9F <td>U+0178 <td>LATIN CAPITAL LETTER Y WITH DIAERESIS ('&#x0178;')
     </table>
 
-    <p>Otherwise, if the number is greater than 0x10FFFF, then this is
-    a <span>parse error</span>. Return a U+FFFD REPLACEMENT
-    CHARACTER.</p>
+    <p>Otherwise, if the number is in the range 0xD800 to 0xDFFF<!--
+    surrogates not allowed; see the comment in the "preprocessing the
+    input stream" section for details --> or is greater than 0x10FFFF,
+    then this is a <span>parse error</span>. Return a U+FFFD
+    REPLACEMENT CHARACTER.</p>
 
     <p>Otherwise, return a character token for the Unicode character
     whose code point is that number.
@@ -78869,14 +78873,14 @@
     If the number is in the range 0x0001 to 0x0008, <!-- HT, LF
     allowed --> <!-- U+000B is in the next list --> <!-- FF, CR
     allowed --> 0x000E to 0x001F, <!-- ASCII allowed --> 0x007F <!--to
-    0x0084, (0x0085 NEL not allowed), 0x0086--> to 0x009F, 0xD800 to
-    0xDFFF<!-- surrogates not allowed -->, 0xFDD0 to 0xFDEF, or is one
-    of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF,
-    0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE,
-    0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF,
-    0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
-    0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or
-    0x10FFFF, then this is a <span>parse error</span>.</p>
+    0x0084, (0x0085 NEL not allowed), 0x0086--> to 0x009F, 0xFDD0 to
+    0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF,
+    0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
+    0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF,
+    0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE,
+    0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF,
+    0x10FFFE, or 0x10FFFF, then this is a <span>parse
+    error</span>.</p>
 
    </dd>