[html5] r3871 - [ct] (2) Make surrogates in UTF-8 and character references turn into U+FFFD to p [...]
whatwg at whatwg.org
Wed Sep 16 02:22:02 PDT 2009
Author: ianh
Date: 2009-09-16 02:22:01 -0700 (Wed, 16 Sep 2009)
New Revision: 3871
Modified:
index
source
Log:
[ct] (2) Make surrogates in UTF-8 and character references turn into U+FFFD to prevent UTF-16 environments having hard-to-handle bugs.
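As a rough illustration of the input-stream rule this revision introduces, the sketch below replaces U+0000 and every code point in U+D800 to U+DFFF with U+FFFD and counts one parse error per replacement. The function name preprocess and the (text, error count) return shape are assumptions made for this example, not anything defined by the spec, and the sketch ignores the other characters that are parse errors without being replaced.

```python
from typing import Tuple

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def preprocess(text: str) -> Tuple[str, int]:
    """Hypothetical helper: apply only the replacement rule from r3871.

    Returns the rewritten text and the number of parse errors raised
    by the replacements (other parse-error characters are not checked).
    """
    out = []
    errors = 0
    for ch in text:
        cp = ord(ch)
        # U+0000 NULL and lone surrogates are replaced, never passed through.
        if cp == 0x0000 or 0xD800 <= cp <= 0xDFFF:
            out.append(REPLACEMENT)
            errors += 1
        else:
            out.append(ch)
    return "".join(out), errors
```

Replacing surrogates at this stage means a later UTF-16 pipe never sees two lone surrogates that happen to sit next to each other and combine into an unintended code point.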
Modified: index
===================================================================
--- index 2009-09-16 08:13:04 UTC (rev 3870)
+++ index 2009-09-16 09:22:01 UTC (rev 3871)
@@ -62159,23 +62159,25 @@
motivated by a desire to increase the resilience of user agents in
the face of naïve transcoders.</p>
- <p>All U+0000 NULL characters in the input must be replaced by
- U+FFFD REPLACEMENT CHARACTERs. Any occurrences of such characters is
- a <a href=#parse-error>parse error</a>.</p>
+ <p>All U+0000 NULL characters and characters in the range U+D800 to
+ U+DFFF<!-- surrogates not allowed e.g. in UTF-8, and we don't want
+ them to suddenly turn into codepoints when they go through a UTF-16
+ pipe --> in the input must be replaced by U+FFFD REPLACEMENT
+ CHARACTERs. Any occurrences of such characters is a <a href=#parse-error>parse
+ error</a>.</p>
<p>Any occurrences of any characters in the ranges U+0001 to U+0008,
<!-- HT, LF allowed --> <!-- U+000B is in the next list --> <!-- FF,
CR allowed --> U+000E to U+001F, <!-- ASCII allowed --> U+007F
- <!--to U+0084, (U+0085 NEL not allowed), U+0086--> to U+009F, U+D800
- to U+DFFF<!-- surrogates not allowed -->, U+FDD0 to U+FDEF, and
- characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE,
- U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF,
- U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE,
- U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF,
- U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and
- U+10FFFF are <a href=#parse-error title="parse error">parse errors</a>. (These
- are all control characters or permanently undefined Unicode
- characters.)</p>
+ <!--to U+0084, (U+0085 NEL not allowed), U+0086--> to U+009F, U+FDD0
+ to U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF,
+ U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE,
+ U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF,
+ U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE,
+ U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF,
+ U+10FFFE, and U+10FFFF are <a href=#parse-error title="parse error">parse
+ errors</a>. (These are all control characters or permanently
+ undefined Unicode characters.)</p>
<p>U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
characters are treated specially. Any CR characters that are
@@ -64016,9 +64018,11 @@
<tr><td>0x9D <td>U+009D <td>&lt;control&gt;
<tr><td>0x9E <td>U+017E <td>LATIN SMALL LETTER Z WITH CARON ('ž')
<tr><td>0x9F <td>U+0178 <td>LATIN CAPITAL LETTER Y WITH DIAERESIS ('Ÿ')
- </table><p>Otherwise, if the number is greater than 0x10FFFF, then this is
- a <a href=#parse-error>parse error</a>. Return a U+FFFD REPLACEMENT
- CHARACTER.</p>
+ </table><p>Otherwise, if the number is in the range 0xD800 to 0xDFFF<!--
+ surrogates not allowed; see the comment in the "preprocessing the
+ input stream" section for details --> or is greater than 0x10FFFF,
+ then this is a <a href=#parse-error>parse error</a>. Return a U+FFFD
+ REPLACEMENT CHARACTER.</p>
<p>Otherwise, return a character token for the Unicode character
whose code point is that number.
@@ -64028,14 +64032,14 @@
If the number is in the range 0x0001 to 0x0008, <!-- HT, LF
allowed --> <!-- U+000B is in the next list --> <!-- FF, CR
allowed --> 0x000E to 0x001F, <!-- ASCII allowed --> 0x007F <!--to
- 0x0084, (0x0085 NEL not allowed), 0x0086--> to 0x009F, 0xD800 to
- 0xDFFF<!-- surrogates not allowed -->, 0xFDD0 to 0xFDEF, or is one
- of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF,
- 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE,
- 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF,
- 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
- 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or
- 0x10FFFF, then this is a <a href=#parse-error>parse error</a>.</p>
+ 0x0084, (0x0085 NEL not allowed), 0x0086--> to 0x009F, 0xFDD0 to
+ 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF,
+ 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
+ 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF,
+ 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE,
+ 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF,
+ 0x10FFFE, or 0x10FFFF, then this is a <a href=#parse-error>parse
+ error</a>.</p>
</dd>
Modified: source
===================================================================
--- source 2009-09-16 08:13:04 UTC (rev 3870)
+++ source 2009-09-16 09:22:01 UTC (rev 3871)
@@ -76737,23 +76737,25 @@
motivated by a desire to increase the resilience of user agents in
the face of naïve transcoders.</p>
- <p>All U+0000 NULL characters in the input must be replaced by
- U+FFFD REPLACEMENT CHARACTERs. Any occurrences of such characters is
- a <span>parse error</span>.</p>
+ <p>All U+0000 NULL characters and characters in the range U+D800 to
+ U+DFFF<!-- surrogates not allowed e.g. in UTF-8, and we don't want
+ them to suddenly turn into codepoints when they go through a UTF-16
+ pipe --> in the input must be replaced by U+FFFD REPLACEMENT
+ CHARACTERs. Any occurrences of such characters is a <span>parse
+ error</span>.</p>
<p>Any occurrences of any characters in the ranges U+0001 to U+0008,
<!-- HT, LF allowed --> <!-- U+000B is in the next list --> <!-- FF,
CR allowed --> U+000E to U+001F, <!-- ASCII allowed --> U+007F
- <!--to U+0084, (U+0085 NEL not allowed), U+0086--> to U+009F, U+D800
- to U+DFFF<!-- surrogates not allowed -->, U+FDD0 to U+FDEF, and
- characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE,
- U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF,
- U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE,
- U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF,
- U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and
- U+10FFFF are <span title="parse error">parse errors</span>. (These
- are all control characters or permanently undefined Unicode
- characters.)</p>
+ <!--to U+0084, (U+0085 NEL not allowed), U+0086--> to U+009F, U+FDD0
+ to U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF,
+ U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE,
+ U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF,
+ U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE,
+ U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF,
+ U+10FFFE, and U+10FFFF are <span title="parse error">parse
+ errors</span>. (These are all control characters or permanently
+ undefined Unicode characters.)</p>
<p>U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
characters are treated specially. Any CR characters that are
@@ -78857,9 +78859,11 @@
<tr><td>0x9F <td>U+0178 <td>LATIN CAPITAL LETTER Y WITH DIAERESIS ('Ÿ')
</table>
- <p>Otherwise, if the number is greater than 0x10FFFF, then this is
- a <span>parse error</span>. Return a U+FFFD REPLACEMENT
- CHARACTER.</p>
+ <p>Otherwise, if the number is in the range 0xD800 to 0xDFFF<!--
+ surrogates not allowed; see the comment in the "preprocessing the
+ input stream" section for details --> or is greater than 0x10FFFF,
+ then this is a <span>parse error</span>. Return a U+FFFD
+ REPLACEMENT CHARACTER.</p>
<p>Otherwise, return a character token for the Unicode character
whose code point is that number.
@@ -78869,14 +78873,14 @@
If the number is in the range 0x0001 to 0x0008, <!-- HT, LF
allowed --> <!-- U+000B is in the next list --> <!-- FF, CR
allowed --> 0x000E to 0x001F, <!-- ASCII allowed --> 0x007F <!--to
- 0x0084, (0x0085 NEL not allowed), 0x0086--> to 0x009F, 0xD800 to
- 0xDFFF<!-- surrogates not allowed -->, 0xFDD0 to 0xFDEF, or is one
- of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF,
- 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE,
- 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF,
- 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
- 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or
- 0x10FFFF, then this is a <span>parse error</span>.</p>
+ 0x0084, (0x0085 NEL not allowed), 0x0086--> to 0x009F, 0xFDD0 to
+ 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF,
+ 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
+ 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF,
+ 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE,
+ 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF,
+ 0x10FFFE, or 0x10FFFF, then this is a <span>parse
+ error</span>.</p>
</dd>
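The numeric character reference change in the last two hunks can be sketched the same way. This is a minimal sketch under stated assumptions: the name resolve_numeric_reference and its (character, parse error) return value are hypothetical, and the table lookup that precedes these steps in the spec (the 0x80 to 0x9F rows shown as context above) is assumed to have been handled by the caller.

```python
from typing import Tuple


def is_flagged(number: int) -> bool:
    """True for code points that are returned as-is but are parse errors."""
    return (
        0x0001 <= number <= 0x0008
        or number == 0x000B
        or 0x000E <= number <= 0x001F
        or 0x007F <= number <= 0x009F
        or 0xFDD0 <= number <= 0xFDEF
        # U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF
        or (number & 0xFFFE) == 0xFFFE
    )


def resolve_numeric_reference(number: int) -> Tuple[str, bool]:
    """Hypothetical helper: return (character, parse_error) for a reference.

    Assumes the caller has already applied the replacement table.
    """
    # New in r3871: surrogates now yield U+FFFD, like out-of-range numbers.
    if 0xD800 <= number <= 0xDFFF or number > 0x10FFFF:
        return "\ufffd", True
    return chr(number), is_flagged(number)
```

Before this revision a reference such as &#xD800; fell into the flagged list, so it was a parse error but still produced the surrogate itself; it now takes the U+FFFD branch.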