[whatwg] Parse errors for invalid characters

Ian Hickson ian at hixie.ch
Fri Sep 13 14:18:55 PDT 2013


On Thu, 5 Sep 2013, Geoffrey Sneddon wrote:
>
> The phrasing content section states:
> 
> > Text nodes and attribute values must consist of Unicode characters, 
> > must not contain U+0000 characters, must not contain permanently 
> > undefined Unicode characters (noncharacters), and must not contain 
> > control characters other than space characters.
> 
> And the "Preprocessing the input stream" section states:
> 
> > Any occurrences of any characters in the ranges U+0001 to U+0008, 
> > U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters 
> > U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, 
> > U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, 
> > U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, 
> > U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, 
> > U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse 
> > errors. These are all control characters or permanently undefined 
> > Unicode characters (noncharacters).
> 
> Note that the first uses "Unicode characters" and the second "characters"; 
> the former excludes surrogates as a conformance requirement.
> 
> Note that every disallowed non-surrogate character is a parse error.
>
> Therefore, it would make sense to make surrogates parse errors.

Done.
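
For illustration, the resulting check could be sketched roughly as follows. 
This is a minimal sketch in Python, not the spec's own algorithm: the 
function name is_parse_error_code_point is hypothetical, the ranges are 
copied from the preprocessing text quoted above, and surrogates (U+D800 to 
U+DFFF) are added on the assumption that they are treated the same way.

```python
def is_parse_error_code_point(cp: int) -> bool:
    """Return True if the code point would be flagged as a parse error.

    Hypothetical helper: the ranges mirror the quoted preprocessing text,
    with surrogates added as discussed in this thread.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")

    # Control characters from the quoted list. U+0000 is not in that list,
    # so it is deliberately not covered here.
    if 0x0001 <= cp <= 0x0008 or cp == 0x000B or 0x000E <= cp <= 0x001F:
        return True
    if 0x007F <= cp <= 0x009F:
        return True

    # Noncharacters: U+FDD0 to U+FDEF, plus the last two code points of
    # every plane (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF).
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return True

    # Surrogates, the addition proposed above.
    return 0xD800 <= cp <= 0xDFFF
```

With this sketch, U+D800 and U+10FFFF are reported as parse errors, while 
U+0009, U+000A, U+0041 and U+FFFD are not.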


> It should be noted that they can only occur in the input stream if they 
> come from script (as they cannot be decoded from the input byte stream 
> as the decoders will never emit a surrogate).

Done.
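
As a rough illustration of that point, here is a sketch that uses Python's 
built-in UTF-8 decoder as a stand-in for the decoders the spec relies on. 
That substitution is an assumption: the spec's decoders are defined 
elsewhere and may differ in detail, for example in how many U+FFFD 
characters they emit. The helper name has_surrogate is hypothetical. 
Building a string directly from a code point is only a loose analogy for 
script inserting a lone surrogate into the input stream.

```python
def has_surrogate(text: str) -> bool:
    """Return True if the string contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


# 0xED 0xA0 0x80 is what U+D800 would look like if UTF-8 permitted it.
raw = b"a\xed\xa0\x80b"

# The decoder refuses to produce a surrogate; in "replace" mode the
# offending bytes become U+FFFD instead.
decoded = raw.decode("utf-8", errors="replace")

# By contrast, code can construct a lone surrogate directly.
from_script = "a" + chr(0xD800) + "b"

print(has_surrogate(decoded))
print(has_surrogate(from_script))
```

The decoded string contains U+FFFD but no surrogate, whereas the string 
built directly from the code point does contain one.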

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

