[whatwg] Parse errors for invalid characters
ian at hixie.ch
Fri Sep 13 14:18:55 PDT 2013
On Thu, 5 Sep 2013, Geoffrey Sneddon wrote:
> The phrasing content section states:
> > Text nodes and attribute values must consist of Unicode characters,
> > must not contain U+0000 characters, must not contain permanently
> > undefined Unicode characters (noncharacters), and must not contain
> > control characters other than space characters.
> And the "Preprocessing the input stream" section states:
> > Any occurrences of any characters in the ranges U+0001 to U+0008,
> > U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters
> > U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE,
> > U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF,
> > U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE,
> > U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF,
> > U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse
> > errors. These are all control characters or permanently undefined
> > Unicode characters (noncharacters).
> Note that the first uses "Unicode characters" and the second just
> "characters"; the former excludes surrogates as a conformance requirement.
> Note that every disallowed non-surrogate character is a parse error.
> Therefore, it would make sense to make surrogates parse errors.
> It should be noted that they can only occur in the input stream if they
> come from script (as they cannot be decoded from the input byte stream
> as the decoders will never emit a surrogate).
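To make the gap described above concrete, here is a rough sketch in Python of the quoted parse-error list as a predicate. The function names are hypothetical and nothing here is taken from the spec beyond the code point ranges quoted above. It assumes the U+nFFFE/U+nFFFF entries cover planes 0 through 16, as the quoted list does, and it uses Python's strict UTF-8 decoder as a stand-in for "the decoders", which is only one example of a decoder that never emits a surrogate.

```python
# Ranges copied from the quoted "Preprocessing the input stream" text.
# All names below are hypothetical helpers for illustration only.
_PARSE_ERROR_RANGES = (
    (0x0001, 0x0008),
    (0x000E, 0x001F),
    (0x007F, 0x009F),
    (0xFDD0, 0xFDEF),
)


def is_plane_end_noncharacter(cp: int) -> bool:
    """True for U+nFFFE and U+nFFFF in planes 0 through 16."""
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def is_listed_parse_error(cp: int) -> bool:
    """True if cp appears in the quoted parse-error list."""
    if cp == 0x000B:
        return True
    if any(lo <= cp <= hi for lo, hi in _PARSE_ERROR_RANGES):
        return True
    return is_plane_end_noncharacter(cp)


def is_surrogate(cp: int) -> bool:
    """True for code points in the surrogate range U+D800 to U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def utf8_decoder_rejects(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder refuses the byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False
```

With these definitions, is_listed_parse_error(0xD800) is false even though is_surrogate(0xD800) is true, which is the inconsistency being pointed out. utf8_decoder_rejects(b"\xed\xa0\x80") is true, since those bytes would encode U+D800 and a strict decoder refuses them, so a surrogate could only reach the input stream via script.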
-- 
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.