[whatwg] Parse errors for invalid characters
Kang-Hao (Kenny) Lu
kanghaol at oupeng.com
Thu Sep 5 20:05:56 PDT 2013
(2013/09/06 6:08), Geoffrey Sneddon wrote:
> The phrasing content section states:
>
>> Text nodes and attribute values must consist of Unicode characters,
>> must not contain U+0000 characters, must not contain permanently
>> undefined Unicode characters (noncharacters), and must not contain
>> control characters other than space characters. This specification
>> includes extra constraints on the exact value of Text nodes and
>> attribute values depending on their precise context.
>
> And the pre-processing the input-stream section states:
>
>> Any occurrences of any characters in the ranges U+0001 to U+0008,
>> U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters
>> U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE,
>> U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF,
>> U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE,
>> U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF,
>> U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse
>> errors. These are all control characters or permanently undefined
>> Unicode characters (noncharacters).
>
> Note the first uses "Unicode characters", the second "characters" — the
> former excludes surrogates as a conformance requirement.
>
> Note that every disallowed non-surrogate character is a parse error.
Except U+0000, or am I missing something?
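
To make the comparison concrete, here is a minimal sketch in Python of the pre-processing list exactly as quoted above. The function names are hypothetical (they are not from the spec), and the sketch covers only the quoted paragraph, not anything the tokenizer may do later. Read this way, U+0000 falls outside the list, and so do surrogates.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of each plane."""
    # (cp & 0xFFFE) == 0xFFFE matches U+xFFFE and U+xFFFF in planes 0 to 16.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_preprocessing_parse_error(cp: int) -> bool:
    """Hypothetical helper: mirrors only the ranges in the quoted paragraph."""
    return (
        0x0001 <= cp <= 0x0008
        or cp == 0x000B
        or 0x000E <= cp <= 0x001F
        or 0x007F <= cp <= 0x009F
        or is_noncharacter(cp)
    )


if __name__ == "__main__":
    for cp in (0x0000, 0x000B, 0x000C, 0xD800, 0xFDD0, 0x10FFFF):
        print(f"U+{cp:04X}", is_preprocessing_parse_error(cp))
```
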
> Therefore, it would make sense to make surrogates parse errors.
>
> It should be noted that they can only occur in the input stream if they
> come from script (as they cannot be decoded from the input byte stream
> as the decoders will never emit a surrogate).
which means that this seems ... cumbersome ... to implement in a
conformance checker. Which reminds me, does
# Conformance checkers must report at least one parse error
# condition to the user if one or more parse error conditions exist
# in the document and must not report parse error conditions if none
# exist in the document. Conformance checkers may report more than
# one parse error condition if more than one parse error condition
# exists in the document.
mean validator.nu and Firefox view source are non-conforming because
they do nothing about document.write()?
I think we should exempt conformance checkers from scripts instead.
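
A rough illustration of why a checker that only sees bytes cannot observe this case. The sketch uses Python's UTF-8 decoder as a stand-in for the decoders Geoffrey mentions, which is an assumption on my part; `contains_surrogate` and the sample strings are hypothetical.

```python
def contains_surrogate(text: str) -> bool:
    """True if any code point in text lies in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


# Bytes that would encode U+D800 if UTF-8 allowed surrogates.
raw = b"<p>\xed\xa0\x80</p>"

# Decoding with replacement yields U+FFFD, never a surrogate.
decoded = raw.decode("utf-8", errors="replace")

# A string built in memory, standing in for text written by a script,
# can still carry a lone surrogate.
from_script = "<p>\ud800</p>"

if __name__ == "__main__":
    print(contains_surrogate(decoded))      # decoder output
    print(contains_surrogate(from_script))  # script-originated text
```

A checker fed only `raw` would have nothing to flag as a surrogate; only something that runs the script would ever see `from_script`.
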
Cheers,
Kenny
--
Web Specialist, Opera Sphinx Game Force, Oupeng Browser, Beijing
Try Oupeng: http://www.oupeng.com/