[whatwg] Parse errors for invalid characters

Sat Sep 7 03:54:47 PDT 2013

On 06/09/2013 04:05, Kang-Hao (Kenny) Lu wrote:
> (2013/09/06 6:08), Geoffrey Sneddon wrote:
>> The phrasing content section states:
>>
>>> Text nodes and attribute values must consist of Unicode characters,
>>> must not contain U+0000 characters, must not contain permanently
>>> undefined Unicode characters (noncharacters), and must not contain
>>> control characters other than space characters. This specification
>>> includes extra constraints on the exact value of Text nodes and
>>> attribute values depending on their precise context.
>>
>> And the "Preprocessing the input stream" section states:
>>
>>> Any occurrences of any characters in the ranges U+0001 to U+0008,
>>> U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters
>>> U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE,
>>> U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF,
>>> U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE,
>>> U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF,
>>> U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse
>>> errors. These are all control characters or permanently undefined
>>> Unicode characters (noncharacters).
>>
>> Note the first uses "Unicode characters", the second "characters" — the
>> former excludes surrogates as a conformance requirement.
>>
>> Note that every disallowed non-surrogate character is a parse error.
>
> Except U+0000 or am I missing something?

U+0000 is handled inline in the parser, as noted in the preprocessing
section. It sometimes gets passed through as U+0000, sometimes gets
changed to U+FFFD, and sometimes gets ignored, but it always causes a
parse error.
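
For illustration, the quoted list of preprocessing parse errors can be
sketched as a small check. This is only a sketch under my own
assumptions: the function names are hypothetical, it works on an
already-decoded Python string, and it deliberately leaves out U+0000
(handled inline in the parser, as above) and surrogates (not in the
quoted list).

```python
def is_preprocessing_parse_error(cp):
    """Return True if code point cp is in the quoted list of parse errors."""
    # Control characters named in the quoted text. U+0009, U+000A, U+000C
    # and U+000D are absent from the list, and U+0000 is handled separately.
    if 0x0001 <= cp <= 0x0008 or cp == 0x000B:
        return True
    if 0x000E <= cp <= 0x001F or 0x007F <= cp <= 0x009F:
        return True
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each
    # plane (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF).
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    return (cp & 0xFFFE) == 0xFFFE


def find_parse_errors(text):
    """List (index, 'U+XXXX') for each offending character in text."""
    return [
        (i, "U+%04X" % ord(ch))
        for i, ch in enumerate(text)
        if is_preprocessing_parse_error(ord(ch))
    ]


print(find_parse_errors("a\x01b\ufffe"))  # [(1, 'U+0001'), (3, 'U+FFFE')]
```

The mask test relies on every plane ending in xFFFE and xFFFF, which
matches the enumerated code points from U+FFFE up to U+10FFFF.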

>> Therefore, it would make sense to make surrogates parse errors.
>>
>> It should be noted that they can only occur in the input stream if they
>> come from script (as they cannot be decoded from the input byte stream
>> as the decoders will never emit a surrogate).
>
> which means that this seems ... cumbersome ... to implement in a
> conformance checker. Which reminds me, does
>
>     # Conformance checkers must report at least one parse error
>     # condition to the user if one or more parse error conditions exist
>     # in the document and must not report parse error conditions if none
>     # exist in the document. Conformance checkers may report more than
>     # one parse error condition if more than one parse error condition
>     # exists in the document.
>
> mean validator.nu and Firefox view source are non-conforming because
> they do nothing about document.write() ?
>
> I think we should exempt conformance checkers from scripts instead.

They already are. From the "Conformance classes" section:

> Conformance checkers must check that the input document conforms when
> parsed without a browsing context (meaning that no scripts are run,
> and that the parser's scripting flag is disabled), and should also
> check that the input document conforms when parsed with a browsing
> context in which scripts execute, and that the scripts never cause
> non-conforming states to occur other than transiently during script
> execution itself. (This is only a "SHOULD" and not a "MUST"
> requirement because it has been proven to be impossible. [COMPUTABLE])

(I feel like being pedantic and pointing out that this is untrue: it has
not been proven impossible to do, it has been proven impossible to do in
general. It wouldn't be that hard to design a conformance checker to
check "<html><script>document.write("<p>")</script>".)

On the other hand, a JS console can reasonably report parse errors from 
script, so the parse errors are still worthwhile to have.
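
To make the surrogate point concrete, here is a rough analogy in Python
rather than in a browser. It assumes Python's UTF-8 codec as a stand-in
for the input byte stream decoder and a string literal as a stand-in
for text written by script; the helper name is hypothetical. The
decoder either rejects or replaces the bytes for a lone surrogate,
while a string built in code can carry one.

```python
def has_surrogate(text):
    """Return True if text contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


# Bytes that would encode the lone surrogate U+D800 if UTF-8 allowed it.
raw = b"\xed\xa0\x80"

# A strict decoder refuses the sequence outright.
try:
    raw.decode("utf-8")
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True

# A replacing decoder emits U+FFFD instead, never a surrogate.
replaced = raw.decode("utf-8", errors="replace")

# A string built in code (the analogue of document.write) can hold one.
from_script = "<p>\ud800</p>"

print(strict_rejected, has_surrogate(replaced), has_surrogate(from_script))
# True False True
```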

/Geoffrey.