[whatwg] Should ambiguous ampersand be a parse error?

Tue Dec 10 09:45:16 PST 2013

On 12/10/13 11:11 AM, Peter Cashin wrote:
> The HTML5 spec says that an ambiguous ampersand (e.g. &something; undefined) is not allowed in element content

Right, that's an authoring requirement.

> and in section on HTML parsing, that this should throw a parse error.

There is no throwing of parse errors in the HTML spec.

I assume you're looking at the "anything else" case of 
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#consume-a-character-reference 
?  This says, for the case you're looking at:

   If no match can be made, then no characters are consumed, and nothing
   is returned. In this case, if the characters after the U+0026
   AMPERSAND character (&) consist of a sequence of one or more
   alphanumeric ASCII characters followed by a U+003B SEMICOLON
   character (;), then this is a parse error.

And if you follow the link to "parse error", it goes to 
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#parse-error 
which basically says that validators need to report parse errors and that 
UAs are allowed (but not required) to stop parsing at that point if they 
really want.  If they do NOT want to abort on the error (which is the 
common case, btw), the spec defines how they press on.

And the way they press on is by returning nothing from the "consume a 
character reference" algorithm.  What that does depends on the caller, 
but in the case you're talking about that's presumably 
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#character-reference-in-data-state 
and what it will do if nothing is returned is emit the '&' and move on 
to the next character.  So it basically treats the '&' as not special in 
any way in this case, leading to the behavior you observe in browsers.
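
To make that flow concrete, here is a rough Python sketch of the two steps 
described above.  It is not the spec algorithm: the function names, the 
three-entry NAMED_REFERENCES table and the errors list are made up for 
illustration, and numeric references, semicolon-less names and 
longest-match rules are all left out.

```python
import re

# Tiny, illustrative subset of the named character reference table
# (an assumption for this sketch; the real table is much larger).
NAMED_REFERENCES = {"amp;": "&", "lt;": "<", "gt;": ">"}

# One or more ASCII alphanumerics followed by a semicolon.
AMBIGUOUS = re.compile(r"[0-9A-Za-z]+;")


def consume_character_reference(text, pos, errors):
    """Simplified "anything else" case; pos is the index just after '&'.

    Returns (replacement, new_pos), or None if nothing is consumed.
    """
    for name, char in NAMED_REFERENCES.items():
        if text.startswith(name, pos):
            return char, pos + len(name)
    # No match: consume nothing, but record a parse error if the input
    # looks like a reference (alphanumerics followed by ';').
    if AMBIGUOUS.match(text, pos):
        errors.append(pos - 1)  # offset of the '&'
    return None


def tokenize_data(text):
    """Return (output_text, parse_error_offsets) for element content."""
    out, errors, i = [], [], 0
    while i < len(text):
        if text[i] == "&":
            result = consume_character_reference(text, i + 1, errors)
            if result is None:
                out.append("&")  # nothing returned: emit '&' and move on
                i += 1
            else:
                char, i = result
                out.append(char)
        else:
            out.append(text[i])
            i += 1
    return "".join(out), errors
```

With this sketch, tokenize_data("a &amp; b &something; c") gives back the 
text "a & b &something; c" plus one recorded parse error at offset 10: the 
error is noted, and parsing carries on with the '&' treated as plain text.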

> Is the specification intended to have compliant HTML agents stop parsing ambiguous ampersands?

Compliant HTML agents are allowed to do so, I guess, per the technical 
rules about parse errors, just like for any other parse error.  But I 
expect that this is at least partly for conformance classes other than 
"browsers"; all browsers press on through parse errors in HTML.  Maybe 
the allowed behavior for parse errors should be made conditional on 
conformance class...

-Boris