[whatwg] Bug in "Before DOCTYPE name state"?

Thu Dec 21 15:52:22 PST 2006

On Thu, 21 Dec 2006, Thomas Broyer wrote:
>
> Before DOCTYPE name state:
> http://www.whatwg.org/specs/web-apps/current-work/#before1
> """
> ↪ U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
>     Create a new DOCTYPE token. Set the token's name to the
> uppercase version of the current input character (subtract 0x0020 from
> the character's code point), and mark it as being in error. Switch to
> the DOCTYPE name state.
> """
> 
> DOCTYPE name state
> http://www.whatwg.org/specs/web-apps/current-work/#doctype1
> """
> ↪ U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
>     Append the uppercase version of the current input character
> (subtract 0x0020 from the character's code point) to the current
> DOCTYPE token's name. Stay in the DOCTYPE name state."""
> 
> Why is the DOCTYPE marked "in error" in the former case?

Because otherwise this document:

   <!DOCTYPEH

...would emit a DOCTYPE that is not in error (since the token would be 
emitted before the bit at the end of the DOCTYPE name state).

> In other words, why would <!DOCTYPE html> be "in error" while
> <!DOCTYPE Html> wouldn't?

Both would be not in error, because of the sentence at the end of the 
DOCTYPE name state.

On Thu, 21 Dec 2006, Thomas Broyer wrote:
> >
> > It's not. The "DOCTYPE name state" also has this paragraph: "Then, if 
> > the name of the DOCTYPE token is exactly the four letters "HTML", then 
> > mark the token as being correct. Otherwise, mark it as being in 
> > error."
> 
> But it also has this note, which is quite confusing: "Because lowercase 
> letters in the name are uppercased by the algorithm above, the "HTML" 
> letters are actually case-insensitive relative to the markup."

How is it confusing? I would clarify it, but I don't know what is 
confusing.

> It remains that the tokenization stage is a bit confusing…

Yes. The tree construction stage is even worse. Just implement it exactly 
as written with no interpretation and you should be fine. ;-)

On Thu, 21 Dec 2006, Thomas Broyer wrote:
> 
> So what's the purpose of marking the DOCTYPE "in error" in the "before 
> DOCTYPE name state" when it finds a lowercase 'h' if it's set back to 
> "correct" in "DOCTYPE name state" if it actually was followed by the 
> three letters "tml" (case-insensitively)?

Unexpected EOFs, as noted above.

> So <!doctype html> should not produce a parse error? Or should it?

No parse error.

On Thu, 21 Dec 2006, Thomas Broyer wrote:
> 
> Additional note: as I read this, if the DOCTYPE was previously marked as 
> being "in error", it should then be rolled back to being "correct" if 
> the DOCTYPE name is "HTML": <!DOCTYPEHTML> would *not* be marked "in 
> error".

Correct. The DOCTYPE being "in error" or not just affects whether to use 
strict mode or quirks mode, it doesn't affect anything else, and in 
particular has no bearing on whether the document itself has parse errors 
or not.

> So I'll just code it so that these are "correct":
> <!doctype html>
> <!DOCTYPE HTML>
> and every other lowercase/uppercase variant;
> and these are "in error":
> <!doctypehtml>
> <!DOCTYPEHTML>
> and every other lowercase/uppercase variant.

No, those four are all treated exactly the same as far as token emission 
goes (they all emit a "correct" DOCTYPE token with name "HTML"). However, 
the bottom two do have parse errors.
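
To make the distinction concrete, here is a rough Python sketch of the two 
states discussed in this thread. It is not the spec's algorithm, only an 
illustration under stated assumptions: the names tokenize_doctype, 
DoctypeToken, upper and WS are made up for this example, the whitespace set 
is a guess, the "<!DOCTYPE" keyword is assumed to have been matched 
case-insensitively already, a missing space after the keyword is assumed to 
be the source of the parse error mentioned above, and anything after the 
name is simply not modelled.

```python
from dataclasses import dataclass

# Assumed whitespace set for this sketch (tab, LF, VT, FF, space).
WS = frozenset(" \t\n\x0b\x0c")


@dataclass
class DoctypeToken:
    name: str = ""
    in_error: bool = True


def upper(c):
    # Uppercase a-z by subtracting 0x0020 from the code point.
    return chr(ord(c) - 0x20) if "a" <= c <= "z" else c


def tokenize_doctype(markup):
    """Return (token, parse_error_count) for markup starting with '<!DOCTYPE'."""
    assert markup[:9].upper() == "<!DOCTYPE"
    pos, errors = 9, 0

    def peek(i):
        # None stands for EOF.
        return markup[i] if i < len(markup) else None

    # Assumption: no whitespace after the keyword is a parse error,
    # but tokenization carries on regardless.
    if peek(pos) not in WS:
        errors += 1

    # Before DOCTYPE name state: skip whitespace.
    while peek(pos) in WS:
        pos += 1
    c = peek(pos)
    if c is None or c == ">":
        # Assumption: a DOCTYPE with no name is a parse error and in error.
        return DoctypeToken("", True), errors + 1

    # First character of the name: the new token starts out "in error".
    token = DoctypeToken(upper(c), True)
    pos += 1

    # DOCTYPE name state.
    while True:
        c = peek(pos)
        pos += 1
        if c is None:
            # EOF: the token is emitted before the check at the end of this
            # state runs, so it keeps whatever flag it already had.
            return token, errors + 1
        if c == ">" or c in WS:
            # End of the name; what follows is outside this sketch.
            return token, errors
        token.name += upper(c)
        # The sentence at the end of the DOCTYPE name state.
        token.in_error = token.name != "HTML"
```

With this sketch, tokenize_doctype("<!doctype html>") and 
tokenize_doctype("<!DOCTYPEHTML>") both yield a "correct" token named 
"HTML"; they differ only in the parse error count. 
tokenize_doctype("<!DOCTYPEH") yields a token that is still "in error", 
which is the unexpected-EOF case the initial marking exists for.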

HTH,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'