[whatwg] html5 parsing/tokenizing

Wed Jun 20 00:38:31 PDT 2007

> When the tokenization state machine is defined, every state first
> "consumes" and then potentially "emits". Some of the states transfer to
> another state with an order to "re-consume the character in the next
> state". This means that what you do in the new state is dependent on
> what you did in the last state and that the "consume" is necessarily an
> inconsistent operation. A much better wording would be "look at the next
> character" and on state transition "consume and emit" or just "emit
> without consumption" making it clear when the input cursor moves.

I did the same in Twintsam with PeekChar/PeekChars and EatChar/EatChars methods.
http://twintsam.googlecode.com/svn/trunk/Twintsam/Html/HtmlReader.StreamHandling.cs
(beware, Twintsam hasn't been updated since January so it's not in
sync with the spec as it is now)
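
A minimal sketch of the peek/eat idea, in Python rather than Twintsam's
C#. The class and method names (PeekableInput, peek_char, eat_char) are
made up for illustration and are not Twintsam's actual API:

```python
class PeekableInput:
    """Hypothetical input cursor with separate peek and eat operations."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek_char(self):
        # Look at the next character without moving the cursor.
        return self.text[self.pos] if self.pos < len(self.text) else None

    def eat_char(self):
        # Return the next character and advance the cursor past it.
        ch = self.peek_char()
        if ch is not None:
            self.pos += 1
        return ch


stream = PeekableInput("<a>")
first = stream.peek_char()   # cursor does not move
same = stream.eat_char()     # cursor moves past the first character
after = stream.peek_char()   # now looking at the second character
```

With this split, a state decides what to do from peek_char() alone and
only calls eat_char() when it really wants the cursor to move.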

Though actually you could just use a character queue into which you
push back characters that need to be "re-consumed" (i.e. you
"un-read" the character and then you switch to the other state).
This is what html5lib does:
http://html5lib.googlecode.com/svn/trunk/python/src/tokenizer.py
(search for self.stream.queue; this needs to be refactored with an
unread() method on the HTMLInputStream)
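
Roughly, the push-back approach could look like the sketch below. This
is a toy with three made-up states, not the spec's state machine and not
html5lib's actual code; PushbackInput, unread() and tokenize() are
hypothetical names:

```python
from collections import deque


class PushbackInput:
    """Hypothetical input stream with a small push-back queue."""

    def __init__(self, text):
        self._chars = iter(text)
        self._queue = deque()

    def read(self):
        # Characters that were pushed back are served before new input.
        if self._queue:
            return self._queue.popleft()
        return next(self._chars, None)

    def unread(self, ch):
        # "Un-read" a character so the next read() returns it again.
        if ch is not None:
            self._queue.appendleft(ch)


def tokenize(text):
    # Toy tokenizer: every state consumes first, then possibly emits.
    stream = PushbackInput(text)
    tokens = []
    state = "data"
    name = ""
    while True:
        ch = stream.read()
        if state == "data":
            if ch is None:
                break
            if ch == "<":
                state = "tag_open"
            else:
                tokens.append(("char", ch))
        elif state == "tag_open":
            if ch is not None and ch.isalpha():
                name = ch
                state = "tag_name"
            else:
                # Not a tag after all: emit "<" as text, then
                # re-consume the current character in the data state.
                tokens.append(("char", "<"))
                stream.unread(ch)
                state = "data"
        elif state == "tag_name":
            if ch == ">" or ch is None:
                tokens.append(("tag", name))
                state = "data"
                if ch is None:
                    break
            else:
                name += ch
    return tokens


result = tokenize("a<1<b>")
```

The "re-consume" in the tag_open branch is just an unread() followed by
a state switch, so "consume" stays a plain read() in every state.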

That is to say, I don't think the spec should be changed at all. It's
just a matter of how you implement it. You just have to know that the
"queue" won't ever be larger than 9 characters as there are tweaks for
0-prefixed numeric entities and/or numeric entities greater than 1114111.

> It would be nice if all <!...> tags (except comments) were considered
> "declarations" instead of bogus comments. Then DOCTYPE wouldn't need
> special handling by the tokenizer, just special handling by the parser.
> (Too much of the parser seems to have gotten into the tokenizer; with
> CDATA and RCDATA, this is a necessary evil. With <!DOCTYPE ...> it
> isn't.)

I can't see the problem here; plus DOCTYPE parsing is special because
we need the DOCTYPE name.
Moreover, the spec has changed recently so that DOCTYPE parsing takes
care of PUBLIC and SYSTEM identifiers.

-- 
Thomas Broyer