[whatwg] html5 parsing/tokenizing

Tue Jun 19 16:20:11 PDT 2007

I have a friend who has implemented a fast tokenizer in C.  I asked
him to send me any feedback he might have, and so what follows are his
words.  This is from about a month ago, so I apologize if any of this
is old ground.

-Ben

-------------
As the tokenization state machine is defined, every state first
"consumes" a character and then potentially "emits" a token. Some of the
states switch to another state with an instruction to "re-consume the
character in the next state". This means that what you do in the new
state is dependent on what you did in the last state, and that "consume"
is necessarily an inconsistent operation. A much better wording would be
"look at the next character" and, on state transition, "consume and emit"
or just "emit without consumption", making it clear when the input cursor
moves.
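
To make the suggestion concrete, here is a minimal sketch in C of a
tokenizer loop where every state only looks at the next character and the
cursor moves only through an explicit advance() call. All names (Cursor,
peek, advance, the ST_* states, tokenize) are hypothetical, and the state
set is a tiny subset chosen for illustration, not the spec's state
machine. The output is one letter per emitted token: 'c' for a character
token, 'S' for a start tag.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types and names; not taken from the spec or a real parser. */
typedef struct {
    const char *src;  /* NUL-terminated input */
    size_t pos;       /* input cursor; moves only in advance() */
} Cursor;

typedef enum { ST_DATA, ST_TAG_OPEN, ST_TAG_NAME } State;

/* Look at the next character without moving the cursor. */
static int peek(const Cursor *c) { return (unsigned char)c->src[c->pos]; }

/* Move the cursor by one character; never past the terminating NUL. */
static void advance(Cursor *c)
{
    assert(c->src[c->pos] != '\0');
    c->pos++;
}

/* Writes 'c' per character token and 'S' per start tag into out.
   Returns the number of tokens written (out is NUL-terminated). */
static size_t tokenize(const char *input, char *out, size_t cap)
{
    Cursor c = { input, 0 };
    State st = ST_DATA;
    size_t n = 0;
    int done = 0;

    while (!done && n + 1 < cap) {
        int ch = peek(&c);
        switch (st) {
        case ST_DATA:
            if (ch == '\0') {
                done = 1;
            } else if (ch == '<') {
                advance(&c);            /* consume, emit nothing yet */
                st = ST_TAG_OPEN;
            } else {
                advance(&c);            /* consume and emit */
                out[n++] = 'c';
            }
            break;
        case ST_TAG_OPEN:
            if (isalpha(ch)) {
                st = ST_TAG_NAME;       /* no advance: same char seen there */
            } else {
                out[n++] = 'c';         /* emit '<' as text, no consumption */
                st = ST_DATA;
            }
            break;
        case ST_TAG_NAME:
            if (ch == '\0') {
                done = 1;               /* end of input inside a tag: drop it */
            } else {
                advance(&c);
                if (ch == '>') {
                    out[n++] = 'S';
                    st = ST_DATA;
                }
            }
            break;
        }
    }
    out[n] = '\0';
    return n;
}
```

In this form the "re-consume" case is just a transition that does not
call advance(): the stray '<' in "a<b>c< d" is emitted as text and the
space after it is then seen, unchanged, by the data state.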

It would be nice if all <!...> tags (except comments) were considered
"declarations" instead of bogus comments. Then DOCTYPE wouldn't need
special handling by the tokenizer, just special handling by the parser.
(Too much of the parser seems to have gotten into the tokenizer; with
CDATA and RCDATA, this is a necessary evil. With <!DOCTYPE ...> it
isn't.)
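
A rough sketch of that split, again in C and with hypothetical names
(Token, TOK_DECLARATION, classify_markup_declaration, is_doctype): the
tokenizer only distinguishes comments from other <!...> constructs, and
the parser decides whether a declaration is a DOCTYPE. It assumes the
tokenizer has already isolated the text between "<!" and ">".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token model; not the spec's token types. */
typedef enum { TOK_DECLARATION, TOK_COMMENT } TokenType;

typedef struct {
    TokenType type;
    const char *text;  /* raw text between "<!" and ">" */
    size_t len;
} Token;

/* Tokenizer side: anything starting with "--" is a comment,
   everything else is a generic declaration. */
static Token classify_markup_declaration(const char *body, size_t len)
{
    Token t;
    assert(body != NULL);
    t.type = (len >= 2 && body[0] == '-' && body[1] == '-')
                 ? TOK_COMMENT : TOK_DECLARATION;
    t.text = body;
    t.len = len;
    return t;
}

/* Parser side: a declaration whose first word is "doctype"
   (ASCII case-insensitive) gets the special handling. */
static int is_doctype(const Token *t)
{
    static const char kw[] = "doctype";
    size_t i, n = sizeof kw - 1;

    if (t->type != TOK_DECLARATION || t->len < n)
        return 0;
    for (i = 0; i < n; i++)
        if (tolower((unsigned char)t->text[i]) != kw[i])
            return 0;
    /* The keyword must end the text or be followed by whitespace. */
    return t->len == n || isspace((unsigned char)t->text[n]);
}
```

With this arrangement the tokenizer needs no DOCTYPE-specific states; the
cost is that the parser has to pick the declaration text apart itself.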

Other than that, the definition is pretty solid and I've come to terms
with the XML-interoperability issues I raised earlier. I've added a
switch to my parser that tells it whether or not to honor RCDATA
sections, and I've resolved never to feed it CDATA. (I know it's not
supposed to be an XML parser.) ~D