[whatwg] Potentially avoidable tokeniser/treebuilder dependency
Øistein E. Andersen
liszt at coq.no
Tue Sep 22 16:01:04 PDT 2009
As currently specified, the tokeniser is mostly, but not completely,
independent of the treebuilder.
The major obstacle to an independent tokeniser seems to be that the
content model flag is set to RCDATA, RAWTEXT or PLAINTEXT by the
treebuilder and not by the tokeniser. In most cases, the new content
model flag is entirely predictable from the start tag (and RCDATA/
RAWTEXT element names are known to the tokeniser already). The only
exceptions I have found so far concern start tags within <select> and
<frameset>, which are dropped by the treebuilder and therefore do not
cause the content model flag to change. Even these cases could
perhaps have been handled by the tokeniser without too much trouble
(and without changing the spec) if it were not for the "in select in
table" insertion mode, where a missing </select> end tag may be
inferred depending on the stack of open elements.
It seems unfortunate to abandon the possibility of an independent
tokeniser just to handle what appears to be a corner case of a corner
case, viz., unclosed RCDATA/RAWTEXT elements inside an unclosed
<select> element in a table. The easiest solution would be to switch
the content model flag upon seeing an RCDATA/RAWTEXT/PLAINTEXT start
tag irrespective of insertion mode, i.e., also within <select> and
<frameset>, which would allow the tokeniser to take care of this
without added complexity. Other solutions might be worth considering
if this is found to be too incompatible with existing pages. (I could
have a look at the http://www.dotnetdotcom.org/ dataset if that
would be of any use.)
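
To make the proposal concrete, here is a minimal sketch of a tokeniser that picks the new content model flag from the start tag name alone, ignoring the insertion mode. Everything in it is hypothetical: the names ContentModel and content_model_after_start_tag are made up for illustration, and the element sets are an illustrative subset rather than a copy of the spec's lists.

```python
from enum import Enum


class ContentModel(Enum):
    """Possible values of the content model flag."""
    PCDATA = "PCDATA"
    RCDATA = "RCDATA"
    RAWTEXT = "RAWTEXT"
    PLAINTEXT = "PLAINTEXT"


# Assumption: illustrative subsets only; a real tokeniser would take
# the complete lists from the specification.
RCDATA_ELEMENTS = frozenset({"title", "textarea"})
RAWTEXT_ELEMENTS = frozenset({"style", "xmp"})
PLAINTEXT_ELEMENTS = frozenset({"plaintext"})


def content_model_after_start_tag(tag_name: str) -> ContentModel:
    """Return the flag to use after a start tag, from the tag name only.

    No treebuilder state is consulted, so a start tag inside <select> or
    <frameset> switches the flag exactly as it would anywhere else.
    """
    name = tag_name.lower()  # tag names are matched case-insensitively
    if name in RCDATA_ELEMENTS:
        return ContentModel.RCDATA
    if name in RAWTEXT_ELEMENTS:
        return ContentModel.RAWTEXT
    if name in PLAINTEXT_ELEMENTS:
        return ContentModel.PLAINTEXT
    return ContentModel.PCDATA
```

With a function of this shape, the treebuilder would never need to call back into the tokeniser; the cost is that a <textarea> start tag dropped inside <select> would still switch the tokeniser to RCDATA, which is the compatibility question raised above.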
(A tiny bit of context: I recently implemented most of the tokeniser
in lex with a view to using it as a tool to investigate the use of
named character references in existing documents. It uses about 20
start conditions instead of the spec's 39 states and two flags, is
fairly compact and readable (500 lines compared to 5,500 in the
Validator.nu implementation), and runs about 35 times faster than the
full Validator.nu HTML Parser (both under highly suboptimal
conditions). Unfortunately, it is of little use without a treebuilder
to set the content model flag. It has been pointed out that use cases
for which a tree is not needed may not require perfect tokenisation;
even if that be true, it is much more difficult to ensure that an
approximate implementation is sufficiently close than to follow the
specification; perhaps more importantly, removing unnecessary
dependencies and allowing the tokeniser to run on its own would also
make it easier to develop and test a tokeniser for use as part of a
full parser.)
--
Øistein E. Andersen