[whatwg] Another bug in the HTML parsing spec?
David Flanagan
dflanagan at mozilla.com
Mon Oct 17 16:44:07 PDT 2011
In the HTML spec, "The rules for parsing tokens in foreign content"
include an algorithm for "any other end tag". This is the algorithm at
the very end of
http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html.
I think there are some problems with this algorithm and would appreciate
any insight anyone has:
1) Step 3 includes an instruction to jump to the last step in the list
of steps. But the last step begins "Otherwise", which sounds like it is
an else clause. Jumping into an else clause is confusing enough that I
wonder if there is an error in the algorithm wording.
2) I can't get all of the parser tests from html5lib to pass with this
algorithm as it is currently written. In particular, there are 5 tests
in testdata/tree-construction/tests9.dat of this basic form:
<!DOCTYPE html><body><table><math><mi>foo</mi></math></table>
As the spec is written, the <mi> element is a MathML text integration
point, so the "foo" text token is handled like regular content, not like
foreign content. And since it is in a table, it isn't inserted right
away but is stored as pending table text. Then, when the </mi> tag is
processed, it is processed as foreign content, going through the
algorithm I'm talking about here. That pops the <mi> element off the
stack, and then reprocesses the </mi> tag as regular content. This
causes the pending table text to be inserted, but since the <mi> has
already been popped off the stack, the text gets inserted into the
<math> element instead of the <mi> element.
The workaround I've found (I'm not confident that this is the correct
workaround) is to change step 3 of the algorithm so that it only pops
the stack if there is no pending table text. Another potential
workaround is to use the existence of pending table text as a condition
for sending tokens to the regular insertion mode rather than treating
them as foreign content.
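To make the ordering problem in (2) concrete, here is a toy model in
Python. It is only a sketch of the sequence described above, not the
spec's algorithm: Node, close_mi and pop_before_flush are hypothetical
names, and the "in table text" flush is reduced to appending the
pending characters to the current node.

```python
class Node:
    """Minimal element: a tag name plus a list of children (nodes or text)."""

    def __init__(self, name):
        self.name = name
        self.children = []


def close_mi(pop_before_flush):
    """Replay the </mi> step for <table><math><mi>foo</mi></math></table>.

    pop_before_flush=True models step 3 popping <mi> before the token is
    reprocessed; False models the first workaround (do not pop while
    pending table text exists). Returns the <math> and <mi> nodes.
    """
    math, mi = Node("math"), Node("mi")
    math.children.append(mi)
    # Stack of open elements just before </mi>; the top is the last item.
    stack = [Node("html"), Node("body"), Node("table"), math, mi]
    pending_table_text = ["foo"]  # stored, not yet inserted

    if pop_before_flush:
        stack.pop()  # foreign-content step 3 pops <mi>

    # Reprocessing the end tag flushes the pending text into the
    # current node, whatever that node happens to be by now.
    stack[-1].children.append("".join(pending_table_text))
    pending_table_text.clear()

    if not pop_before_flush:
        stack.pop()  # <mi> is closed only after the flush
    return math, mi
```

With close_mi(True) the text ends up as a child of <math>, next to an
empty <mi>. With close_mi(False) it ends up inside <mi>, which is what
the tests expect.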
3) In this set of tests
http://code.google.com/p/html5lib/source/browse/testdata/tree-construction/webkit01.dat
there is this test:
<math><mrow><mrow><mn>1</mn></mrow><mi>a</mi></mrow></math>
When the first </mrow> tag is parsed, it is handled as foreign content,
and the inner <mrow> element gets popped off the stack in step 3. Then
the token is reprocessed in body mode, where it falls under the "any
other end tag" case. Since the top of the stack happens to be another
<mrow> element, that one gets popped too. (Other tests don't fail here
because they don't happen to have two elements with the same tag name
on the stack.) This means that the <mi> element ends up as a child of
the <math> element instead of the outer <mrow> element.
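The double pop in (3) can be reproduced with a small Python model of
the stack of open elements. This is a sketch of my reading of the two
"any other end tag" algorithms, not the spec text: the function names,
the jump_to_last_step flag and the tiny SPECIAL set are assumptions
made for illustration.

```python
HTML, MATHML = "html", "mathml"
SPECIAL = {"html", "body", "table"}  # heavily simplified "special" category


def body_any_other_end_tag(stack, tag):
    """'In body' handling: pop up to and including the nearest match."""
    for i in range(len(stack) - 1, -1, -1):
        ns, name = stack[i]
        if name == tag:
            del stack[i:]
            return
        if ns == HTML and name in SPECIAL:
            return  # parse error, token ignored


def foreign_any_other_end_tag(stack, tag, jump_to_last_step=True):
    """Foreign-content handling as read in this message.

    jump_to_last_step=True: after popping in step 3, the token is
    reprocessed by the body rules. False: the steps stop after the pop.
    """
    i = len(stack) - 1
    while True:
        ns, name = stack[i]
        if name.lower() == tag:
            del stack[i:]  # step 3: pop until node has been popped
            if jump_to_last_step:
                body_any_other_end_tag(stack, tag)
            return
        i -= 1
        if i < 0 or stack[i][0] == HTML:
            body_any_other_end_tag(stack, tag)
            return


def stack_before_first_mrow_end():
    """Stack for <math><mrow><mrow><mn>1</mn> just before the first </mrow>."""
    return [(HTML, "html"), (HTML, "body"), (MATHML, "math"),
            (MATHML, "mrow"), (MATHML, "mrow")]
```

Calling foreign_any_other_end_tag(stack, "mrow") on that stack leaves
<math> as the current node, so the following <mi> becomes a child of
<math>. With jump_to_last_step=False only the inner <mrow> is popped
and the outer <mrow> stays current.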
David Flanagan