[whatwg] Another bug in the HTML parsing spec?

Mon Oct 17 16:44:07 PDT 2011

In the HTML spec, "The rules for parsing tokens in foreign content" 
include an algorithm for "any other end tag".  This is the algorithm at 
the very end of 
http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html. 

I think there are some problems with this algorithm and would appreciate 
any insight anyone has:

1) Step 3 includes an instruction to jump to the last step in the list 
of steps.  But the last step begins "Otherwise", which sounds like it is 
an else clause.  Jumping into an else clause is confusing enough that I 
wonder if there is an error in the algorithm wording.

2) I can't get all of the parser tests from html5lib to pass with this 
algorithm as it is currently written.  In particular, there are 5 tests 
in testdata/tree-construction/tests9.dat of this basic form:

<!DOCTYPE html><body><table><math><mi>foo</mi></math></table>

As the spec is written, the <mi> element is a MathML text integration 
point, so the "foo" character tokens are handled like regular content, 
not like foreign content.  And since the parser is in a table, the text 
isn't inserted right away but is stored as pending table text.  Then, 
when the </mi> tag is processed, it is processed as foreign content, 
going through the algorithm I'm talking about here.  That pops the <mi> 
element off the stack, and then reprocesses the </mi> tag as regular 
content.  This causes the pending table text to be inserted, but since 
the <mi> has already been popped off the stack, the text gets inserted 
into the <math> element instead of the <mi> element.

The workaround I've found (I'm not confident that this is the correct 
workaround) is to change step 3 of the algorithm so that it only pops 
the stack if there is no pending table text.  Another potential 
workaround is to use the existence of pending table text as a condition 
for sending tokens to the regular insertion mode rather than treating 
them as foreign content.
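
To make the sequence in (2) concrete, here is a toy Python model of the 
parser state at the moment the </mi> tag arrives.  It is only a sketch 
of the behaviour described above, not the spec's algorithm and not 
html5lib's API; the function name process_mi_end_tag and the flag 
skip_pop_when_text_pending are made up for illustration, and the flag 
stands in for the first workaround (only pop in step 3 when there is no 
pending table text).

```python
def process_mi_end_tag(skip_pop_when_text_pending):
    """Toy model (hypothetical, not html5lib code) of handling </mi> in
    <!DOCTYPE html><body><table><math><mi>foo</mi></math></table>
    """
    # Assumed state just before the </mi> token is processed.
    stack = ["html", "body", "table", "math", "mi"]
    pending_table_text = ["foo"]
    inserted = {}  # element name -> text inserted into it

    # Foreign-content handling of the end tag (step 3 as described above).
    # With the workaround flag set, the pop is skipped while text is pending.
    keep_on_stack = skip_pop_when_text_pending and bool(pending_table_text)
    if stack[-1] == "mi" and not keep_on_stack:
        stack.pop()

    # Reprocessing as regular content first flushes the pending table text
    # into whatever the current node is at that moment.
    if pending_table_text:
        inserted[stack[-1]] = "".join(pending_table_text)
        pending_table_text.clear()

    # Regular end tag handling then closes <mi> if it is still open.
    if stack[-1] == "mi":
        stack.pop()

    return inserted, stack


without_fix, _ = process_mi_end_tag(skip_pop_when_text_pending=False)
with_fix, _ = process_mi_end_tag(skip_pop_when_text_pending=True)
print(without_fix)
print(with_fix)
```

In this model the first call puts "foo" under math (the failure 
described above) and the second puts it under mi.  Both calls leave the 
same stack behind, so the flag only changes where the text lands.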

3) In this set of tests 
http://code.google.com/p/html5lib/source/browse/testdata/tree-construction/webkit01.dat 
there is this test:

<math><mrow><mrow><mn>1</mn></mrow><mi>a</mi></mrow></math>

When the first </mrow> tag is parsed, it is handled as foreign content, 
and the inner <mrow> element gets popped off the stack in step 3.  Then 
the token is reprocessed in body mode, where it is treated in the "any 
other end tag" case.  Since the top of the stack happens to be another 
<mrow> element, that one gets popped too.  (Other tests don't fail here 
because they don't happen to have two elements with the same tag name on 
the stack.)  This means that the <mi> element ends up as a child of the 
<math> element instead of the outer <mrow> element.
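
The double pop in (3) can be reproduced with a very small stack model.  
This is a sketch of the behaviour as described in this message, under 
the assumption that body mode simply walks down the stack looking for a 
matching tag name; it ignores the rest of the body-mode rules, and the 
function name end_tag_as_described is hypothetical.

```python
def end_tag_as_described(stack, tag):
    """Toy model (not the spec text) of an end tag that is first handled
    as foreign content and then reprocessed in body mode."""
    stack = list(stack)  # work on a copy

    # Foreign content, step 3: the current node matches, so pop it.
    if stack and stack[-1] == tag:
        stack.pop()

    # Reprocess in body mode, "any other end tag" (simplified): walk down
    # the stack and pop up to and including the first matching element.
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] == tag:
            del stack[i:]
            break

    return stack


# Stack when the first </mrow> is seen in
# <math><mrow><mrow><mn>1</mn></mrow><mi>a</mi></mrow></math>
nested = ["html", "body", "math", "mrow", "mrow"]
print(end_tag_as_described(nested, "mrow"))

# A case without two same-named elements on the stack.
single = ["html", "body", "math", "mrow", "mi"]
print(end_tag_as_described(single, "mi"))
```

In the nested case both mrow entries are removed, leaving math as the 
current node, which is why the following <mi> ends up under <math>.  In 
the second case only mi is removed, which matches the observation that 
other tests are unaffected.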

     David Flanagan