[whatwg] Another bug in the HTML parsing spec?

Mon Oct 17 17:47:25 PDT 2011

On Mon, 17 Oct 2011, David Flanagan wrote:
>
> In the HTML spec, "The rules for parsing tokens in foreign content" 
> include an algorithm for "any other end tag".  This is the algorithm at 
> the very end of 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html.
> 
> I think there are some problems with this algorithm and would appreciate 
> any insight anyone has:
> 
> 1) Step 3 includes an instruction to jump to the last step in the list 
> of steps.  But the last step begins "Otherwise", which sounds like it is 
> an else clause.  Jumping into an else clause is confusing enough that I 
> wonder if there is an error in the algorithm wording.

Yeah, that's bogus. The "last step" it's referring to has been removed (it 
used to reset the insertion mode). I've fixed the spec.

> 2) I can't get all of the parser tests from html5lib to pass with this
> algorithm as it is currently written.  In particular, there are 5 tests in
> testdata/tree-construction/tests9.dat of this basic form:
> 
> <!DOCTYPE html><body><table><math><mi>foo</mi></math></table>
> 
> As the spec is written, the <mi> tag is a text integration point, so the "foo"
> text token is handled like regular content, not like foreign content.

Oh, my, yeah, that's all kinds of wrong. The text node should be handled 
as if the parser were in the "in body" mode, not "in table". I'll have 
to study this more closely.

I think this broke when we moved away from using an insertion mode for 
foreign content.
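
For reference, the dispatch decision David describes in point 2 can be 
sketched as below. This is an illustration only, not spec text: the 
function name and the (namespace, name) tuple are made up for the example, 
and it covers only character tokens, not the start-tag cases.

```python
MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# MathML elements that act as text integration points.
TEXT_INTEGRATION_POINTS = {"mi", "mo", "mn", "ms", "mtext"}


def handled_as_html(current_node, token_kind):
    """Return True if a character token should go to the current (HTML)
    insertion mode rather than the foreign-content rules.

    current_node is a hypothetical (namespace, local_name) pair.
    """
    namespace, name = current_node
    return (
        namespace == MATHML_NS
        and name in TEXT_INTEGRATION_POINTS
        and token_kind == "character"
    )
```

With <mi> as the current node, "foo" is sent to whatever insertion mode is 
current, which in the test above is "in table" rather than "in body".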

Henri, do you know how Gecko gets this right currently?

> The workaround I've found (I'm not confident that this is the correct 
> workaround) is to change step 3 of the algorithm so that it only pops 
> the stack if there is no pending table text.  Another potential 
> workaround is to use the existence of pending table text as a condition 
> for sending tokens to the regular insertion mode rather than treating 
> them as foreign content.

We shouldn't be ending up with pending table text here at all. It should 
go straight into the mi element.

> 3) In this set of tests
> http://code.google.com/p/html5lib/source/browse/testdata/tree-construction/webkit01.dat
> there is this test:
> 
> <math><mrow><mrow><mn>1</mn></mrow><mi>a</mi></mrow></math>
> 
> When the first </mrow> tag is parsed, it is handled as foreign content, 
> and the inner <mrow> gets popped off the stack in step 3. Then the token 
> is reprocessed in body mode, where it falls into the "any other end tag" 
> case.  Since the top of the stack happens to be another <mrow> element, 
> that one gets popped too.  (Other tests don't fail here because they 
> don't happen to have two elements with the same tag name on the stack.)  
> This means that the <mi> element ends up as a child of the <math> element 
> instead of the outer <mrow> element.

That should be fixed with the updated spec text now, right?
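
To make the failure in point 3 concrete, here is a toy model of the stack 
of open elements. It is a sketch under loud assumptions: it is not the 
spec algorithm, parse() and its reprocess_end_tag flag are invented for 
the example, the "in body" step only looks at the top of the stack, and 
it assumes the fix amounts to stopping once the matching element has been 
popped.

```python
TOKENS = [
    ("start", "math"), ("start", "mrow"), ("start", "mrow"),
    ("start", "mn"), ("text", "1"), ("end", "mn"),
    ("end", "mrow"),
    ("start", "mi"), ("text", "a"), ("end", "mi"),
    ("end", "mrow"), ("end", "math"),
]


def parse(tokens, reprocess_end_tag):
    """Build a (name, children) tree using only a stack of open elements."""
    root = ("#root", [])
    stack = [root]
    for kind, name in tokens:
        if kind == "start":
            node = (name, [])
            stack[-1][1].append(node)
            stack.append(node)
        elif kind == "end" and stack[-1][0] == name:
            # Foreign-content step: pop the matching element.
            stack.pop()
            # Old behaviour: the same token is reprocessed in body mode,
            # which pops again if the new top has the same tag name.
            if reprocess_end_tag and len(stack) > 1 and stack[-1][0] == name:
                stack.pop()
        # Text tokens are ignored; only element nesting matters here.
    return root


def child_names(node):
    return [child[0] for child in node[1]]


old_math = parse(TOKENS, reprocess_end_tag=True)[1][0]
new_math = parse(TOKENS, reprocess_end_tag=False)[1][0]
```

With reprocessing, child_names(old_math) is ["mrow", "mi"], i.e. the <mi> 
ends up under <math>. Without it, <math> has the single outer <mrow> as 
its child and that <mrow> contains the inner <mrow> and the <mi>.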

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'