[whatwg] <!DOCTYPE html><body><table><math><mi>foo</mi></math></table> and other parser questions

Tue Dec 13 14:32:45 PST 2011

On Fri, 14 Oct 2011, David Flanagan wrote:
>
> The "Anything else" case of the in_table insertion mode of the HTML parsing
> spec reads:
> > Process the token using the rules for the "in body" insertion mode, except
> > that if the current node is a table, tbody, tfoot, thead, or tr element,
> > then, whenever a node would be inserted into the current node, it must
> > instead be foster parented.
> I think that this is actually incorrect (or at least very misleading) as it is
> worded.  In order to get correct parsing results, it appears that you have to
> do this:
> 
> Process the token using the rules for the "in body" insertion mode, except
> that whenever a node would be inserted into the current node and the current
> node is a table, tbody, tfoot, thead, or tr element, then the node to be
> inserted must instead be foster parented.
> 
> As the spec is currently worded, we are directed to check once whether the
> current node is a table, table section or table row, and then proceed to use
> the rules for the in body mode.  In fact, however, it is necessary to check
> whether the current node is a table, section or row each time a node is to be
> inserted.  This came up for me when a text node was being inserted into a table
> while there was an active formatting element that got reconstructed and foster
> parented.  My reading of the current spec text said that the text node should
> also be foster parented (because I only checked whether the current node was a
> table once), and the text node ended up as a sibling of the active formatting
> element rather than a child of that element.

Agreed that the previous wording was misleading. I've adjusted it. Let me 
know if you think it's still bad.
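
For illustration, here is a minimal Python sketch of the per-insertion 
check described above. It is a toy model, not the spec's algorithm: Node, 
insert_node and foster_parent are hypothetical names, and foster parenting 
is simplified by assuming the last table on the stack has a parent.

```python
TABLE_MODEL = {"table", "tbody", "tfoot", "thead", "tr"}


class Node:
    """Hypothetical toy DOM node; not a real DOM API."""

    def __init__(self, name, text=None):
        self.name = name
        self.text = text
        self.parent = None
        self.children = []

    def append(self, child):
        child.parent = self
        self.children.append(child)

    def insert_before(self, child, ref):
        child.parent = self
        self.children.insert(self.children.index(ref), child)


def foster_parent(stack, node):
    # Simplification (assumption): the last table on the stack has a parent,
    # so the node goes into that parent, immediately before the table.
    table = next(n for n in reversed(stack) if n.name == "table")
    table.parent.insert_before(node, table)


def insert_node(stack, node):
    # The check runs on every insertion, against whatever the current
    # node is at that moment, rather than once per token.
    current = stack[-1]
    if current.name in TABLE_MODEL:
        foster_parent(stack, node)
    else:
        current.append(node)


def demo():
    body, table = Node("body"), Node("table")
    body.append(table)
    stack = [body, table]
    b = Node("b")             # reconstructed active formatting element
    insert_node(stack, b)     # current node is table: foster parented
    stack.append(b)
    text = Node("#text", "x")
    insert_node(stack, text)  # current node is now b: appended to b
    return body, b, text
```

In demo(), the reconstructed b element is foster parented before the 
table, and the text node then becomes a child of b rather than a sibling.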

On Mon, 12 Dec 2011, Adam Barth wrote:
>
> I'm trying to understand how the HTML parsing spec handles the following case:
> 
> <!DOCTYPE html><body><table><math><mi>foo</mi></math></table>
> 
> According to the html5lib test data, we should parse that as follows:
> 
> | <!DOCTYPE html>
> | <html>
> |   <head>
> |   <body>
> |     <math math>
> |       <math mi>
> |         "foo"
> |     <table>
> 
> However, I'm not sure whether that's what the spec actually does.
> 
> Consider the point at which we parse the "f" character token (from "foo").
> The insertion mode will be "in table".  The spec will execute as
> follows:
> 
> -> If the current node is a MathML text integration point and the
>    token is a character token
>   * Process the token according to the rules given in the section
>     corresponding to the current insertion mode in HTML content.
> 
> -> A character token
>   * Let the pending table character tokens be an empty list of tokens.
>   * Let the original insertion mode be the current insertion mode.
>   * Switch the insertion mode to "in table text" and reprocess the token.
> 
> -> Any other character token
>   * Append the character token to the pending table character tokens list.
> 
> ... the "o" and "o" will be processed similarly and end up in the
> pending table character tokens list.
> 
> Now, consider the </mi> token.  We're still at a MathML text
> integration point, but the current token is neither a start token
> (with certain names) nor a character token, so we process the token
> according to the rules given in the section for parsing tokens in
> foreign content.
> 
> -> Any other end tag
>   * Run these steps:
>     ...
> 
> The net result of which is popping the stack of open elements, but not
> flushing out the pending table character tokens list.  The list will
> eventually be flushed when we process the </table> token, resulting
> in these character tokens getting foster parented:
> 
> | <!DOCTYPE html>
> | <html>
> |   <head>
> |   <body>
> |     <math math>
> |       <math mi>
> |     "foo"
> |     <table>

On Tue, 18 Oct 2011, David Flanagan wrote:
>
> Here's my current workaround:
> 
> In 13.2.5, in the rules for whether to use the current insertion mode or 
> to insert the token as foreign content, if the token is being inserted 
> because the current node is a math (or HTML, but I'm not sure about 
> that) integration point, then first set a text_integration_mode flag, 
> then invoke the current insertion mode, then clear the flag.
> 
> And in the in table insertion mode, when a character token is inserted, 
> and the text_integration_mode flag is set, then just process the token 
> using in body mode, and otherwise follow the directions that are there 
> now.
> 
> I'm not sure that is the best way to fix the spec, but it works for me, 
> in the sense that my parser now passes the tests.
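
As a rough illustration of the workaround quoted above, here is a toy 
Python sketch. ToyParser, dispatch and the tuple token shape are 
hypothetical; only the text_integration_mode flag name comes from the 
description. The foreign-content branch of the real dispatcher is left out.

```python
class ToyParser:
    """Hypothetical toy parser that only records which rules handled a token."""

    def __init__(self):
        self.text_integration_mode = False
        self.insertion_mode = self.in_table
        self.log = []

    def dispatch(self, token, at_text_integration_point):
        if at_text_integration_point:
            # Set the flag, invoke the current insertion mode, clear the flag.
            self.text_integration_mode = True
            try:
                self.insertion_mode(token)
            finally:
                self.text_integration_mode = False
        else:
            self.insertion_mode(token)

    def in_table(self, token):
        kind, data = token
        if kind == "character" and self.text_integration_mode:
            # Workaround: skip the table text buffering, use in body rules.
            self.in_body(token)
        elif kind == "character":
            self.log.append(("pending table text", data))
        else:
            self.log.append(("in table", data))

    def in_body(self, token):
        self.log.append(("in body", token[1]))
```

With this sketch, a character token dispatched at a text integration 
point is logged as "in body", while one dispatched elsewhere in the table 
is logged as "pending table text".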

I think the real problem is that there's no need to go into the "in table 
text" mode if the current node is not a table model element. So I've 
changed the spec at that point.
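
A minimal Python sketch of that guard, for illustration only. It assumes 
"table model element" means the table, tbody, tfoot, thead, tr list 
quoted earlier; in_table_character and the dict-based nodes are 
hypothetical, and buffering stands in for the real "in table text" mode.

```python
TABLE_MODEL = {"table", "tbody", "tfoot", "thead", "tr"}


def in_table_character(stack, pending, char):
    """Toy handler for one character token in the "in table" mode."""
    current = stack[-1]
    if current["name"] in TABLE_MODEL:
        # Stands in for switching to "in table text" and buffering there.
        pending.append(char)
        return "in table text"
    # "Anything else": in body rules. The current node is not a table
    # model element, so nothing is foster parented here.
    current["text"] += char
    return "in body"


def demo():
    mi = {"name": "mi", "text": ""}
    stack = [
        {"name": "body", "text": ""},
        {"name": "table", "text": ""},
        {"name": "math", "text": ""},
        mi,
    ]
    pending = []
    modes = [in_table_character(stack, pending, c) for c in "foo"]
    return mi["text"], pending, modes
```

In demo(), the current node is mi, so "foo" is appended to mi directly 
and the pending table character tokens list stays empty.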

Please let me know if that doesn't fix the test case or causes any other 
regressions.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'