[whatwg] On tag inference

Ian Hickson ian at hixie.ch
Fri Mar 10 15:40:40 PST 2006


(Blake, particular input from you is requested near the bottom of this 
e-mail, regarding whether Mozilla is willing to change behaviour or 
whether the spec should change instead.)

On Thu, 1 Sep 2005, Henri Sivonen wrote:
> 
> End tag inference
> 
> I made the following list based on the HTML 4.01 Transitional DTD. 
> Before the colon on each line there is an element whose end tag is 
> optional. After the colon, there is the list of elements whose start tag 
> can cause the end tag to be inferred.
> 
> p: p, h1, h2, h3, h4, h5, h6, ol, ul, pre, dl, div, center, noscript,
> noframes, blockquote, form, isindex, hr, table, fieldset, address

Add to this list: dir, listing, menu.

I disagree about: noscript and noframes (only Mozilla closes p on 
noframes, nobody does for noscript).

form is special; if there is already an open form, it doesn't close p.

isindex is special: it itself doesn't close a p, but it implies an <hr> 
start tag which closes an open p (and for that matter, the open <form> 
that the <isindex> implies).

Note that this closing only happens if the <p> on the stack is nearer to 
the top of the stack than the nearest table, caption, td, th, button, 
marquee or object, or html element.
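
As a rough illustration, the rules above for <p> could be sketched in 
Python as follows. The names P_CLOSERS, SCOPE_BARRIERS, p_in_scope and 
start_tag_closes_p are made up for this sketch; the stack is a plain list 
with the top at the end. isindex is deliberately left out, on the 
assumption that the closing is done by the tags it implies, not by isindex 
itself.

```python
# Hypothetical sketch; none of these names come from a spec or a browser.
P_CLOSERS = {
    "p", "h1", "h2", "h3", "h4", "h5", "h6", "ol", "ul", "pre", "dl", "div",
    "center", "blockquote", "form", "hr", "table", "fieldset", "address",
    "dir", "listing", "menu",  # the three additions; noscript/noframes dropped
}
SCOPE_BARRIERS = {
    "table", "caption", "td", "th", "button", "marquee", "object", "html",
}


def p_in_scope(stack):
    """True if a <p> is nearer the top of the stack than any barrier element."""
    for name in reversed(stack):
        if name == "p":
            return True
        if name in SCOPE_BARRIERS:
            return False
    return False


def start_tag_closes_p(stack, tag, form_open=False):
    """Decide whether the start tag `tag` implies a </p>."""
    if tag not in P_CLOSERS:
        return False
    if tag == "form" and form_open:
        # form is special: a second form does not close the paragraph
        return False
    return p_in_scope(stack)
```

For example, start_tag_closes_p(["html", "body", "p"], "div") is true, 
while the same call with a <button> opened after the <p> is false.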


> li: li
> dt: dt, dd
> dd: dt, dd

Yes, though the closing only happens if the <li>, <dt> or <dd> is on the 
stack nearer to the end of the stack than the nearest element that is not 
a formatting, phrasing, div, or address element.
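
A minimal Python sketch of that stack walk, under loud assumptions: 
SKIPPABLE is only a small illustrative sample of "formatting, phrasing, div, 
or address" elements (the full sets are not given here), and CLOSES and 
find_item_to_close are names invented for the example. The stack is a list 
with the top at the end.

```python
# Hypothetical sketch. CLOSES maps an incoming start tag to the open
# elements it may close, following the li/dt/dd lines quoted above.
CLOSES = {"li": {"li"}, "dt": {"dt", "dd"}, "dd": {"dt", "dd"}}

# Assumption: an incomplete, illustrative sample of elements the walk may
# step over. The real categories are larger.
SKIPPABLE = {"div", "address", "b", "i", "em", "strong", "a", "span", "font"}


def find_item_to_close(stack, tag):
    """Return the index of the open element that `tag` closes, or None."""
    targets = CLOSES.get(tag)
    if not targets:
        return None
    for index in range(len(stack) - 1, -1, -1):
        name = stack[index]
        if name in targets:
            return index
        if name not in SKIPPABLE:
            # Some other element is in the way; nothing gets closed.
            return None
    return None
```

With this model an <li> inside a nested <ul> does not close the outer 
<li>, because the <ul> stops the walk.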


> thead: tfoot, tbody
> tfoot: tbody
> tbody: tbody

All three get closed by any of the other three.


> colgroup: colgroup, thead, tfoot, tbody, tr

colgroup gets closed by anything but <col> and </col>.
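
One possible reading of the last two answers, as a Python sketch. It 
assumes that "any of the other three" includes a second start tag of the 
same kind (as in the quoted "tbody: tbody"), and it only models start and 
end tag tokens, not text. ROW_GROUPS and token_closes are names invented 
for the example.

```python
# Hypothetical sketch of one interpretation, not a statement of the spec.
ROW_GROUPS = {"thead", "tfoot", "tbody"}


def token_closes(open_element, kind, name):
    """Return True if a tag token implies the end of `open_element`.

    `kind` is "start" or "end"; `name` is the tag name of the token.
    """
    if open_element in ROW_GROUPS:
        # Assumption: any row group start tag closes any open row group.
        return kind == "start" and name in ROW_GROUPS
    if open_element == "colgroup":
        # Everything except <col> and </col> closes an open colgroup.
        return name != "col"
    return False
```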


> tr: tr, tfoot, tbody
> td: td, th, tr, tfoot, tbody
> th: td, th, tr, tfoot, tbody

The exact details for these are pretty convoluted.


> html:
> body:

Correct.


> head: ANY BUT script, style, meta, link, object, title, isindex, base

I've removed object and isindex from this list.


> How should this list be augmented for HTML5? Eg. should a start tag for 
> <section> close a paragraph?

New tags in HTML5 are getting added in exactly three places in the current 
spec:

  1. Inline elements get added to the "phrasing" entry (the default for 
     unknown elements, so nothing needs to change for these).

  2. Block elements get added to the same start tag / end tag entries as 
     the <blockquote> element.

  3. Empty elements get added to the same entry as "img".


> Start tag inference
> 
>  * If the top of the stack is 'table' and the element start is 'tr', infer
> 'tbody'.
>  * If the stack is empty and the element start is anything but 'html', infer
> 'html'.
>  * If the top of the stack is 'html', the element start is not 'head' and
> 'head' has not been seen yet, infer 'head'.
>  * If the top of the stack is 'html', the element start is not 'body' and
> 'head' has been seen, infer 'body'.

The real behaviour is far more complicated than this, I think.
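
For reference, the four quoted rules on their own (not the more 
complicated real behaviour) could be written down like this in Python. 
infer_start_tag and the head_seen flag are names chosen for this sketch; 
the stack is a list with the top at the end.

```python
# Hypothetical sketch of the four quoted rules only.
def infer_start_tag(stack, tag, head_seen):
    """Return the element to imply before the start tag `tag`, or None."""
    if not stack:
        # Empty stack: anything but 'html' implies 'html'.
        return None if tag == "html" else "html"
    top = stack[-1]
    if top == "table" and tag == "tr":
        return "tbody"
    if top == "html":
        if tag != "head" and not head_seen:
            return "head"
        if tag != "body" and head_seen:
            return "body"
    return None
```

Each call implies at most one element, so a caller would have to push the 
result and ask again until None comes back.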


> Should (in memory of HTML 4.01 Transitional) character data imply the 
> start of body?

Yes.


> > As far as I can tell, there are four kinds of inference needed when 
> > parsing *conforming* documents (ie. no second stack for residual 
> > style):
> > 1) Element end causes the end of the element that is on the top of 
> > the stack*.
> 
> If the top of the stack does not match the element end event, see if the 
> top of the stack is on the list of elements whose end tag is optional. 
> Pop and report the end of the popped element if yes. Err if not. Repeat.
>
> > 2) End of the data stream causes the end of the element that is on the 
> > top of the stack.
> 
> See if the top of the stack is on the list of elements whose end tag is 
> optional. Pop and report the end of the popped element if yes. Err if 
> not. Repeat.

It's a lot more complicated than these in practice!
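
To make the comparison concrete, here is the simplified procedure from the 
quoted text (and only that, not what browsers do) as a Python sketch. 
OPTIONAL_END is taken from the list earlier in this thread; InferenceError, 
handle_end_tag and handle_end_of_stream are names invented for the 
example.

```python
# Hypothetical sketch of the simplified, quoted procedure.
OPTIONAL_END = {
    "p", "li", "dt", "dd", "thead", "tfoot", "tbody", "colgroup",
    "tr", "td", "th", "html", "body", "head",
}


class InferenceError(Exception):
    """Raised where the quoted procedure says "Err"."""


def handle_end_tag(stack, name):
    """Pop elements with optional end tags until `name` is on top, then pop it.

    Returns the names of the elements whose ends were reported, in order.
    """
    reported = []
    while stack and stack[-1] != name:
        if stack[-1] not in OPTIONAL_END:
            raise InferenceError("cannot infer end of <%s>" % stack[-1])
        reported.append(stack.pop())
    if not stack:
        raise InferenceError("no open <%s> element" % name)
    reported.append(stack.pop())
    return reported


def handle_end_of_stream(stack):
    """At end of input, every remaining element must have an optional end tag."""
    reported = []
    while stack:
        if stack[-1] not in OPTIONAL_END:
            raise InferenceError("unclosed <%s> at end of stream" % stack[-1])
        reported.append(stack.pop())
    return reported
```

Note that the sketch mutates the stack as it goes, so an error can leave 
some elements already popped.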


> > 3) Element start causes the end of the element that is on the top of 
> > the stack.
> > 4) Element start causes another element start before itself.
> 
> a) Perform end tag inference repeatedly according to the lists given above
> until no inference can be made.
> b) Perform the start tag inference once.
> Repeat from a) until additional inference cannot be performed. Then let the
> original element start go through.
> 
> Is this correct for *conforming* documents (ie. without residual style, 
> etc.)?

I'm not sure. I haven't tried to make the distinction.
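
The a)/b) loop in the quoted text could be sketched in Python like this. 
It is only the control flow: the two rule sets are passed in as functions, 
and process_start_tag, should_close_top and implied_start are names made up 
for the sketch. should_close_top(stack, tag) says whether the element on 
top of the stack gets an inferred end tag; implied_start(stack, tag) 
returns the name of one element to infer, or None.

```python
# Hypothetical sketch of the quoted loop; the rules themselves are supplied
# by the caller.
def process_start_tag(stack, tag, should_close_top, implied_start):
    """Run end/start tag inference for `tag`, then push `tag` itself.

    Returns the list of ("start" | "end", name) events that were generated.
    """
    events = []
    while True:
        progressed = False
        # a) end tag inference, repeated until nothing more can be inferred
        while stack and should_close_top(stack, tag):
            events.append(("end", stack.pop()))
            progressed = True
        # b) start tag inference, performed once
        implied = implied_start(stack, tag)
        if implied is not None:
            stack.append(implied)
            events.append(("start", implied))
            progressed = True
        if not progressed:
            break
    # Finally let the original element start go through.
    stack.append(tag)
    events.append(("start", tag))
    return events
```

The loop terminates only if the supplied rules eventually stop inferring 
anything, which is the caller's responsibility in this sketch.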


On Sun, 4 Sep 2005, Anne van Kesteren wrote:
>
> > How should this list be augmented for HTML5? Eg. should a start tag 
> > for <section> close a paragraph?
> 
> I do not think that would be a good idea. Changing the HTML parser is 
> not something browser vendors tend to like. I think HTML5 should just 
> require end tags in certain circumstances. Or perhaps just require them 
> for some elements, like the P element.

I think this:

   <p>Hello
   <section>
     <p> Hello
   </section>
   <p>Hello

...needs to work as indented, not as:

  # BODY
    * P
       o #text: Hello
       o SECTION
    * P
       o #text: Hello
    * P
       o #text: Hello

...which is what happens today in Mozilla and Safari (let alone something 
worse, such as what happens in Opera 8.x).

Sure, this means styling won't work in legacy UAs, but then it wouldn't 
anyway (see above). We can't simply never introduce new block-level 
elements. :-)
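
A toy Python model of the difference, for illustration only. It is not 
any browser's algorithm; it just reproduces the two trees above from a 
hand-written token list. build and block_level are names invented for the 
sketch, and whitespace text is left out of the tokens.

```python
# Hypothetical toy tree builder; nodes are (name, children) tuples.
def build(tokens, block_level):
    """Build a tree under BODY; `block_level` lists tags that close an open <p>."""
    root = ("body", [])
    stack = [root]

    def close(name):
        # Pop up to and including the nearest open `name`; ignore if none is open.
        if name in [node[0] for node in stack[1:]]:
            while stack.pop()[0] != name:
                pass

    for token in tokens:
        if token.startswith("</"):
            close(token[2:-1])
        elif token.startswith("<"):
            name = token[1:-1]
            if name == "p" or name in block_level:
                close("p")
            node = (name, [])
            stack[-1][1].append(node)
            stack.append(node)
        else:
            stack[-1][1].append(token)  # character data
    return root


tokens = ["<p>", "Hello", "<section>", "<p>", "Hello", "</section>",
          "<p>", "Hello"]
as_indented = build(tokens, {"section"})  # section treated as block-level
legacy = build(tokens, set())             # section treated as unknown inline
```

Here as_indented has P, SECTION (containing a P) and P as siblings under 
BODY, while legacy has the empty SECTION stuck inside the first P.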


On Sun, 4 Sep 2005, Henri Sivonen wrote:
> 
> What about the interaction of <section> with <head> and <body>?

<section> (and any unknown tag) must close <head>. Browsers are all over 
the place when it comes to unknown tags in <head>; Safari happens to do 
what I'm suggesting, Mozilla and IE treat all unknown elements in <head> 
as if they were leaves (though in slightly different ways). Opera has a 
number of bugs with <head>.

In practice I don't think there are many pages that depend on unknown 
element parsing in <head>.


> How would you insert the optional tags in this case:
> 
> <!DOCTYPE html>
> <title>...</title>
> <section>...</section>
> <div>...</div>
> 
> ?

 <!DOCTYPE html>
 <html><head><title>...</title>
 </head><body><section>...</section>
 <div>...</div></body></html>


> My tentative assumption has been
> <!DOCTYPE html>
> <html><head><title>...</title>
> </head><body><section>...</section>
> <div>...</div></body></html>
> 
> (Assuming the data stream ends with </div>.)

Yes, exactly.


> Should the answer instead be
> 
> <!DOCTYPE html>
> <html><head><title>...</title>
> <section>...</section>
> </head><body><div>...</div></body></html>
> 
> ?

Well, it could _at most_ be:

 <!DOCTYPE html>
 <html><head><title>...</title>
 <section></head><body>...
 <div>...</div></body></html>

...and I don't think that's what we want.


On Sun, 4 Sep 2005, Henri Sivonen wrote:
>
> So I did, but I now think my reasoning leading to that conclusion was 
> flawed. (I should have corrected myself on this mailing list earlier.) 
> If only a closed list of elements is allowed in HEAD, it will be 
> impossible to introduce extensions that occur as children of HEAD. An 
> explicit list of element starts that close HEAD would solve that 
> problem.
> 
> The implicit boundary of HEAD and BODY is a tricky issue with open-ended 
> extensions. I guess I'll have to examine browser behavior as the next 
> step.

I don't see a problem with simply adding new elements to the head as we 
want to. That's how <style> was added, for example. It's no big deal.


On Sun, 4 Sep 2005, Henri Sivonen wrote:
> 
> It appears that my first attempt was compatible with browsers and 
> introducing new child elements of HEAD is already impossible.

It's impossible only if you want to keep the same DOM across versions of 
the browser. I don't think this is needed for new elements, so long as 
the rendering is compatible and no pages break.


On Fri, 9 Sep 2005, Henri Sivonen wrote:
> 
> I added bgsound to the elements that do not close head.

In the spec, bgsound doesn't work in the head, only in the body. Should we 
change that? Only half the browsers I tested put it in the head.


On Fri, 9 Sep 2005, Blake Kaplan wrote:
> 
> For compatibility, Mozilla has to allow user-defined tags as leaves in 
> the head of a document. People use them (at least for XBL bindings (at 
> least that's what the bug that caused us to implement this behavior 
> said)).

Is this something you'd consider changing back? Can you tell us which 
sites this behaviour was changed for?

I suppose we could put this behaviour in HTML5, but it doesn't seem like a 
good forward-compatible design.


On Wed, 21 Sep 2005, Anne van Kesteren wrote:
>
> I think you leave out presentational elements. I am pretty sure HTML 5 
> will not introduce html:bgsound.

The parser part has to handle all the elements.


On Mon, 19 Sep 2005, Henri Sivonen wrote:

> On Sep 1, 2005, at 20:48, Henri Sivonen wrote:
> 
> > End tag inference
> 
> > p: p, h1, h2, h3, h4, h5, h6, ol, ul, pre, dl, div, center, noscript,
> > noframes, blockquote, form, isindex, hr, table, fieldset, address
> 
> Anne's new design shows that the above list is incomplete. Since paragraphs
> may appear in definitions, list items and table cells, which themselves may
> omit their end tags, the list of elements whose start tags close 'p' needs to
> also include the elements whose start tag closes dd, li, td and th.
> 
> > dd: dt, dd
> > li: li
> > td: td, th, tr, tfoot, tbody

The way I handled this in the spec is by saying that in e.g.:

   <dd>
    <p>
   <dt>

...the <dt> closes the <dd> by implying a </dd>, and it is the </dd> that 
closes the <p> (by implying a </p>).
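
That chaining can be sketched in Python as below. This is a deliberately 
tiny model: end_tag and start_dt are invented names, only <p> is treated 
as implicitly closable, and the check for an open <dd> is simplified to a 
membership test on the stack.

```python
# Hypothetical sketch of the chain: <dt> implies </dd>, and </dd> implies </p>.
def end_tag(stack, name):
    """Process a (possibly implied) end tag and return the tags it emits."""
    events = []
    if name != "p" and stack and stack[-1] == "p":
        # The end tag first implies </p> for a paragraph left open on top.
        events += end_tag(stack, "p")
    if stack and stack[-1] == name:
        stack.pop()
        events.append("</%s>" % name)
    return events


def start_dt(stack):
    """Process a <dt> start tag: imply </dd> if a dd is open, then push dt."""
    events = end_tag(stack, "dd") if "dd" in stack else []
    stack.append("dt")
    events.append("<dt>")
    return events
```

For the stack html, body, dl, dd, p this yields </p>, </dd>, <dt>, so the 
<dt> rule itself never needs to know about <p>.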

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


