[whatwg] On tag inference

Thu Sep 1 10:48:37 PDT 2005

On Aug 29, 2005, at 22:29, Henri Sivonen wrote:

> What kind of approach to tag inference can HTML5 be expected to take? 
> For an SGML validator that is parsing HTML 4 the set of possible 
> element names is finite. However, a browser needs to deal with an 
> infinite set of potential element names. Therefore, it makes a 
> difference whether end tag inference is based on what is allowed as a 
> child of an element or on what elements are not allowed.
>
> Example:
> <p><foo>
> Is 'foo' an element that is not allowed as a child of 'p' and, therefore, 
> implicitly closes the 'p'? Or is 'foo' not on the list of elements 
> that close 'p' and, therefore, does not implicitly close it? Which way 
> are the inference rules going to be defined?

I think the latter approach should be chosen, because otherwise it 
would be impossible to extend HTML in the future with an element that 
can occur as a child of 'p'.

Therefore:

End tag inference

I made the following list based on the HTML 4.01 Transitional DTD. 
On each line, the element before the colon is one whose end tag is 
optional, and the elements after the colon are those whose start tag 
causes that end tag to be inferred. How should this list be augmented 
for HTML5? E.g., should a start tag for <section> close a paragraph?

p: p, h1, h2, h3, h4, h5, h6, ol, ul, pre, dl, div, center, noscript, 
noframes, blockquote, form, isindex, hr, table, fieldset, address
dt: dt, dd
dd: dt, dd
li: li
thead: tfoot, tbody
tfoot: tbody
tbody: tbody
colgroup: colgroup, thead, tfoot, tbody, tr
tr: tr, tfoot, tbody
td: td, th, tr, tfoot, tbody
th: td, th, tr, tfoot, tbody
html:
body:
head: ANY BUT script, style, meta, link, object, title, isindex, base
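
For illustration, the list above can be written down as a lookup table. 
This is only a sketch of one possible encoding: the names 
END_TAG_CLOSERS, HEAD_CONTENT and start_tag_closes are made up for this 
example and come from no specification. Note that 'head' is the one 
inverted entry, so an unknown element such as 'foo' closes 'head' but 
not 'p'.

```python
# Elements whose end tag is optional, mapped to the start tags that imply it.
# Transcribed from the list above; all names here are illustrative only.
END_TAG_CLOSERS = {
    "p": {"p", "h1", "h2", "h3", "h4", "h5", "h6", "ol", "ul", "pre", "dl",
          "div", "center", "noscript", "noframes", "blockquote", "form",
          "isindex", "hr", "table", "fieldset", "address"},
    "dt": {"dt", "dd"},
    "dd": {"dt", "dd"},
    "li": {"li"},
    "thead": {"tfoot", "tbody"},
    "tfoot": {"tbody"},
    "tbody": {"tbody"},
    "colgroup": {"colgroup", "thead", "tfoot", "tbody", "tr"},
    "tr": {"tr", "tfoot", "tbody"},
    "td": {"td", "th", "tr", "tfoot", "tbody"},
    "th": {"td", "th", "tr", "tfoot", "tbody"},
    "html": set(),  # end tag optional, but no start tag implies it
    "body": set(),
}

# 'head' is inverted: any start tag EXCEPT these implies </head>.
HEAD_CONTENT = {"script", "style", "meta", "link", "object", "title",
                "isindex", "base"}


def start_tag_closes(open_element, start_tag):
    """Return True if start_tag implies the end tag of open_element."""
    if open_element == "head":
        return start_tag not in HEAD_CONTENT
    # Elements missing from the table are never closed implicitly.
    return start_tag in END_TAG_CLOSERS.get(open_element, set())


print(start_tag_closes("p", "foo"))       # False: 'foo' is not on p's list
print(start_tag_closes("p", "div"))       # True
print(start_tag_closes("head", "title"))  # False
print(start_tag_closes("head", "p"))      # True
```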

Start tag inference

  * If the top of the stack is 'table' and the element start is 'tr', 
infer 'tbody'.
  * If the stack is empty and the element start is anything but 'html', 
infer 'html'.
  * If the top of the stack is 'html', the element start is not 'head' 
and 'head' has not been seen yet, infer 'head'.
  * If the top of the stack is 'html', the element start is not 'body' 
and 'head' has been seen, infer 'body'.
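
As a rough sketch, the four rules above fit in one function. The 
function name and the head_seen flag are assumptions of this example; 
the stack is taken to be a list of open element names with the top at 
the end.

```python
def infer_start_tag(stack, start_tag, head_seen):
    """Return the element to infer before start_tag, or None.

    stack:     list of open element names, top of the stack last
    head_seen: True once a 'head' element has been opened
    """
    if not stack:
        # Empty stack: anything but 'html' implies 'html'.
        return "html" if start_tag != "html" else None
    top = stack[-1]
    if top == "table" and start_tag == "tr":
        return "tbody"
    if top == "html":
        if not head_seen and start_tag != "head":
            return "head"
        if head_seen and start_tag != "body":
            return "body"
    return None


print(infer_start_tag([], "p", False))                         # html
print(infer_start_tag(["html"], "p", False))                   # head
print(infer_start_tag(["html"], "p", True))                    # body
print(infer_start_tag(["html", "body", "table"], "tr", True))  # tbody
```

Each call infers at most one element, so a caller has to invoke it 
repeatedly to get from an empty stack to 'html' and then on to 'head' 
or 'body'.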

Should (in memory of HTML 4.01 Transitional) character data imply the 
start of body?

> As far as I can tell, there are four kinds of inference needed when 
> parsing *conforming* documents (i.e. no second stack for residual 
> style):
> 1) Element end causes the end of the element that is on the top of 
> the stack*.

If the top of the stack does not match the element end event, see if 
the top of the stack is on the list of elements whose end tag is 
optional. If it is, pop it and report its end; if not, report an error. 
Repeat until the top of the stack matches the element end event.

> 2) End of the data stream causes the end of the element that is on the 
> top of the stack.

See if the top of the stack is on the list of elements whose end tag is 
optional. If it is, pop it and report its end; if not, report an error. 
Repeat until the stack is empty.
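
The two pop-and-report loops above (for an element end and for the end 
of the data stream) might look like this. It is a sketch only: 
ParseError and the function names are invented here, "err" is modelled 
as raising an exception, and treating an end tag with no matching open 
element as an error is my assumption, since the text above does not 
cover that case.

```python
# Elements whose end tag is optional (the left-hand column of the list above).
OPTIONAL_END = {"p", "dt", "dd", "li", "thead", "tfoot", "tbody", "colgroup",
                "tr", "td", "th", "html", "body", "head"}


class ParseError(Exception):
    """Raised when an element that requires an end tag is left open."""


def handle_end_tag(stack, name):
    """Pop up to and including `name`; return the end events to report."""
    reported = []
    while stack and stack[-1] != name:
        top = stack[-1]
        if top not in OPTIONAL_END:
            raise ParseError(f"</{name}> seen while <{top}> is still open")
        reported.append(stack.pop())
    if not stack:
        # Assumption: an end tag without a matching open element is an error.
        raise ParseError(f"</{name}> has no matching open element")
    reported.append(stack.pop())  # the element that was actually closed
    return reported


def handle_end_of_stream(stack):
    """Pop everything at end of input; return the end events to report."""
    reported = []
    while stack:
        top = stack[-1]
        if top not in OPTIONAL_END:
            raise ParseError(f"end of stream while <{top}> is still open")
        reported.append(stack.pop())
    return reported


stack = ["html", "body", "ul", "li", "p"]
print(handle_end_tag(stack, "ul"))  # ['p', 'li', 'ul']
print(handle_end_of_stream(stack))  # ['body', 'html']
```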

> 3) Element start causes the end of the element that is on the top of 
> the stack.
> 4) Element start causes another element start before itself.

a) Perform end tag inference repeatedly according to the lists given 
above until no inference can be made.
b) Perform the start tag inference once.
Repeat from a) until additional inference cannot be performed. Then let 
the original element start go through.
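
To make the control flow of steps a) and b) concrete, here is a small 
sketch of the loop. The two rule sets are passed in as functions; 
demo_closes and demo_infer are deliberately abbreviated stand-ins for 
the full lists, and all names are hypothetical.

```python
def process_start_tag(stack, start_tag, closes, infer):
    """Apply steps a) and b) until neither fires, then push start_tag.

    closes(top, start_tag) -> bool         end tag inference for the top
    infer(stack, start_tag) -> str | None  start tag inference
    Returns the events produced, in order.
    """
    events = []
    while True:
        # a) infer end tags for as long as the top of the stack allows it
        while stack and closes(stack[-1], start_tag):
            events.append(("end", stack.pop()))
        # b) infer at most one start tag, then go back to a)
        inferred = infer(stack, start_tag)
        if inferred is None:
            break
        stack.append(inferred)
        events.append(("start", inferred))
    # No further inference is possible: let the original start tag through.
    stack.append(start_tag)
    events.append(("start", start_tag))
    return events


# Abbreviated stand-ins for the full rule lists, for demonstration only.
def demo_closes(top, start_tag):
    table = {"td": {"td", "th", "tr"}, "tr": {"tr"}, "p": {"p", "table"}}
    return start_tag in table.get(top, set())


def demo_infer(stack, start_tag):
    if stack and stack[-1] == "table" and start_tag == "tr":
        return "tbody"
    return None


stack = ["table"]
print(process_start_tag(stack, "tr", demo_closes, demo_infer))
# [('start', 'tbody'), ('start', 'tr')]
print(process_start_tag(stack, "td", demo_closes, demo_infer))
# [('start', 'td')]
print(process_start_tag(stack, "tr", demo_closes, demo_infer))
# [('end', 'td'), ('end', 'tr'), ('start', 'tr')]
```

The loop terminates as long as the rule functions cannot keep inferring 
forever, which holds for the rules in this message because each 
inferred start tag changes the top of the stack to an element that 
triggers no further inference of the same kind.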

Is this correct for *conforming* documents (i.e. without residual style, 
etc.)?

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/