[whatwg] On tag inference

Mon Aug 29 12:29:08 PDT 2005

What kind of approach to tag inference can HTML5 be expected to take? 
For an SGML validator that is parsing HTML 4 the set of possible 
element names is finite. However, a browser needs to deal with an 
infinite set of potential element names. Therefore, it makes a 
difference whether end tag inference is based on what is allowed as a 
child of an element or on what elements are not allowed.

Example:
<p><foo>
Is 'foo' an element that is not allowed as a child of 'p' and, therefore, 
implicitly closes the 'p'? Or is 'foo' not on the list of elements that 
close 'p' and, therefore, does not implicitly close it? Which way are 
the inference rules going to be defined?
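
To make the two options concrete, here is a minimal Python sketch. The 
two tables are hypothetical and deliberately tiny (they are not taken 
from any specification); the point is only that the policies agree on 
known names and disagree on an unknown name such as 'foo'.

```python
# Hypothetical tables for illustration only; real lists would be much longer.
ALLOWED_CHILDREN_OF_P = {"a", "em", "strong", "span"}  # what 'p' may contain
CLOSES_P = {"p", "div", "ul", "table"}                 # what implicitly ends 'p'


def closes_p_by_allow_list(name: str) -> bool:
    """'p' is closed by any start tag that is not a known permitted child."""
    return name not in ALLOWED_CHILDREN_OF_P


def closes_p_by_deny_list(name: str) -> bool:
    """'p' is closed only by start tags explicitly listed as closing it."""
    return name in CLOSES_P


# 'em' and 'div' get the same answer under both policies; 'foo' does not.
for name in ("em", "div", "foo"):
    print(name, closes_p_by_allow_list(name), closes_p_by_deny_list(name))
```

Under the allow-list policy every unknown element closes the 'p'; under 
the deny-list policy every unknown element becomes a child of it.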

As far as I can tell, there are four kinds of inference needed when 
parsing *conforming* documents (i.e. no second stack for residual 
style):
1) Element end causes the end of the element that is on the top of the 
stack*.
2) End of the data stream causes the end of the element that is on the 
top of the stack.
3) Element start causes the end of the element that is on the top of 
the stack.
4) Element start causes another element start before itself.

Is this list complete?

I am assuming that for *conforming* documents, the above-mentioned 
inference decisions can be taken by observing the top of the stack and 
the element name associated with the current end or start element 
event. Correct? (I am assuming the rules may be applied repeatedly. I.e. 
null on stack and start 'title' implies 'html' start. 'html' on stack 
and start 'title' implies 'head' start. 'head' on stack and start 
'title' implies nothing and the start 'title' goes through.)
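
The repeated application in the 'title' example could be sketched as 
follows. This is only an illustration under my own assumptions: the 
names IMPLIED_START and start_element are made up, None stands for 
"null on stack", and the map contains just the two entries needed for 
this example.

```python
from typing import Optional

# Hypothetical map: (top of stack, incoming start tag) -> inferred start tag.
# None represents an empty stack. Only the 'title' example is covered.
IMPLIED_START: dict[tuple[Optional[str], str], str] = {
    (None, "title"): "html",
    ("html", "title"): "head",
}


def start_element(stack: list[str], name: str) -> list[str]:
    """Push `name`, first pushing any inferred starts; return all starts emitted."""
    emitted = []
    while True:
        top = stack[-1] if stack else None
        inferred = IMPLIED_START.get((top, name))
        if inferred is None:
            break  # no rule applies: the start tag goes through
        stack.append(inferred)
        emitted.append(inferred)
    stack.append(name)
    emitted.append(name)
    return emitted


print(start_element([], "title"))  # ['html', 'head', 'title']
```

Each pass looks only at the top of the stack and the incoming name. The 
loop terminates as long as the map never infers an element that leads 
back to a pair already visited.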

It seems to me that #3 is the tricky case in terms of interaction with 
unknown element names. #1 and #2 require a list of elements whose end 
tag is optional. #4 requires a map of top of stack plus current start 
pairs to inferred start tags.
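
For #1 and #2, a sketch of how a list of optional-end-tag elements 
might be used. The set below is a hypothetical sample, and the function 
names are my own; for a non-conforming document (an element on the 
stack that is neither the target nor optional) the sketch simply stops.

```python
# Hypothetical sample of elements whose end tag may be omitted.
END_TAG_OPTIONAL = {"html", "head", "body", "p", "li"}


def end_element(stack, name):
    """Kind #1: an end tag first closes optional-end elements above `name`."""
    emitted = []
    while stack and stack[-1] != name and stack[-1] in END_TAG_OPTIONAL:
        emitted.append(stack.pop())  # inferred end tag
    if stack and stack[-1] == name:
        emitted.append(stack.pop())  # the end tag that was actually seen
    return emitted


def end_of_stream(stack):
    """Kind #2: end of data closes the optional-end elements still open."""
    emitted = []
    while stack and stack[-1] in END_TAG_OPTIONAL:
        emitted.append(stack.pop())
    return emitted


stack = ["html", "body", "ul", "li", "p"]
print(end_element(stack, "ul"))  # ['p', 'li', 'ul']
print(end_of_stream(stack))      # ['body', 'html']
```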

* I am assuming an implementation maintains a stack of open elements or 
works directly on a parser tree in which case the path from the current 
node to the root has the same role as the stack.
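
A small sketch of that equivalence, assuming a hypothetical Node class 
with a parent pointer: walking from the current node to the root yields 
the same names an explicit stack would hold.

```python
class Node:
    """Hypothetical tree node; only the fields needed for the walk."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def open_elements(current):
    """Walk from the current node up to the root; return names root-first."""
    names = []
    node = current
    while node is not None:
        names.append(node.name)
        node = node.parent
    names.reverse()
    return names


html = Node("html")
body = Node("body", html)
p = Node("p", body)
print(open_elements(p))  # ['html', 'body', 'p'], with 'p' as the "top of the stack"
```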

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/