[whatwg] Tag Soup: Blocks-in-inlines
hsivonen at iki.fi
Wed Jan 25 08:21:55 PST 2006
On Jan 25, 2006, at 12:09, Lachlan Hunt wrote:
> This is in response to Hixie's article .
I had had such a strong intuitive assumption of what Gecko and
WebCore were doing that I was surprised to learn their behavior is
indeed much hairier. (I hadn't even verified my assumption by
checking the sources, because it seemed so obvious to me that Gecko &
WebCore were doing what I thought they were doing...)
Anyway, here's what I thought they were doing:
The low-level parser is kind of like a tag-level lexer and emits
a (non-well-formed) sequence of SAX-like events like startTag,
characters, endTag and comment (in my parser*, HtmlParser.java). These
events don't go to the DOM builder / content sink directly. Instead,
there's a filter layer that takes care of tag inference and emits a
well-formed event stream (TagInferenceFilter.java and
EmptyElementFilter.java in my parser). Additionally, there's a filter
(not present in my parser, which is designed for conformance
checking; this may need to be integrated into the tag inference
filter) that performs the "residual style" fixups. It works like this
(assuming that there is no need for legitimate tag inference at the
same time):
A stack is used for keeping track of the open elements. When startTag
is seen, the topmost element of the stack and the name of the new
element are compared to a static table to see if the new element can
occur as a child of the topmost element on the stack. If it can, the
new element is pushed on the stack and echoed forward in the pipeline.
If the element start was for an inline element, a second stack, the
residual style stack, is inspected. This also happens when characters are
reported. If there are items in the residual style stack, the stack
is popped and the popped element is echoed forward in the pipeline
and pushed onto the open element stack. The items on the stacks
include not only element names but attributes as well.
When the residual style stack is empty, the inline content (startTag
of an inline element or characters) from the lower layer is echoed
forward in the pipeline (pushing the element on the open element
stack if it was startTag and not characters).
When an endTag is seen, if it matches the topmost item of the open
element stack, the stack is popped and the endTag event (now actually
an endElement event) is echoed forward in the pipeline.
If, however, the endTag and the open element stack do not match, the
open element stack is searched until the first non-inline element. If
a matching start for the endTag is found before or at the first non-
inline element, the stack is popped and the popped item echoed
forward in the pipeline and pushed onto the residual style stack
until the matching start is found (at which point the element is
closed as above). If the matching start is not found before or at the
first non-inline element on the stack, the endTag event is discarded.
Whenever items are pushed onto the residual style stack, it is
considered an easy parse error.
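
To make the steps above concrete, here is a minimal Java sketch of such a
filter. It is only an illustration under assumptions of mine, not code from
htmlparser.jar: the class name ResidualStyleFilter, the INLINE set and the
string-based output list are hypothetical, the static containment table is
left out (every startTag is accepted), and I read "echoed forward" for an
item popped during endTag recovery as emitting its end tag.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ResidualStyleFilter {
    // Assumption: a tiny stand-in for the real inline/block classification.
    static final Set<String> INLINE = new HashSet<>(
            Arrays.asList("a", "b", "i", "u", "em", "strong", "font", "span"));

    // Stack items carry the attributes as well as the element name.
    static final class Elem {
        final String name;
        final Map<String, String> attrs;
        Elem(String name, Map<String, String> attrs) {
            this.name = name;
            this.attrs = attrs;
        }
    }

    final Deque<Elem> open = new ArrayDeque<>();      // open element stack
    final Deque<Elem> residual = new ArrayDeque<>();  // residual style stack
    final List<String> out = new ArrayList<>();       // events echoed forward
    int parseErrors = 0;

    // Echo a start tag forward and push the element on the open element stack.
    private void emitStart(Elem e) {
        StringBuilder sb = new StringBuilder("<").append(e.name);
        e.attrs.forEach((k, v) ->
                sb.append(' ').append(k).append("=\"").append(v).append('"'));
        out.add(sb.append('>').toString());
        open.push(e);
    }

    // Reopen everything waiting on the residual style stack.
    private void reopenResidualStyles() {
        while (!residual.isEmpty()) {
            emitStart(residual.pop());
        }
    }

    void startTag(String name) {
        startTag(name, Collections.<String, String>emptyMap());
    }

    void startTag(String name, Map<String, String> attrs) {
        // The static "can this be a child of the topmost element" table
        // is omitted in this sketch: every start tag is accepted.
        if (INLINE.contains(name)) {
            reopenResidualStyles();
        }
        emitStart(new Elem(name, attrs));
    }

    void characters(String text) {
        reopenResidualStyles();
        out.add(text);
    }

    void endTag(String name) {
        // Search from the top of the stack down to, and including,
        // the first non-inline element.
        int above = 0;
        boolean found = false;
        for (Elem e : open) { // iterates from the top of the stack downwards
            if (e.name.equals(name)) { found = true; break; }
            if (!INLINE.contains(e.name)) break;
            above++;
        }
        if (!found) {
            return; // no matching start in range: discard the endTag
        }
        // Close the elements above the match and remember them for reopening.
        for (int k = 0; k < above; k++) {
            Elem e = open.pop();
            out.add("</" + e.name + ">");
            residual.push(e);
            parseErrors++; // every push onto the residual style stack is an error
        }
        out.add("</" + open.pop().name + ">");
    }
}
```

Feeding the events for <b><i>x</b>y</i> through this sketch gives
<b><i>x</i></b><i>y</i> with one parse error recorded, and a stray </b>
directly inside a <p> is dropped. End-of-input handling for elements that
are still open is not covered.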
Perhaps this model is a simple enough model to be deterministically
specified but still good enough an approximation of Gecko's and
WebCore's behavior. All decisions are local to the parse event being
observed and do not involve reshuffling the parts of the DOM that
have already been built.
* http://hsivonen.iki.fi/validator-about/htmlparser.jar (with source)
hsivonen at iki.fi