[whatwg] Stability of tokenizing/dom algorithms

Sun Dec 14 13:37:40 PST 2008

Hello all,

I am curious how stable and complete HTML5's tokenizing and DOM
algorithms are (specifically section 8). A cursory glance through the
section reveals a few red warning boxes, but these largely concern
whether or not the specification should follow browser implementations,
rather than actual errors in the specification.

I ask because I am the author of a tool named HTML Purifier, which takes
user-submitted HTML and cleans it, both for standards compliance and to
remove XSS vectors. We insist that the output be standards-compliant,
because the result is then unambiguous.

As far as I can tell, this is quite unlike the tools that HTML5 is
aimed at: compliance checkers, user agents and data miners. There
certainly is overlap: we have our own parsing and DOM-building
algorithms, which work decently well, although they do trip up on a
number of edge cases (active formatting elements being one notable
example). However, using the HTML5 algorithm wholesale is not possible
for several reasons:

1. Users input HTML fragments, not complete HTML documents. Any parser I
use needs to be able to start parsing in a specific state, and has to
ignore any attempt by the input to exit that state (e.g. a </body> tag).

2. No one actually writes their HTML in HTML5 (yet), so the only parts of
the algorithm I want to use are the ones that emulate browser behavior
with HTML4. However, HTML5 interweaves its additions with the browser
research it has done.
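
To make point 1 concrete, here is a rough sketch of the behavior I mean.
It is written in Python on top of the standard library's html.parser, so
it is neither the HTML5 algorithm nor HTML Purifier's actual code. The
names FragmentFilter, filter_fragment and DOCUMENT_LEVEL_TAGS are made up
for this illustration. It only shows the idea of staying "in body": tags
that would open or close the document wrapper are dropped, and everything
else is re-emitted.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical list of tags that would move a parser out of the body.
DOCUMENT_LEVEL_TAGS = {"html", "head", "body"}


class FragmentFilter(HTMLParser):
    """Re-emit a fragment, ignoring tags that try to leave the body."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _render_attrs(self, attrs):
        # Attribute values arrive unescaped, so escape them again.
        return "".join(
            f" {name}" if value is None
            else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )

    def handle_starttag(self, tag, attrs):
        if tag not in DOCUMENT_LEVEL_TAGS:
            self.parts.append(f"<{tag}{self._render_attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        if tag not in DOCUMENT_LEVEL_TAGS:
            self.parts.append(f"<{tag}{self._render_attrs(attrs)} />")

    def handle_endtag(self, tag):
        # A stray </body> or </html> is silently ignored.
        if tag not in DOCUMENT_LEVEL_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape.
        self.parts.append(escape(data, quote=False))


def filter_fragment(fragment):
    parser = FragmentFilter()
    parser.feed(fragment)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


print(filter_fragment("<p>Hello</p></body><p>more</p>"))
```

This does no sanitizing, silently drops comments and doctypes, and does
not repair misnested tags; a real implementation would need the tree
construction rules for that.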

I'd be really interested to hear what you all have to say about this
matter. Thanks!

Cheers,
Edward