[whatwg] Stability of tokenizing/dom algorithms

Ian Hickson ian at hixie.ch
Sun Dec 14 20:41:07 PST 2008

On Sun, 14 Dec 2008, Edward Z. Yang wrote:
> I was curious to know how stable/complete HTML 5's tokenizing and DOM 
> algorithms are (specifically section 8).

Pretty stable. There are some known issues [1], and more issues will 
surely be found as implementations grow in usage, but the basic 
architecture is unlikely to change and the specifics are unlikely to 
change much. The only major pending change is adding SVG, but that will 
likely be done in a way similar to what is currently specified but 
commented out.

[1] Mostly listed here: http://www.whatwg.org/issues/#parsing

> The reason I'd like to know this is because I am the author of a tool 
> named HTML Purifier, which takes user-input HTML and cleans it for 
> standards-compliance as well as XSS. We insist on output being standards 
> compliant, because the result is unambiguous.
> As far as I can tell, this is quite unlike the tools that HTML5 is 
> aimed at: compliance checkers, user agents and data miners. There 
> certainly is overlap: we have our own parsing and DOM-building 
> algorithms which work decently well, although they do trip up on a 
> number of edge-cases (active formatting elements being one notable 
> example). However, using the HTML5 algorithm wholesale is not possible 
> for several reasons:
> 1. Users input HTML fragments, not actual HTML documents. A parser I 
> would use needs to be able to start parsing in a specific state, and has 
> to ignore any requests by the user to exit that state (e.g. a </body> 
> tag).

As Anne pointed out, we do have a section to handle that case (it's 
similar to innerHTML in browsers); if there's anything I can do to make 
those sections more helpful to you, please let me know.
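
A minimal sketch of the fragment case, for illustration only. It is not the 
spec's fragment algorithm: it uses Python's standard html.parser module, and 
the names FragmentFilter, clean_fragment and IGNORED_IN_FRAGMENT are 
hypothetical. It only shows the idea of treating input as body content and 
ignoring tags that would try to leave that state.

```python
import html
from html.parser import HTMLParser

# Tags that would move a full-document parser out of body content.
# Assumption for this sketch; the spec handles these cases in far more detail.
IGNORED_IN_FRAGMENT = {"html", "head", "body"}


class FragmentFilter(HTMLParser):
    """Re-emits a fragment, silently dropping document-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_IN_FRAGMENT:
            return  # the user cannot re-open <html>, <head> or <body>
        # Attributes are dropped to keep the sketch short.
        self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in IGNORED_IN_FRAGMENT:
            return  # the user cannot exit the body context with </body>
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references were decoded by the parser; escape again.
        self.out.append(html.escape(data, quote=False))


def clean_fragment(fragment: str) -> str:
    parser = FragmentFilter()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)
```

With this sketch, clean_fragment("<p>Hi</body> there</p>") returns 
"<p>Hi there</p>": the stray </body> is ignored rather than ending the body.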

> 2. No one actually codes their HTML in HTML5 (yet), so the only parts of 
> the algorithm I want to use are the ones that are emulating browser 
> behavior with HTML4. However, HTML5 interweaves its additions with the 
> browser research it has done.

In general you should be able to just implement what the spec says and 
then either leave the HTML5 support in (it's unlikely to cause any harm) 
or just comment out the support for the new elements; that should be 
relatively easy.
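
One way to picture "commenting out" the new elements is a single switch 
over the set of recognised element names. The sketch below is hypothetical: 
both element lists are illustrative and incomplete, and the names 
allowed_elements and is_allowed are assumptions, not part of any spec or of 
HTML Purifier.

```python
# Illustrative, incomplete lists; a real tool would derive them from the specs.
HTML4_ELEMENTS = frozenset(
    {"p", "div", "span", "a", "em", "strong", "ul", "ol", "li", "table"}
)
HTML5_NEW_ELEMENTS = frozenset(
    {"section", "article", "nav", "aside", "header", "footer",
     "video", "audio", "canvas"}
)


def allowed_elements(enable_html5: bool) -> frozenset:
    """Return the element names the parser should recognise."""
    if enable_html5:
        return HTML4_ELEMENTS | HTML5_NEW_ELEMENTS
    return HTML4_ELEMENTS


def is_allowed(tag: str, enable_html5: bool = False) -> bool:
    # Tag names are compared case-insensitively, as in HTML.
    return tag.lower() in allowed_elements(enable_html5)
```

Leaving enable_html5 at False corresponds to disabling the new elements; 
setting it to True corresponds to leaving the HTML5 support in.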

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
