[whatwg] Stability of tokenizing/dom algorithms
ian at hixie.ch
Sun Dec 14 20:41:07 PST 2008
On Sun, 14 Dec 2008, Edward Z. Yang wrote:
> I was curious to know how stable/complete HTML 5's tokenizing and DOM
> algorithms are (specifically section 8).
Pretty stable. There are some known issues, and more issues will
surely be found as implementations grow in usage, but the basic
architecture is unlikely to change and the specifics are unlikely to
change much. The only major pending change is adding SVG, but that will
likely be done in a way similar to what is currently specified but
Known issues are mostly listed here: http://www.whatwg.org/issues/#parsing
> The reason I'd like to know this is because I am the author of a tool
> named HTML Purifier, which takes user-input HTML and cleans it for
> standards-compliance as well as XSS. We insist on output being standards
> compliant, because the result is unambiguous.
> As far as I can tell, this is quite unlike the tools that HTML5 is
> geared towards: compliance checkers, user agents and data miners. There
> certainly is overlap: we have our own parsing and DOM-building
> algorithms which work decently well, although they do trip up on a
> number of edge-cases (active formatting elements being one notable
> example). However, using the HTML5 algorithm wholesale is not possible
> for several reasons:
> 1. Users input HTML fragments, not actual HTML documents. A parser I
> would use needs to be able to enter parsing in a specific state, and has
> to ignore any requests by the user to exit that state (i.e. a </body>
As Anne pointed out, we do have a section to handle that case (it's
similar to innerHTML in browsers); if there's anything I can do to make
those sections more helpful to you, please let me know.
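
To illustrate the idea of parsing a fragment inside a fixed context, here is a rough sketch. It is not the spec's fragment algorithm, only a toy built on Python's standard html.parser module. The names Node, FragmentParser and parse_fragment are made up for this example. The point it shows is that the context element sits at the bottom of the stack of open elements and is never popped, so a stray </body> in user input cannot leave the context.

```python
from html.parser import HTMLParser


class Node:
    """A minimal element node: tag name, attributes, and children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # Node instances or plain text strings


class FragmentParser(HTMLParser):
    """Hypothetical helper: parse a fragment inside a fixed context element.

    This is an illustration only, not the HTML5 tree construction algorithm.
    """

    # A few void elements, which never have children or end tags.
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self, context="body"):
        super().__init__(convert_charrefs=True)
        self.root = Node(context)
        self.stack = [self.root]  # stack[0] is the context; it is never popped

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in self.VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Search only elements opened by the fragment itself (index >= 1).
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                return
        # No match: the end tag (for example a stray </body>) is ignored.

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def parse_fragment(html, context="body"):
    """Parse `html` as the contents of a `context` element and return its root."""
    parser = FragmentParser(context)
    parser.feed(html)
    parser.close()
    return parser.root
```

Calling parse_fragment("<b>bold</b></body><i>after</i>") returns a root node whose children are the b and i elements; the </body> is dropped because it matches nothing the fragment opened. A real implementation would follow the spec's fragment case rather than this simplified stack handling.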
> 2. No one actually codes their HTML in HTML5 (yet), so the only parts of
> the algorithm I want to use are the ones that are emulating browser
> behavior with HTML4. However, HTML5 interweaves its additions with the
> browser research it has done.
In general you should be able to just implement what the spec says and
then either leave the HTML5 support in (it's unlikely to cause any harm)
or just comment out the support for the new elements, that should be
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.