[imps] [ANN] Complete tokenizer in C#

Fri Aug 24 13:50:47 PDT 2007

Hi all,

I spent the last week rewriting Twintsam's tokenizer from scratch,
exactly following the current draft's algorithm. Performance could
probably be improved a lot, but I'll first concentrate on the
tree-building stage.

For the record, Twintsam is written in C# for .NET 2.0, and it is
available at http://twintsam.googlecode.com

The tokenizer rewrite comes with a refactoring: the tokenization
stage used to be implemented entirely privately, inside a class aimed
at tokenizing *and* applying some tree-building rules (mainly
handling omitted or misnested tags). It is now in a public
HtmlTextTokenizer class, which derives from an HtmlTokenizer abstract
class. Another HtmlTokenizer-based class, the abstract
HtmlWrappingTokenizer, provides a framework for building "filters" (à
la "sanitizer.Filter" from HTML5Lib).
The HtmlReader will use an HtmlTokenizer as input and apply some of
the tree-building rules (see above); that's where I'll concentrate
now.
Another class will then implement the other tree-building rules while
actually building a tree of XmlNode objects.
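
To make the layering above concrete, here is a minimal sketch of how a
"filter" could sit on top of a wrapping tokenizer. Only the class
names HtmlTokenizer and HtmlWrappingTokenizer come from this mail;
the Token type, the Read()/Current members, ListTokenizer,
ScriptTagFilter and FilterDemo are assumptions made up for the
illustration and are not Twintsam's actual API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical token model; the real one is not described in this mail.
public enum TokenKind { StartTag, EndTag, Text }

public struct Token
{
    public TokenKind Kind;
    public string Value; // tag name for tags, character data for text
    public Token(TokenKind kind, string value) { Kind = kind; Value = value; }
}

// Class name from the announcement; the members are assumed.
public abstract class HtmlTokenizer
{
    public abstract bool Read();            // advance to the next token
    public abstract Token Current { get; }  // token at the current position
}

// Stand-in source that replays a fixed list instead of parsing text.
public sealed class ListTokenizer : HtmlTokenizer
{
    private readonly IList<Token> tokens;
    private int index = -1;
    public ListTokenizer(IList<Token> tokens) { this.tokens = tokens; }
    public override bool Read()
    {
        if (index + 1 >= tokens.Count) return false;
        index++;
        return true;
    }
    public override Token Current { get { return tokens[index]; } }
}

// Class name from the announcement: forwards to an inner tokenizer so
// that subclasses only override what they want to change.
public abstract class HtmlWrappingTokenizer : HtmlTokenizer
{
    private readonly HtmlTokenizer inner;
    protected HtmlWrappingTokenizer(HtmlTokenizer inner)
    {
        if (inner == null) throw new ArgumentNullException("inner");
        this.inner = inner;
    }
    protected HtmlTokenizer Inner { get { return inner; } }
    public override bool Read() { return inner.Read(); }
    public override Token Current { get { return inner.Current; } }
}

// Hypothetical filter: drops <script> start and end tags only. Text
// between them still passes through; a real sanitizer does far more.
public sealed class ScriptTagFilter : HtmlWrappingTokenizer
{
    public ScriptTagFilter(HtmlTokenizer inner) : base(inner) { }
    public override bool Read()
    {
        while (Inner.Read())
        {
            Token t = Inner.Current;
            bool isTag = t.Kind != TokenKind.Text;
            if (isTag && string.Equals(t.Value, "script",
                    StringComparison.OrdinalIgnoreCase))
                continue; // skip this token and look at the next one
            return true;
        }
        return false;
    }
}

public static class FilterDemo
{
    // Runs the filter over a small fixed token stream.
    public static List<string> Run()
    {
        List<Token> input = new List<Token>();
        input.Add(new Token(TokenKind.StartTag, "p"));
        input.Add(new Token(TokenKind.Text, "hi"));
        input.Add(new Token(TokenKind.StartTag, "script"));
        input.Add(new Token(TokenKind.Text, "x"));
        input.Add(new Token(TokenKind.EndTag, "script"));
        input.Add(new Token(TokenKind.EndTag, "p"));

        HtmlTokenizer tokenizer = new ScriptTagFilter(new ListTokenizer(input));
        List<string> output = new List<string>();
        while (tokenizer.Read())
            output.Add(tokenizer.Current.Kind + ":" + tokenizer.Current.Value);
        return output;
    }
}
```

Because a filter is itself an HtmlTokenizer in this sketch, filters
can be stacked, and a consumer such as the HtmlReader would not need
to know whether it reads from the text tokenizer directly or through
a chain of filters.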

The next steps will be to implement serializers (XmlWriter- and
HtmlWriter-based ones), the other HTML5 algorithms (encoding sniffing,
partly done already, and content-type sniffing) and the HTML5 DOM
(probably built on an extensible DOM layer that is itself based on the
core .NET XmlNode DOM and reusable outside the scope of HTML parsing).

-- 
Thomas Broyer