[whatwg] [Ann] A limping development version of an HTML5 parser in Java
hsivonen at iki.fi
Wed Jul 11 03:29:29 PDT 2007
There's now a limping development version of an HTML5 parser in Java
that interested parties may try out:
svn co http://svn.versiondude.net/whattf/htmlparser/trunk/ htmlparser
Warning: This isn't at all ready for any kind of production use. The
purpose of this email is just to let interested parties know the
status of the project.
Dependencies:
* ICU4J (compile time--needed at run time only if normalization
checking of source text is enabled)
* json-tools, which in turn depends on Antlr 2 (needed for compiling and
running the tokenizer test harness--not needed for normal use)
* The Apache XML serializer, a.k.a. serializer.jar, which comes with
Xerces and Xalan (needed for testing the associated tree model with XML)
Classes with main() methods are in the *.test packages.
Provide an HTML5 parser that works as a drop-in replacement for an
XML parser in non-browser Java apps that expect XML APIs. Make the
parser strict enough for conformance checking (including encoding
Known bugs:
* Foster parenting doesn't work.
* The JDK UTF-8 decoder leaves some bad byte sequences unreported.
* Probably lots more, especially in the tree builder.
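For context on the UTF-8 item above, here is a minimal sketch of how a caller can ask the JDK decoder to report malformed input instead of silently substituting U+FFFD. The class and method names (StrictUtf8, decode, isWellFormed) are made up for this illustration and are not part of the parser. This only shows the reporting mode; it does not by itself catch the sequences the JDK decoder lets through.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not taken from the parser's source tree.
public class StrictUtf8 {

    // Decodes the bytes as UTF-8 and throws on malformed input
    // rather than replacing it with U+FFFD.
    static String decode(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Returns true if the JDK decoder accepts the whole byte array.
    static boolean isWellFormed(byte[] bytes) {
        try {
            decode(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed(new byte[] {'o', 'k'}));    // true
        System.out.println(isWellFormed(new byte[] {(byte) 0xFF})); // false
    }
}
```

A conformance checker that needs every bad sequence reported would have to add its own checks on top of this, or use a different decoder.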
Roadmap:
* Test harness for html5lib tree construction test cases.
* Pass those tests.
* Buffered SAX as drop-in replacement for an XML parser.
* Streaming SAX (fatal errors on the adoption agency algorithm (AAA),
foster parenting, etc.) as drop-in replacement for an XML parser.
* DOM as drop-in replacement for an XML parser.
* XOM as drop-in replacement for an XML parser.
* Configurability regarding XML 1.0 infoset violations.
Later on roadmap:
* JDOM as drop-in replacement for an XML parser.
* Performance improvements.
(dom4j is not explicitly on the roadmap, because DOM support is
expected to work with dom4j.)
Doable but not on the roadmap:
* Buffered StAX.
Not doable within the architecture:
* True streaming StAX. (Use SAX instead.)
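To illustrate what "drop-in replacement for an XML parser" means for the SAX items above: application code written against org.xml.sax.XMLReader does not care which parser produces the events. The sketch below uses the JDK's own XML parser as a stand-in, since the HTML5 parser's SAX entry point isn't final yet; the class and method names (DropInSax, jdkReader, elementNames) are invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of the parser.
public class DropInSax {

    // Stand-in reader: the JDK's XML parser. A drop-in HTML5 parser
    // would be another XMLReader implementation passed in its place.
    static XMLReader jdkReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    // Application code: collects element names in document order.
    // It depends only on the XMLReader interface.
    static List<String> elementNames(XMLReader reader, String markup) {
        final List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(qName);
            }
        });
        try {
            reader.parse(new InputSource(new StringReader(markup)));
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        String doc = "<html><body><p>Hi</p></body></html>";
        System.out.println(elementNames(jdkReader(), doc)); // [html, body, p]
    }
}
```

With a drop-in HTML5 parser, only the jdkReader() line would change; elementNames() would stay as it is.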
License: MIT/expat. Patches welcome under the same license.
Thanks to the Mozilla Foundation for funding this project. Thanks to
the html5lib team and Philip Taylor (of lazyilluminati fame) for
test cases and bug reports.
hsivonen at iki.fi