[Imps] [ANN] Father Christmas is a bit early this year…
t.broyer at gmail.com
Wed Dec 27 03:13:02 PST 2006
2006/12/23, James Graham:
> Thomas Broyer wrote:
> > ANNOUNCING "Twintsam" – because The Web Is Not Tag Soup Any More
> > ================================================================
> > As I announced last week on the WHATWG list, I've started an HTML5
> > parser in C# (for .NET 2.0).
> > I finished the tokenizer implementation (coding "blind") yesterday
> > evening and spent some time today running the unit tests borrowed from
> > the html5lib project; and you know what? Apart from two or three typos,
> > they all passed!!!
Actually (please do not laugh!), I ran the tests with an incomplete
JSON->C# test code generator, so the unit tests only ran the tokenizer
and never compared the actual output with the expected one :-D
I worked on that today; several small fixes were needed in my
tokenization code, but now all tests pass.
Some are reported as errors for now because of how I'm "emitting"
tokens. For example, I always report parse errors before the token
currently being parsed is "emitted" (this makes "Unfinished comment"
fail), and I never "emit" a sequence of "Character" tokens (this makes
some tests on entities fail).
Unless someone has a better idea (e.g. changing the test suite), I'll
work on my JSON->C# test code generator to merge consecutive
"Character" tokens and reorder "ParseError" tokens.
> Be wary though that we have very few tokenizer-specific tests (in the
> current framework some things can only be checked through the parser
> tests, but that does not constitute the majority of what is
> missing). So, for example, I don't think the current tokenizer test suite
> checks that you are lowercasing tag names properly. I believe it is
> highly desirable for the tokenizer tests to, as far as possible, stand
> on their own, and this is something we are looking to improve, so please
> contribute any extra tests you can :)
I'll run the tests through a code coverage tool and contribute new tests.
> An example of something that, at present, can only be checked
> through a parser test is the proper tokenizing of a fragment like
Which is something within the spec that I still haven't understood…
but I haven't dug too deep into the tree-construction steps, so I might
be missing something…