[Imps] Test cases for parsing spec

Wed Dec 6 10:28:02 PST 2006

James Graham wrote:
> Sam Ruby wrote:
>> Ian Hickson wrote:
>>> On Wed, 6 Dec 2006, Sam Ruby wrote:
>>>> That being said, if Ian (or somebody) can come up with a small seed 
>>>> of test cases, I will try to convert them into a usable form and see 
>>>> if I can get html5lib working with it.
>>> I have a bunch of tests here, I just need a format to output the 
>>> tests into. It would take me a few minutes at most, once someone has 
>>> defined the exact format for the tests. They're not currently in a 
>>> usable form (outside Google, anyway).
>>
>> Negotiable.  What I work with now is:
>>
>> Line 1: "<!--"
>> Line 2: "Description: [1]"
>> Line 3: "Expect:      [2]"
>> Line 4: "-->"
>> Line 5+: HTML
>>
>> where [1] is human readable, and [2] is computer readable.  [2] will 
>> likely need to be adapted anyway, so don't worry too much.  Something 
>> language neutral or xpath-ish would be ideal.
>>
>> Example:
>>
>> <!--
>> Description: extraneous quotes
>> Expect:      html/body/a[@title="foo"]
>> -->
>> <html><body><a href="#"" title="foo"></body></html>
> 
> Something like that looks ideal for the parser/treebuilder. If we want 
> to test the tokeniser separately, it might be good to just have a list 
> of expected token types and properties:
> 
> <!--
> Description:
> Expect:
> StartTag html
> StartTag body
> StartTag a {'href':'#', 'title':'foo'}
> EndTag body
> EndTag html
> -->
> <html><body><a href="#"" title="foo"></body></html>
> 
> It means multi-line expect and might be overcomplex (ideally one could 
> define a single test and check the output from both phases)... what do 
> you think?

My gut feeling is that spelling out the complete sequence is tedious, 
error-prone, and fragile.  Should the tokenizer add an implicit "head"?  
I'm not suggesting that it should, but consider what havoc such a change 
would wreak on the test suite.

My guess is that typically, what you are looking for is a single token, 
or at most a short sequence.  A semicolon-separated list of tokens 
(generally only one) would suffice for most cases.

I'd also suggest ditching most of the noise characters:

   Expect: StartTag a href='#' title='foo'
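
To make that concrete, here is a rough sketch of how a test file in this 
shape might be read.  It assumes the comment header format quoted above 
and the stripped-down, semicolon-separated Expect line.  The function 
names (parse_test, parse_expect) are made up for illustration and are not 
part of html5lib, and splitting on ";" would break if an attribute value 
ever contained a semicolon.

```python
import shlex


def parse_test(text):
    """Split a test file into its header fields and the HTML under test.

    Assumes the layout discussed in this thread: "<!--" on line 1,
    "Key: value" header lines, "-->" on its own line, then the HTML.
    """
    lines = text.splitlines()
    if not lines or lines[0] != "<!--":
        raise ValueError("test must start with '<!--'")
    end = lines.index("-->")  # raises ValueError if the header is unclosed
    header = {}
    for line in lines[1:end]:
        key, _, value = line.partition(":")
        header[key.strip()] = value.strip()
    html = "\n".join(lines[end + 1:])
    return header, html


def parse_expect(expect):
    """Parse "StartTag a href='#' title='foo'; EndTag a" into a list of
    (token_type, name, attributes) tuples.

    Hypothetical grammar: tokens separated by ";", fields separated by
    whitespace, attributes written as name='value'.
    """
    tokens = []
    for part in expect.split(";"):
        fields = shlex.split(part)  # handles the quoted attribute values
        if not fields:
            continue
        token_type = fields[0]
        name = fields[1] if len(fields) > 1 else None
        attrs = dict(field.split("=", 1) for field in fields[2:])
        tokens.append((token_type, name, attrs))
    return tokens


SAMPLE = """<!--
Description: extraneous quotes
Expect:      StartTag a href='#' title='foo'
-->
<html><body><a href="#"" title="foo"></body></html>"""

header, html = parse_test(SAMPLE)
expected_tokens = parse_expect(header["Expect"])
```

A test runner would then only need to check that each entry in 
expected_tokens shows up, in order, somewhere in the tokenizer's output, 
rather than comparing the complete token stream.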

- Sam Ruby