[Imps] Liberal XML parsing

Mon Jan 8 09:23:40 PST 2007

Anne van Kesteren wrote:
> On Mon, 08 Jan 2007 17:42:49 +0100, Sam Ruby <rubys at intertwingly.net> 
> wrote:
>> The current tokenizer has ".lower()" sprinkled throughout and doesn't 
>> expose in any meaningful way the difference between empty and start tags.
> 
> Because there is no difference between them. See the HTML5 specification.

My point is that "baking" that behavior into the tokenizer essentially 
limits that tokenizer to supporting only HTML5.  By providing one extra 
"bit" of information, the potential for reuse is increased.

Of course, the html5parser will need to ignore this extra bit, and my 
patch includes that change.
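
To make the "extra bit" concrete, here is a minimal sketch.  It is a toy 
tokenizer, not html5lib's code or my patch; the regular expression, the 
"empty" key, and the function names are all assumptions made up for 
illustration.  It shows one consumer ignoring the flag (as an HTML5 parser 
would) and another using it.

```python
import re

# Toy pattern for illustration only: matches <name> and <name/>.
TAG_RE = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\s*(/?)>")


def tokenize(text):
    """Yield one token dict per tag, recording whether it ended in '/>'."""
    for match in TAG_RE.finditer(text):
        yield {
            "type": "StartTag",
            "name": match.group(1),
            "empty": match.group(2) == "/",  # the extra "bit" (hypothetical key)
        }


def html5_names(tokens):
    """HTML5-style consumer: empty and start tags are treated alike."""
    return [token["name"] for token in tokens]


def xml_events(tokens):
    """XML-style consumer: an empty tag also closes the element."""
    events = []
    for token in tokens:
        events.append(("start", token["name"]))
        if token["empty"]:
            events.append(("end", token["name"]))
    return events
```

The same token stream feeds both consumers; only the XML-style one looks 
at the flag.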

>> For the tokenizer to be meaningfully subclassed (and by that, I mean 
>> without requiring wholesale duplication of a number of methods), these 
>> behaviors would need to be factored out into separate methods that 
>> could be overridden.
> 
> You could subclass it and change processSolidusInTag. Instead of 
> throwing an atheist parse error you would change the type of token to be 
> "empty" or something.

From a maintenance point of view, that is suboptimal.  As 
processSolidusInTag changes, that maintenance would need to occur in two 
places.

> Not sure how to do the .lower() stuff. I kind of guessed the reason you 
> wanted to change that was because of a project like this :-)

I've provided one way: by refactoring it so that all the lowercasing of 
element names is done in exactly one place, and that the lowercasing of 
attribute names is also done in exactly one place.  That class can be 
subclassed to provide a different behavior.
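
A rough sketch of the shape of that refactoring follows.  The class and 
method names here are hypothetical, chosen for illustration, and are not 
the names used in html5lib or in the patch.  The point is only that each 
kind of lowercasing lives in one overridable method.

```python
class Tokenizer:
    """Hypothetical tokenizer skeleton; names are illustrative only."""

    def normalize_element_name(self, name):
        # The one place where element names are lowercased.
        return name.lower()

    def normalize_attribute_name(self, name):
        # The one place where attribute names are lowercased.
        return name.lower()

    def make_start_tag(self, name, attributes):
        # Every token goes through the two hooks above.
        return {
            "type": "StartTag",
            "name": self.normalize_element_name(name),
            "data": {
                self.normalize_attribute_name(key): value
                for key, value in attributes.items()
            },
        }


class CasePreservingTokenizer(Tokenizer):
    """XML-style variant: names are case-sensitive, so leave them alone."""

    def normalize_element_name(self, name):
        return name

    def normalize_attribute_name(self, name):
        return name
```

With this shape, the subclass overrides two one-line methods and 
duplicates none of the tokenizing logic.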

  - - -

It is no secret that my interest in the WHATWG started with a 
dissatisfaction with Python's sgmllib, particularly when used as a 
foundation for parsing HTML and XHTML, or as a fallback parser for XML. 
What I see in html5lib is a *much* better foundation.

I'm in no particular rush, but if after a few days it turns out that 
people are OK with something *like* this going into the html5lib 
repository, I'd love to put it in there -- at which point it would be 
free to evolve, be renamed, refactored, and enhanced.  One thing I would 
love to work on is a true DOM builder (at which point, I could throw 
away my XMLDocument, XMLElement, and XMLComment classes), but I would 
need changes to TreeBuilder so that I could provide my own Text class 
(for example).

Needless to say, such a treebuilder could also be used with HTML5.
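
As a sketch of the kind of TreeBuilder change I have in mind: the node 
classes could be class attributes that a subclass swaps out.  Everything 
below is hypothetical (the attribute names, the methods, and the node 
classes); it is not the current html5lib TreeBuilder interface.

```python
class Text:
    """Default text node."""

    def __init__(self, data):
        self.data = data


class Element:
    """Default element node."""

    def __init__(self, name):
        self.name = name
        self.children = []


class TreeBuilder:
    # Hypothetical hooks: subclasses replace these to supply their own nodes.
    element_class = Element
    text_class = Text

    def __init__(self):
        self.root = self.element_class("#document")
        self.open_elements = [self.root]

    def start(self, name):
        # Create a child of the current element and make it current.
        node = self.element_class(name)
        self.open_elements[-1].children.append(node)
        self.open_elements.append(node)

    def end(self):
        # Close the current element.
        self.open_elements.pop()

    def text(self, data):
        # Text nodes are created through the overridable hook.
        self.open_elements[-1].children.append(self.text_class(data))


class MyText(Text):
    """A caller-supplied text class, standing in for a DOM Text node."""


class MyTreeBuilder(TreeBuilder):
    text_class = MyText
```

A builder like this does not care whether the tokens came from an HTML5 
tokenizer or an XML-flavored one.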

Once this stabilizes, I would then plan to look at having the UFP take 
advantage of this library, if it is installed/available.  I'd also 
modify Venus, but such support would not need to be conditional there: 
Venus could simply include html5lib.

- Sam Ruby