[Imps] Liberal XML parsing
rubys at intertwingly.net
Mon Jan 8 09:23:40 PST 2007
Anne van Kesteren wrote:
> On Mon, 08 Jan 2007 17:42:49 +0100, Sam Ruby <rubys at intertwingly.net>
>> The current tokenizer has ".lower()" sprinkled throughout and doesn't
>> expose in any meaningful way the difference between empty and start tags.
> Because there is no difference between them. See the HTML5 specification.
My point is that "baking in" that behavior essentially limits the
tokenizer to supporting just HTML5. By providing one extra "bit" of
information, the potential for reuse is increased.
Of course, the html5parser will need to ignore this extra bit, and my
patch includes that change.
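To make the "extra bit" concrete, here is a minimal sketch of the idea. It is not the patch itself: the function names and the "EmptyTag" token type are assumptions made up for illustration, and html5lib's real token format may differ. The tokenizer records whether a tag was written in the empty form, an HTML5 consumer discards that information, and an XML consumer acts on it.

```python
# Hypothetical sketch; these names are illustrative, not html5lib's API.

def tokenize_tag(name, self_closing):
    """Emit a tag token that records whether the tag was written as <x/>."""
    token_type = "EmptyTag" if self_closing else "StartTag"
    return {"type": token_type, "name": name, "data": []}


def html5_view(token):
    """An HTML5 consumer ignores the extra bit: an empty tag is a start tag."""
    if token["type"] == "EmptyTag":
        return dict(token, type="StartTag")
    return token


def xml_view(token):
    """An XML consumer keeps the distinction and closes empty elements."""
    if token["type"] == "EmptyTag":
        return [dict(token, type="StartTag"),
                {"type": "EndTag", "name": token["name"]}]
    return [token]
```

The tokenizer stays the same for both consumers; only the code reading the tokens decides whether the distinction matters.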
>> For the tokenizer to be meaningfully subclassed (and by that, I mean
>> without requiring wholesale duplication of a number of methods), these
>> behaviors would need to be factored out into separate methods that
>> could be overridden.
> You could subclass it and change processSolidusInTag. Instead of
> throwing an atheist parse error you would change the type of token to be
> "empty" or something.
From a maintenance point of view, that is suboptimal. As
processSolidusInTag changes, that maintenance would need to occur in two
places.
> Not sure how to do the .lower() stuff. I kind of guessed the reason you
> wanted to change that was because of a project like this :-)
I've provided one way: by refactoring it so that all the lowercasing of
element names is done in exactly one place, and that the lowercasing of
attribute names is also done in exactly one place. That class can be
subclassed to provide a different behavior.
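The shape of that refactoring might look roughly like the sketch below. The class and method names are hypothetical and chosen only for illustration; they are not the names used in html5lib or in the patch. The point is that each kind of lowercasing lives in exactly one overridable method, so a subclass changes two small methods rather than duplicating tokenizer states.

```python
# Hypothetical sketch; class and method names are illustrative only.

class Tokenizer:
    """Toy tokenizer: all name normalization goes through two hooks."""

    def element_name(self, name):
        # HTML5 behavior: element names are case-insensitive.
        return name.lower()

    def attribute_name(self, name):
        # HTML5 behavior: attribute names are case-insensitive.
        return name.lower()

    def start_tag(self, name, attributes):
        """Build a start tag token from a raw name and (name, value) pairs."""
        return {
            "type": "StartTag",
            "name": self.element_name(name),
            "data": [(self.attribute_name(k), v) for k, v in attributes],
        }


class XMLTokenizer(Tokenizer):
    """XML names are case-sensitive, so both hooks leave names untouched."""

    def element_name(self, name):
        return name

    def attribute_name(self, name):
        return name
```

With this arrangement, Tokenizer().start_tag("DIV", [("ID", "x")]) yields lowercased names, while XMLTokenizer() given the same input preserves the original case.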
- - -
It is no secret that my interest in the WHATWG started with a
dissatisfaction with Python's sgmllib, particularly when used as a
foundation for parsing HTML, XHTML, or as a fallback parser for XML.
What I see in html5lib is a *much* better foundation.
I'm in no particular rush, but if after a few days it turns out that
people are OK with something *like* this going into the html5lib
repository, I'd love to put it in there -- at which point it would be
free to evolve, be renamed, refactored, and enhanced. One thing I would
love to work on is a true DOM builder (at which point, I could throw
away my XMLDocument, XMLElement, and XMLComment classes), but I would
need changes to TreeBuilder so that I could provide my own Text class.
Needless to say, such a treebuilder could also be used with HTML5.
Once this stabilizes, I would then plan to look at having the UFP take
advantage of this library, if it is installed/available. I'd also
modify Venus, but such support would not need to be conditional there:
Venus could simply include html5lib.
- Sam Ruby