[Imps] Liberal XML parsing

Mon Jan 8 17:15:38 PST 2007

Anne van Kesteren wrote:
> 
>>> Currently html5lib maps rather well to the specificaction which 
>>> improves the readability of the code a lot (imho). I'd like to know 
>>> at how many changes we're looking and how that impacts the code.
>>
>> That's why I provided a comprehensive patch:
>>
>>    http://intertwingly.net/stories/2007/01/08/xhtml5.diff
> 
> Instead of using string.ascii_uppercase you should use our internal 
> asciiUppercase. Also, instead of using a dict for translating can't you 
> just provide two strings? I'd think that would be faster.

I don't understand the suggestion to use the internal asciiUppercase - 
with my patch, this constant is no longer used.  And my constant was 
defined in the src/constants.py file...

I also don't understand the suggestion to "just provide two strings". 
That's not how Python's unicode.translate() method works.

> The normalizeToken method should be inlined as you only want to do that 
> from a single place anyway. And EndTag should use the translate method 
> and not .lower().

While it is true that normalizeToken is only called from one place, this 
method can't be inlined as the liberal XML parser subclass needs to 
override this behavior.

> I suppose these changes also remove the need for asciiLowercase (not 
> asciiLower that you introduce) as defined in constants.py.

asciiLowercase is still used in the portion of the logic dealing with 
DocTypes.  But having two similarly named constants with quite different 
purposes is confusing, and clearly *that* should be changed.

> Anyway, with these nits (open for debate) I think I'm ok with doing this 
> assuming you will update the tests as well (or someone else will). I'd 
> like to have a liberal XML parser too one day and working on an 
> experimental implementation of one can't hurt I suppose :-)

In case you didn't notice it, here are the tests:

http://intertwingly.net/stories/2007/01/08/tests/test_xhtml.py

> If xhtml5parser.py is the only other file I would be fine with adding 
> that to src/ as liberalxmlparser.py. Bit of a lengthty name, but it more 
> accurately reflects what it is.

I'm not worried about the the name.  That name is fine.

I'll look into committing this tomorrow, with your proposed module name, 
with the unit tests, and with some subset of these nits addressed.  I'll 
add comments at the top of the module indicating that this support is 
experimental and subject to change and even removal at any time.

- Sam Ruby