[Imps] Liberal XML parsing

Mon Jan 8 17:27:49 PST 2007

On Tue, 09 Jan 2007 02:15:38 +0100, Sam Ruby <rubys at intertwingly.net>  
wrote:
>>>> Currently html5lib maps rather well to the specificaction which  
>>>> improves the readability of the code a lot (imho). I'd like to know  
>>>> at how many changes we're looking and how that impacts the code.
>>>
>>> That's why I provided a comprehensive patch:
>>>
>>>    http://intertwingly.net/stories/2007/01/08/xhtml5.diff
>>  Instead of using string.ascii_uppercase you should use our internal  
>> asciiUppercase. Also, instead of using a dict for translating can't you  
>> just provide two strings? I'd think that would be faster.
>
> I don't understand the suggestion to use the internal asciiUppercase -  
> with my patch, this constant is no longer used.  And my constant was  
> defined in the src/constants.py file...

But you haven't removed the constant either. But as you later note it's  
still used somewhere else...

> I also don't understand the suggestion to "just provide two strings".  
> That's not how Python's unicode.translate() method works.

Oh, sorry about that. I vaguely recalled translate() from  
http://diveintopython.org/performance_tuning/dictionary_lookups.html and  
apparently it works slightly different from what I remembered. Should we  
use string.maketrans?

>> The normalizeToken method should be inlined as you only want to do that  
>> from a single place anyway. And EndTag should use the translate method  
>> and not .lower().
>
> While it is true that normalizeToken is only called from one place, this  
> method can't be inlined as the liberal XML parser subclass needs to  
> override this behavior.

Hmm, not so nice. For a large page that's a lot of additional method  
calls. Can you redo it a bit making sure we don't make that call for all  
non tag tokens at least, such as characters.

>> I suppose these changes also remove the need for asciiLowercase (not  
>> asciiLower that you introduce) as defined in constants.py.
>
> asciiLowercase is still used in the portion of the logic dealing with  
> DocTypes.  But having two similarly named constants with quite different  
> purposes is confusing, and clearly *that* should be changed.

Yeah.

>> Anyway, with these nits (open for debate) I think I'm ok with doing  
>> this assuming you will update the tests as well (or someone else will).  
>> I'd like to have a liberal XML parser too one day and working on an  
>> experimental implementation of one can't hurt I suppose :-)
>
> In case you didn't notice it, here are the tests:
>
> http://intertwingly.net/stories/2007/01/08/tests/test_xhtml.py

I noticed those, but you also had some comments on updating tests.

>> If xhtml5parser.py is the only other file I would be fine with adding  
>> that to src/ as liberalxmlparser.py. Bit of a lengthty name, but it  
>> more accurately reflects what it is.
>
> I'm not worried about the the name.  That name is fine.
>
> I'll look into committing this tomorrow, with your proposed module name,  
> with the unit tests, and with some subset of these nits addressed.  I'll  
> add comments at the top of the module indicating that this support is  
> experimental and subject to change and even removal at any time.

Ok, cool.

-- 
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>