[Imps] Liberal XML parsing

Mon Jan 8 14:44:40 PST 2007

On Mon, 08 Jan 2007 19:46:27 +0100, Sam Ruby <rubys at intertwingly.net>  
wrote:
>> Well, the next "bit" would probably be processing instructions. That's  
>> why it would be nice to have some formalization / standardization first  
>> to see how many changes are required exactly.
>
> I have no interest in XML processing instructions at this time.

Fair enough. But if this is becoming the foundation of an (experimental)  
liberal XML parser, we'll have an interest in due course, I reckon. If only  
for <?xbl?> and <?xml-stylesheet?>.

>> Currently html5lib maps rather well to the specification which  
>> improves the readability of the code a lot (imho). I'd like to know at  
>> how many changes we're looking and how that impacts the code.
>
> That's why I provided a comprehensive patch:
>
>    http://intertwingly.net/stories/2007/01/08/xhtml5.diff

Instead of using string.ascii_uppercase you should use our internal  
asciiUppercase. Also, instead of using a dict for translating can't you  
just provide two strings? I'd think that would be faster.
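
To make the two-strings suggestion concrete, here is a minimal sketch in
present-day Python. The names are stand-ins defined locally for
illustration, not html5lib's actual constants, and code from 2007 would
have used string.maketrans rather than str.maketrans. Whether the
two-string form is really faster is something to measure, not assume.

```python
import string

# Hypothetical stand-ins for the uppercase/lowercase constants in constants.py.
ascii_uppercase_chars = string.ascii_uppercase
ascii_lowercase_chars = string.ascii_lowercase

# Variant 1: a translation table written out as a dict of code points.
dict_table = {
    ord(upper): ord(lower)
    for upper, lower in zip(ascii_uppercase_chars, ascii_lowercase_chars)
}

# Variant 2: the same table built from two equal-length strings.
string_table = str.maketrans(ascii_uppercase_chars, ascii_lowercase_chars)


def lower_with(table, name):
    """Apply a translation table to a tag or attribute name."""
    return name.translate(table)
```

Both tables can be passed to translate() in the same way, so switching
from one form to the other should not require changes at the call site.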

The normalizeToken method should be inlined as you only want to do that  
from a single place anyway. And EndTag should use the translate method and  
not .lower().
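
One plausible reason to prefer translate over .lower() for tag names,
sketched in present-day Python: an ASCII-only table touches nothing but
A-Z, whereas .lower() also folds non-ASCII characters. The function name
normalize_tag_name is hypothetical and is not taken from the patch.

```python
import string

# ASCII-only table: maps A-Z to a-z and nothing else.
ASCII_UPPER_TO_LOWER = str.maketrans(string.ascii_uppercase,
                                     string.ascii_lowercase)


def normalize_tag_name(name):
    """Lowercase only A-Z; every other character is left untouched."""
    return name.translate(ASCII_UPPER_TO_LOWER)


kelvin_sign = "\u212a"  # KELVIN SIGN, a non-ASCII character that looks like "K"
name = "DIV" + kelvin_sign

translated = normalize_tag_name(name)  # the Kelvin sign is preserved
lowered = name.lower()                 # the Kelvin sign becomes ASCII "k"
```

With the table, start tags and end tags are normalized by exactly the
same rule, so the two cannot drift apart on unusual input.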

I suppose these changes also remove the need for asciiLowercase (not  
asciiLower that you introduce) as defined in constants.py.

Anyway, with these nits (open for debate) I think I'm ok with doing this  
assuming you will update the tests as well (or someone else will). I'd  
like to have a liberal XML parser too one day and working on an  
experimental implementation of one can't hurt I suppose :-)

If xhtml5parser.py is the only other file I would be fine with adding that  
to src/ as liberalxmlparser.py. Bit of a lengthy name, but it more  
accurately reflects what it is.

-- 
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>