[Imps] Liberal XML parsing

Mon Jan 8 10:46:27 PST 2007

Anne van Kesteren wrote:
> On Mon, 08 Jan 2007 18:23:40 +0100, Sam Ruby <rubys at intertwingly.net> 
> wrote:
>>>  Because there is no difference between them. See the HTML5 
>>> specification.
>>
>> My point is that by "baking in" that behavior into the tokenizer, it 
>> essentially limits that tokenizer to just supporting HTML5.  By 
>> providing one extra "bit" of information, the potential for reuse is 
>> increased.
> 
> Well, the next "bit" would probably be processing instructions. That's 
> why it would be nice to have some formalization / standardization first 
> to see how many changes are required exactly.

I have no interest in XML processing instructions at this time.

> Currently html5lib maps rather well to the specification which improves 
> the readability of the code a lot (imho). I'd like to know at how many 
> changes we're looking and how that impacts the code.

That's why I provided a comprehensive patch:

   http://intertwingly.net/stories/2007/01/08/xhtml5.diff

>>> Not sure how to do the .lower() stuff. I kind of guessed the reason 
>>> you wanted to change that was because of a project like this :-)
>>
>> I've provided one way: by refactoring it so that all the lowercasing 
>> of element names is done in exactly one place, and that the 
>> lowercasing of attribute names is also done in exactly one place.  
>> That class can be subclassed to provide a different behavior.
> 
> Do you have this as a standalone patch somewhere? As mentioned before, I'd 
> like to see how it deals with non-ASCII characters.

The patch isn't all that big.  The relevant portions are:

   asciiLower = dict([(ord(c), ord(c.lower()))
       for c in string.ascii_uppercase])

   token["name"] = token["name"].translate(asciiLower)

   token["data"] = dict([(attr.translate(asciiLower), value)
       for attr,value in token["data"][::-1]])
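
To make the non-ASCII question concrete, here is a minimal standalone
sketch of the same idea, written for Python 3. It is not the patch itself:
the class and method names (TokenNormalizer, lower_name,
CasePreservingNormalizer) are hypothetical and exist only for
illustration. The translation table covers A-Z only, so characters outside
that range pass through unchanged. The reversed attribute list is assumed
here to be about letting the first occurrence of a duplicate attribute win.

```python
import string

# Translation table covering only A-Z -> a-z; every other code point
# is absent from the table and is therefore left untouched.
ASCII_LOWER = {ord(c): ord(c.lower()) for c in string.ascii_uppercase}


class TokenNormalizer:
    """Hypothetical helper: all name lowercasing goes through one hook."""

    def lower_name(self, name):
        # ASCII-only lowercasing, unlike str.lower(), which is Unicode-aware.
        return name.translate(ASCII_LOWER)

    def normalize(self, token):
        # Work on a copy so the caller's token is not mutated.
        token = dict(token)
        token["name"] = self.lower_name(token["name"])
        # Iterating in reverse means earlier attributes overwrite later
        # ones, so the first occurrence of a duplicate name is kept
        # (assumed to be the intent of the [::-1] in the excerpt above).
        token["data"] = dict(
            (self.lower_name(attr), value)
            for attr, value in token["data"][::-1]
        )
        return token


class CasePreservingNormalizer(TokenNormalizer):
    """Hypothetical subclass for a case-sensitive (XML-like) consumer."""

    def lower_name(self, name):
        return name


if __name__ == "__main__":
    html = TokenNormalizer()
    print(html.normalize({"name": "DIV",
                          "data": [("HREF", "a"), ("href", "b")]}))
    # U+212A KELVIN SIGN: str.lower() folds it to ASCII "k",
    # while the ASCII-only table leaves it alone.
    print(repr(html.lower_name("\u212a")), repr("\u212a".lower()))
```

The design choice is that the tokenizer never calls .lower() directly, so
a subclass can swap the behavior by overriding a single method.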

- Sam Ruby