[Imps] Liberal XML parsing
Sam Ruby
rubys at intertwingly.net
Mon Jan 8 10:46:27 PST 2007
Anne van Kesteren wrote:
> On Mon, 08 Jan 2007 18:23:40 +0100, Sam Ruby <rubys at intertwingly.net>
> wrote:
>>> Because there is no difference between them. See the HTML5
>>> specification.
>>
>> My point is that by "baking in" that behavior into the tokenizer, it
>> essentially limits that tokenizer to just supporting HTML5. By
>> providing one extra "bit" of information, the potential for reuse is
>> increased.
>
> Well, the next "bit" would probably be processing instructions. That's
> why it would be nice to have some formalization / standardization first
> to see how many changes are required exactly.
I have no interest in XML processing instructions at this time.
> Currently html5lib maps rather well to the specification which improves
> the readability of the code a lot (imho). I'd like to know at how many
> changes we're looking and how that impacts the code.
That's why I provided a comprehensive patch:
http://intertwingly.net/stories/2007/01/08/xhtml5.diff
>>> Not sure how to do the .lower() stuff. I kind of guessed the reason
>>> you wanted to change that was because of a project like this :-)
>>
>> I've provided one way: by refactoring it so that all the lowercasing
>> of element names is done in exactly one place, and that the
>> lowercasing of attribute names is also done in exactly one place.
>> That class can be subclassed to provide a different behavior.
>
> Do you have this as a standalone patch somewhere? As mentioned before, I'd
> like to see how it deals with non-ASCII characters.
The patch isn't all that big. The relevant portions are:
    asciiLower = dict([(ord(c),ord(c.lower())) for c in
                       string.ascii_uppercase])

    token["name"] = token["name"].translate(asciiLower)

    token["data"] = dict([(attr.translate(asciiLower), value)
                          for attr,value in token["data"][::-1]])
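
On the non-ASCII question, here is a minimal, self-contained sketch of how a translate table like the one above behaves. It is written for current Python 3 rather than taken from the patch, and the normalize_token helper and the sample token are hypothetical names used only for illustration. Only A-Z are mapped, so any non-ASCII character passes through unchanged, and iterating the attribute list in reverse means the first occurrence of a duplicated attribute name is the one that survives.

```python
import string

# Translate table keyed by code point: maps only A-Z to a-z.
ASCII_LOWER = {ord(c): ord(c.lower()) for c in string.ascii_uppercase}


def normalize_token(token):
    """Hypothetical helper: lowercase element and attribute names, ASCII only."""
    name = token["name"].translate(ASCII_LOWER)
    # Walk the attribute list backwards so that, when two names collide
    # after lowercasing, the earliest attribute overwrites the later ones.
    data = dict(
        (attr.translate(ASCII_LOWER), value)
        for attr, value in token["data"][::-1]
    )
    return {"name": name, "data": data}


# Made-up sample token; "\u00c9" is LATIN CAPITAL LETTER E WITH ACUTE.
token = {
    "name": "DIV",
    "data": [("CLASS", "a"), ("class", "b"), ("\u00c9TAT", "x")],
}
result = normalize_token(token)

# The ASCII letters are lowercased; the accented capital is left alone.
ascii_only = "\u00c9TAT".translate(ASCII_LOWER)
# A plain .lower() would also lowercase the accented capital.
full_lower = "\u00c9TAT".lower()
```

Here result["data"] keeps "a" for class, and ascii_only still starts with the capital accented letter while full_lower does not. That difference is the reason for using a translate table instead of calling .lower() directly.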
- Sam Ruby