[Imps] Liberal XML parsing
Anne van Kesteren
annevk at opera.com
Mon Jan 8 17:27:49 PST 2007
On Tue, 09 Jan 2007 02:15:38 +0100, Sam Ruby <rubys at intertwingly.net>
wrote:
>>>> Currently html5lib maps rather well to the specificaction which
>>>> improves the readability of the code a lot (imho). I'd like to know
>>>> at how many changes we're looking and how that impacts the code.
>>>
>>> That's why I provided a comprehensive patch:
>>>
>>> http://intertwingly.net/stories/2007/01/08/xhtml5.diff
>> Instead of using string.ascii_uppercase you should use our internal
>> asciiUppercase. Also, instead of using a dict for translating can't you
>> just provide two strings? I'd think that would be faster.
>
> I don't understand the suggestion to use the internal asciiUppercase -
> with my patch, this constant is no longer used. And my constant was
> defined in the src/constants.py file...
But you haven't removed the constant either. But as you later note it's
still used somewhere else...
> I also don't understand the suggestion to "just provide two strings".
> That's not how Python's unicode.translate() method works.
Oh, sorry about that. I vaguely recalled translate() from
http://diveintopython.org/performance_tuning/dictionary_lookups.html and
apparently it works slightly different from what I remembered. Should we
use string.maketrans?
>> The normalizeToken method should be inlined as you only want to do that
>> from a single place anyway. And EndTag should use the translate method
>> and not .lower().
>
> While it is true that normalizeToken is only called from one place, this
> method can't be inlined as the liberal XML parser subclass needs to
> override this behavior.
Hmm, not so nice. For a large page that's a lot of additional method
calls. Can you redo it a bit making sure we don't make that call for all
non tag tokens at least, such as characters.
>> I suppose these changes also remove the need for asciiLowercase (not
>> asciiLower that you introduce) as defined in constants.py.
>
> asciiLowercase is still used in the portion of the logic dealing with
> DocTypes. But having two similarly named constants with quite different
> purposes is confusing, and clearly *that* should be changed.
Yeah.
>> Anyway, with these nits (open for debate) I think I'm ok with doing
>> this assuming you will update the tests as well (or someone else will).
>> I'd like to have a liberal XML parser too one day and working on an
>> experimental implementation of one can't hurt I suppose :-)
>
> In case you didn't notice it, here are the tests:
>
> http://intertwingly.net/stories/2007/01/08/tests/test_xhtml.py
I noticed those, but you also had some comments on updating tests.
>> If xhtml5parser.py is the only other file I would be fine with adding
>> that to src/ as liberalxmlparser.py. Bit of a lengthty name, but it
>> more accurately reflects what it is.
>
> I'm not worried about the the name. That name is fine.
>
> I'll look into committing this tomorrow, with your proposed module name,
> with the unit tests, and with some subset of these nits addressed. I'll
> add comments at the top of the module indicating that this support is
> experimental and subject to change and even removal at any time.
Ok, cool.
--
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>
More information about the Implementors
mailing list