From jg307 at cam.ac.uk Wed Jan 3 08:07:30 2007 From: jg307 at cam.ac.uk (James Graham) Date: Wed, 03 Jan 2007 16:07:30 +0000 Subject: [Imps] Adding "content model flags" to tokenization tests In-Reply-To: References: Message-ID: <459BD4C2.3060100@cam.ac.uk> Thomas Broyer wrote: > 2006/12/28, Thomas Broyer: >> 2006/12/23, James Graham: >> > >> > [1] An example of something that, at present can only be checked >> > through a parser test is the proper tokenizing of a fragment like >> > <head>&body; >> >> How about adding a new "parameter" to tests to set the initial >> "content model flag" (defaulting to "PCDATA" if not present)? > > I've finally created some test cases (attached) with a > "contentModelFlags" property whose value is a list of "content model > flag"s. The test case is then run successively with the same input and > expected output but initialized with a different "content model flag". > If the property is not given, it defaults to ["PCDATA"] (a list with a > single value "PCDATA"). Thanks; I've added these to the html5lib svn repository and updated our test framework to run the new tests. -- "The universe doesn't care what you believe. The wonderful thing about science is that it doesn't ask for your faith, it just asks for your eyes" --- http://xkcd.com/c154.html From ian at hixie.ch Wed Jan 3 15:07:20 2007 From: ian at hixie.ch (Ian Hickson) Date: Wed, 3 Jan 2007 23:07:20 +0000 (UTC) Subject: [Imps] Reasonable limits on buffered values In-Reply-To: <BAY109-F26C90CA07674EF007E2364B4C60@phx.gbl> References: <BAY109-F26C90CA07674EF007E2364B4C60@phx.gbl> Message-ID: <Pine.LNX.4.62.0701032249260.4611@dhalsim.dreamhost.com> On Fri, 29 Dec 2006, Simon Pieters wrote: > > From: Henri Sivonen <hsivonen at iki.fi> > >I'm wondering if there's a best practice here. Is there data on how > >long non-malicious attribute values legitimately appear on the Web? I'll see if I can get some data. (No ETA.)
> Additionally, .NET applications can have long attribute values too. See > "Figure 3. Simple page LessViewState.aspx with DataGrid1" at > > http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnaspnet/html/asp11222001.asp > > That's 3.05 KiB, but can get a lot longer depending on the number of > form controls, I think. I myself have written pages with significantly longer href="" attributes, e.g. when using long data: URIs of big images. The problem is that whatever limit you set, you'll always find a legitimate document that's bigger. It sounds stupid but the best practice really is to not have explicit limits, but instead to have algorithms that can handle any volume of input without exploding. It might be best, in fact, to limit CPU and memory usage, rather than attempting to limit input buffers. ("This page would take too many resources to handle.") That actually handles the billion laughs problem without having to special case anything to do with it. -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' From rubys at intertwingly.net Mon Jan 8 07:34:55 2007 From: rubys at intertwingly.net (Sam Ruby) Date: Mon, 08 Jan 2007 10:34:55 -0500 Subject: [Imps] Liberal XML parsing Message-ID: <45A2649F.2020308@intertwingly.net> I've posted a note on how the code in html5lib could serve as an excellent foundation for a number of "liberal" XML parsing tasks: http://www.intertwingly.net/blog/2007/01/08/Xhtml5lib Personally, I'm not overly interested in hearing more opinions as to whether or not there is a valid demand for liberal XML parsing. If you don't want to use it, don't. What I WOULD be interested in hearing opinions on is what would be the best way to maintain this code going forward: could it live as a separate module within html5lib repository? Should it be a separate repository? 
If separate, are there some changes to the tokenizer in particular that could be made that would either directly enable this usage or would make it easier to monkey-patch for usage by xhtml5lib? - Sam Ruby From annevk at opera.com Mon Jan 8 08:29:56 2007 From: annevk at opera.com (Anne van Kesteren) Date: Mon, 08 Jan 2007 17:29:56 +0100 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A2649F.2020308@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> Message-ID: <op.tluf36by64w2qv@id-c0020> On Mon, 08 Jan 2007 16:34:55 +0100, Sam Ruby <rubys at intertwingly.net> wrote: > I've posted a note on how the code in html5lib could serve as an > excellent foundation for a number of "liberal" XML parsing tasks: > > http://www.intertwingly.net/blog/2007/01/08/Xhtml5lib > > Personally, I'm not overly interested in hearing more opinions as to > whether or not there is a valid demand for liberal XML parsing. If you > don't want to use it, don't. I've nothing against liberal XML parsing and I would actually like it to be formalized somewhere, but I do think that calling it an XHTML5 parser is wrong given that XHTML5 as it stands now is supposed to be parsed by an XML parser. > What I WOULD be interested in hearing opinions on is what would be the > best way to maintain this code going forward: could it live as a > separate module within html5lib repository? Should it be a separate > repository? If separate, are there some changes to the tokenizer in > particular that could be made that would either directly enable this > usage or would make it easier to monkey-patch for usage by xhtml5lib? Can't you subclass the tokenizer? (I don't mind it being in the same repository as html5lib by the way. Not sure what the best location is.) 
-- Anne van Kesteren <http://annevankesteren.nl/> <http://www.opera.com/> From rubys at intertwingly.net Mon Jan 8 08:42:49 2007 From: rubys at intertwingly.net (Sam Ruby) Date: Mon, 08 Jan 2007 11:42:49 -0500 Subject: [Imps] Liberal XML parsing In-Reply-To: <op.tluf36by64w2qv@id-c0020> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> Message-ID: <45A27489.3090504@intertwingly.net> Anne van Kesteren wrote: > >> What I WOULD be interested in hearing opinions on is what would be the >> best way to maintain this code going forward: could it live as a >> separate module within html5lib repository? Should it be a separate >> repository? If separate, are there some changes to the tokenizer in >> particular that could be made that would either directly enable this >> usage or would make it easier to monkey-patch for usage by xhtml5lib? > > Can't you subclass the tokenizer? (I don't mind it being in the same > repository as html5lib by the way. Not sure what the best location is.) The current tokenizer has ".lower()" sprinkled throughout and doesn't expose in any meaningful way the difference between empty and start tags. For the tokenizer to be meaningfully subclassed (and by that, I mean without requiring wholesale duplication of a number of methods), these behaviors would need to be factored out into separate methods that could be overridden. - Sam Ruby From annevk at opera.com Mon Jan 8 08:48:12 2007 From: annevk at opera.com (Anne van Kesteren) Date: Mon, 08 Jan 2007 17:48:12 +0100 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A27489.3090504@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> Message-ID: <op.tlugyms364w2qv@id-c0020> On Mon, 08 Jan 2007 17:42:49 +0100, Sam Ruby <rubys at intertwingly.net> wrote: > The current tokenizer has ".lower()" sprinkled throughout and doesn't > expose in any meaningful way the difference between empty and start tags. 
Because there is no difference between them. See the HTML5 specification. > For the tokenizer to be meaningfully subclassed (and by that, I mean > without requiring wholesale duplication of a number of methods), these > behaviors would need to be factored out into separate methods that could > be overridden. You could subclass it and change processSolidusInTag. Instead of throwing an atheist parse error you would change the type of token to be "empty" or something. Not sure how to do the .lower() stuff. I kind of guessed the reason you wanted to change that was because of a project like this :-) -- Anne van Kesteren <http://annevankesteren.nl/> <http://www.opera.com/> From rubys at intertwingly.net Mon Jan 8 09:23:40 2007 From: rubys at intertwingly.net (Sam Ruby) Date: Mon, 08 Jan 2007 12:23:40 -0500 Subject: [Imps] Liberal XML parsing In-Reply-To: <op.tlugyms364w2qv@id-c0020> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> Message-ID: <45A27E1C.7000601@intertwingly.net> Anne van Kesteren wrote: > On Mon, 08 Jan 2007 17:42:49 +0100, Sam Ruby <rubys at intertwingly.net> > wrote: >> The current tokenizer has ".lower()" sprinkled throughout and doesn't >> expose in any meaningful way the difference between empty and start tags. > > Because there is no difference between them. See the HTML5 specification. My point is that by "baking in" that behavior into the tokenizer, it essentially limits that tokenizer to just supporting HTML5. By providing one extra "bit" of information, the potential for reuse is increased. Of course, the html5parser will need to ignore this extra bit, and my patch includes that change. >> For the tokenizer to be meaningfully subclassed (and by that, I mean >> without requiring wholesale duplication of a number of methods), these >> behaviors would need to be factored out into separate methods that >> could be overridden. 
> > You could subclass it and change processSolidusInTag. Instead of > throwing an atheist parse error you would change the type of token to be > "empty" or something. From a maintenance point of view, that is suboptimal. As processSolidusInTag changes, that maintenance would need to occur in two places. > Not sure how to do the .lower() stuff. I kind of guessed the reason you > wanted to change that was because of a project like this :-) I've provided one way: by refactoring it so that all the lowercasing of element names is done in exactly one place, and that the lowercasing of attribute names is also done in exactly one place. That class can be subclassed to provide a different behavior. - - - It is no secret that my interest in the WHATWG started with a dissatisfaction with Python's sgmllib, particularly when used as a foundation for parsing HTML, XHTML, or as a fallback parser for XML. What I see in html5lib is a *much* better foundation. I'm in no particular rush, but if after a few days it turns out that people are OK with something *like* this going into the html5lib repository, I'd love to put it in there -- at which point it would be free to evolve, be renamed, refactored, and enhanced. One thing I would love to work on is a true DOM builder (at which point, I could throw away my XMLDocument, XMLElement, and XMLComment classes), but I would need changes to TreeBuilder so that I could provide my own Text class (for example). Needless to say, such a treebuilder could also be used with HTML5. Once this has stabilized, I would then plan to look at having the UFP take advantage of this library, if it is installed/available. I'd also modify Venus, but such support would not need to be conditional there: Venus could simply include html5lib.
- Sam Ruby From jg307 at cam.ac.uk Mon Jan 8 09:27:09 2007 From: jg307 at cam.ac.uk (James Graham) Date: Mon, 08 Jan 2007 17:27:09 +0000 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A2649F.2020308@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> Message-ID: <45A27EED.7000508@cam.ac.uk> Sam Ruby wrote: > What I WOULD be interested in hearing opinions on is what would be the > best way to maintain this code going forward: could it live as a > separate module within html5lib repository? Should it be a separate > repository? I'm open to hosting it in the same repository; the only issue that I see is that people may conflate the two parts and be put off downloading html5lib because they think it is a liberal XML parser or xhtml5lib because they think it is a html-only project. > If separate, are there some changes to the tokenizer in > particular that could be made that would either directly enable this > usage or would make it easier to monkey-patch for usage by xhtml5lib? Assuming the patches needed don't cause severe regressions in the code readability or performance of html5lib I think the existing tokenizer would be the right place to apply them. -- "Eternity's a terrible thought. I mean, where's it all going to end?" -- Tom Stoppard, Rosencrantz and Guildenstern are Dead From jg307 at cam.ac.uk Mon Jan 8 09:41:41 2007 From: jg307 at cam.ac.uk (James Graham) Date: Mon, 08 Jan 2007 17:41:41 +0000 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A27E1C.7000601@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> <45A27E1C.7000601@intertwingly.net> Message-ID: <45A28255.2050509@cam.ac.uk> Sam Ruby wrote: > I've provided one way: by refactoring it so that all the lowercasing of > element names is done in exactly one place, and that the lowercasing of > attribute names is also done in exactly one place. 
That class can be > subclassed to provide a different behavior. That sounds fine to me. We need to add some Unicode tests though to be sure we're not lowercasing where we shouldn't be. > I'm in no particular rush, but if after a few days it turns out that > people are OK with something *like* this going into the html5lib > repository, I'd love to put it in there -- at which point it would be > free to evolve, be renamed, refactored, and enhanced. One thing I would > love to work on is a true DOM builder (at which point, I could throw > away my XMLDocument, XMLElement, and XMLComment classes), but I would > need changes to TreeBuilder so that I could provide my own Text class > (for example). FWIW I consider supporting one of the Python DOM implementations a priority for the 0.3 release of html5lib (of course we need to release 0.2 first -- at this point that is basically a case of uploading the source archive). Using the current treebuilder interface it should be possible to support DOM-like text nodes without any changes but it's non-trivial so maybe the current interface is in need of improvement (the problem is that we also need to support ElementTree which regards text as attributes). -- "Eternity's a terrible thought. I mean, where's it all going to end?" -- Tom Stoppard, Rosencrantz and Guildenstern are Dead From annevk at opera.com Mon Jan 8 10:28:22 2007 From: annevk at opera.com (Anne van Kesteren) Date: Mon, 08 Jan 2007 19:28:22 +0100 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A27E1C.7000601@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> <45A27E1C.7000601@intertwingly.net> Message-ID: <op.tlullkuh64w2qv@id-c0020> On Mon, 08 Jan 2007 18:23:40 +0100, Sam Ruby <rubys at intertwingly.net> wrote: >> Because there is no difference between them. See the HTML5 >> specification.
> > My point is that by "baking in" that behavior into the tokenizer, it > essentially limits that tokenizer to just supporting HTML5. By > providing one extra "bit" of information, the potential for reuse is > increased. Well, the next "bit" would probably be processing instructions. That's why it would be nice to have some formalization / standardization first to see how many changes are required exactly. Currently html5lib maps rather well to the specification which improves the readability of the code a lot (imho). I'd like to know at how many changes we're looking and how that impacts the code. > From a maintenance point of view, that is suboptimal. As > processSolidusInTag changes, that maintenance would need to occur in two > places. Well, the method isn't that big :-) >> Not sure how to do the .lower() stuff. I kind of guessed the reason you >> wanted to change that was because of a project like this :-) > > I've provided one way: by refactoring it so that all the lowercasing of > element names is done in exactly one place, and that the lowercasing of > attribute names is also done in exactly one place. That class can be > subclassed to provide a different behavior. Do you have this as a standalone patch somewhere? As mentioned before, I'd like to see how it deals with non-ASCII characters. > Once this has stabilized, I would then plan to look at having the UFP take > advantage of this library, if it is installed/available. I'd also > modify Venus, but such support would not need to be conditional there: > Venus could simply include html5lib. That'd be cool! I read today that actual usage and support is important if you want your library to be included in the default distribution.
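[Editor's sketch, not part of the original message.] The non-ASCII question raised above can be made concrete. The snippet below is a minimal illustration assuming a modern Python 3 interpreter (the thread predates it); the helper name ascii_lower is hypothetical and does not come from html5lib. It contrasts the built-in str.lower(), which applies Unicode case mappings, with a translation table restricted to A-Z.

```python
import string

# ASCII-only lowercasing table: maps the code points of A-Z to a-z, nothing else.
ASCII_LOWER = {ord(c): ord(c.lower()) for c in string.ascii_uppercase}


def ascii_lower(name):
    """Lowercase only ASCII letters; every other character is left untouched.

    Hypothetical helper for illustration, not an html5lib function.
    """
    return name.translate(ASCII_LOWER)


# U+212A KELVIN SIGN looks like "K" but is not an ASCII letter.
kelvin = "\u212a"

# str.lower() applies the Unicode case mapping and turns it into ASCII "k".
unicode_lowered = kelvin.lower()

# The ASCII-only table leaves it alone, so it cannot collide with a real "k".
ascii_lowered = ascii_lower(kelvin)

print(repr(unicode_lowered), repr(ascii_lowered))
print(ascii_lower("DIV"), ascii_lower("\u00c9L"))
```

The point of the comparison is only that a tag or attribute name containing a non-ASCII character is changed by .lower() but not by the ASCII-only table; which behavior a given parser should have is a separate question.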
-- Anne van Kesteren <http://annevankesteren.nl/> <http://www.opera.com/> From rubys at intertwingly.net Mon Jan 8 10:46:27 2007 From: rubys at intertwingly.net (Sam Ruby) Date: Mon, 08 Jan 2007 13:46:27 -0500 Subject: [Imps] Liberal XML parsing In-Reply-To: <op.tlullkuh64w2qv@id-c0020> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> <45A27E1C.7000601@intertwingly.net> <op.tlullkuh64w2qv@id-c0020> Message-ID: <45A29183.6080403@intertwingly.net> Anne van Kesteren wrote: > On Mon, 08 Jan 2007 18:23:40 +0100, Sam Ruby <rubys at intertwingly.net> > wrote: >>> Because there is no difference between them. See the HTML5 >>> specification. >> >> My point is that by "baking in" that behavior into the tokenizer, it >> essentially limits that tokenizer to just supporting HTML5. By >> providing one extra "bit" of information, the potential for reuse is >> increased. > > Well, the next "bit" would probably be processing instructions. That's > why it would be nice to have some formalization / standardization first > to see how many changes are required exactly. I have no interest in XML processing instructions at this time. > Currently html5lib maps rather well to the specificaction which improves > the readability of the code a lot (imho). I'd like to know at how many > changes we're looking and how that impacts the code. That's why I provided a comprehensive patch: http://intertwingly.net/stories/2007/01/08/xhtml5.diff >>> Not sure how to do the .lower() stuff. I kind of guessed the reason >>> you wanted to change that was because of a project like this :-) >> >> I've provided one way: by refactoring it so that all the lowercasing >> of element names is done in exactly one place, and that the >> lowercasing of attribute names is also done in exactly one place. >> That class can be subclassed to provide a different behavior. > > Do you this as a standalone patch somewhere? 
As mentioned before, I'd > like to see how it deals with non-ASCII characters. The patch isn't all that big. The relevant portions are: asciiLower = dict([(ord(c),ord(c.lower())) for c in string.ascii_uppercase]) token["name"] = token["name"].translate(asciiLower) token["data"] = dict([(attr.translate(asciiLower), value) for attr,value in token["data"][::-1]]) - Sam Ruby From annevk at opera.com Mon Jan 8 14:44:40 2007 From: annevk at opera.com (Anne van Kesteren) Date: Mon, 08 Jan 2007 23:44:40 +0100 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A29183.6080403@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> <45A27E1C.7000601@intertwingly.net> <op.tlullkuh64w2qv@id-c0020> <45A29183.6080403@intertwingly.net> Message-ID: <op.tluxgqrl64w2qv@id-c0020> On Mon, 08 Jan 2007 19:46:27 +0100, Sam Ruby <rubys at intertwingly.net> wrote: >> Well, the next "bit" would probably be processing instructions. That's >> why it would be nice to have some formalization / standardization first >> to see how many changes are required exactly. > > I have no interest in XML processing instructions at this time. Fair enough. But if this is becoming the foundation of an (experimental) liberal XML parser we'll have interest in due course I reckon. If only for <?xbl?> and <?xml-stylesheet?>. >> Currently html5lib maps rather well to the specificaction which >> improves the readability of the code a lot (imho). I'd like to know at >> how many changes we're looking and how that impacts the code. > > That's why I provided a comprehensive patch: > > http://intertwingly.net/stories/2007/01/08/xhtml5.diff Instead of using string.ascii_uppercase you should use our internal asciiUppercase. Also, instead of using a dict for translating can't you just provide two strings? I'd think that would be faster. 
The normalizeToken method should be inlined as you only want to do that from a single place anyway. And EndTag should use the translate method and not .lower(). I suppose these changes also remove the need for asciiLowercase (not asciiLower that you introduce) as defined in constants.py. Anyway, with these nits (open for debate) I think I'm ok with doing this assuming you will update the tests as well (or someone else will). I'd like to have a liberal XML parser too one day and working on an experimental implementation of one can't hurt I suppose :-) If xhtml5parser.py is the only other file I would be fine with adding that to src/ as liberalxmlparser.py. Bit of a lengthy name, but it more accurately reflects what it is. -- Anne van Kesteren <http://annevankesteren.nl/> <http://www.opera.com/> From rubys at intertwingly.net Mon Jan 8 17:15:38 2007 From: rubys at intertwingly.net (Sam Ruby) Date: Mon, 08 Jan 2007 20:15:38 -0500 Subject: [Imps] Liberal XML parsing In-Reply-To: <op.tluxgqrl64w2qv@id-c0020> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> <45A27E1C.7000601@intertwingly.net> <op.tlullkuh64w2qv@id-c0020> <45A29183.6080403@intertwingly.net> <op.tluxgqrl64w2qv@id-c0020> Message-ID: <45A2ECBA.4040607@intertwingly.net> Anne van Kesteren wrote: > >>> Currently html5lib maps rather well to the specification which >>> improves the readability of the code a lot (imho). I'd like to know >>> at how many changes we're looking and how that impacts the code. >> >> That's why I provided a comprehensive patch: >> >> http://intertwingly.net/stories/2007/01/08/xhtml5.diff > > Instead of using string.ascii_uppercase you should use our internal > asciiUppercase. Also, instead of using a dict for translating can't you > just provide two strings? I'd think that would be faster.
I don't understand the suggestion to use the internal asciiUppercase - with my patch, this constant is no longer used. And my constant was defined in the src/constants.py file... I also don't understand the suggestion to "just provide two strings". That's not how Python's unicode.translate() method works. > The normalizeToken method should be inlined as you only want to do that > from a single place anyway. And EndTag should use the translate method > and not .lower(). While it is true that normalizeToken is only called from one place, this method can't be inlined as the liberal XML parser subclass needs to override this behavior. > I suppose these changes also remove the need for asciiLowercase (not > asciiLower that you introduce) as defined in constants.py. asciiLowercase is still used in the portion of the logic dealing with DocTypes. But having two similarly named constants with quite different purposes is confusing, and clearly *that* should be changed. > Anyway, with these nits (open for debate) I think I'm ok with doing this > assuming you will update the tests as well (or someone else will). I'd > like to have a liberal XML parser too one day and working on an > experimental implementation of one can't hurt I suppose :-) In case you didn't notice it, here are the tests: http://intertwingly.net/stories/2007/01/08/tests/test_xhtml.py > If xhtml5parser.py is the only other file I would be fine with adding > that to src/ as liberalxmlparser.py. Bit of a lengthy name, but it more > accurately reflects what it is. I'm not worried about the name. That name is fine. I'll look into committing this tomorrow, with your proposed module name, with the unit tests, and with some subset of these nits addressed. I'll add comments at the top of the module indicating that this support is experimental and subject to change and even removal at any time.
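[Editor's sketch, not part of the original message.] The design described above, a single overridable normalization hook, might look roughly like the following. This is an illustration under stated assumptions, not the contents of the actual patch: the class names SketchTokenizer and SketchXMLTokenizer, the method name normalize_token, and the token dictionary layout are all hypothetical, modeled loosely on the three lines quoted earlier in the thread. Python 3 is assumed.

```python
import string

# Translation table covering only A-Z, in the spirit of the asciiLower
# constant quoted earlier in the thread.
ASCII_LOWER = {ord(c): ord(c.lower()) for c in string.ascii_uppercase}


class SketchTokenizer:
    """Hypothetical stand-in for an HTML tokenizer; not html5lib's real class."""

    def normalize_token(self, token):
        # All case folding lives here, so a subclass can replace it wholesale.
        if token["type"] in ("StartTag", "EndTag"):
            token["name"] = token["name"].translate(ASCII_LOWER)
        if token["type"] == "StartTag":
            # Attributes arrive as a list of [name, value] pairs. The list is
            # reversed before building the dict so that, when a name repeats,
            # the first occurrence is the one that survives.
            token["data"] = dict(
                (attr.translate(ASCII_LOWER), value)
                for attr, value in token["data"][::-1]
            )
        return token


class SketchXMLTokenizer(SketchTokenizer):
    """Hypothetical liberal-XML variant: names keep their original case."""

    def normalize_token(self, token):
        if token["type"] == "StartTag":
            token["data"] = dict(token["data"][::-1])
        return token


def make_token():
    return {"type": "StartTag", "name": "DIV", "data": [["ID", "a"], ["id", "b"]]}


html_token = SketchTokenizer().normalize_token(make_token())
xml_token = SketchXMLTokenizer().normalize_token(make_token())
print(html_token)
print(xml_token)
```

With the hook in one place, the HTML variant folds "ID" and "id" into a single attribute, while the XML variant keeps them distinct and leaves the element name as written. The cost is one extra method call per token that goes through the hook.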
- Sam Ruby From annevk at opera.com Mon Jan 8 17:27:49 2007 From: annevk at opera.com (Anne van Kesteren) Date: Tue, 09 Jan 2007 02:27:49 +0100 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A2ECBA.4040607@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> <45A27E1C.7000601@intertwingly.net> <op.tlullkuh64w2qv@id-c0020> <45A29183.6080403@intertwingly.net> <op.tluxgqrl64w2qv@id-c0020> <45A2ECBA.4040607@intertwingly.net> Message-ID: <op.tlu40nal64w2qv@id-c0020> On Tue, 09 Jan 2007 02:15:38 +0100, Sam Ruby <rubys at intertwingly.net> wrote: >>>> Currently html5lib maps rather well to the specificaction which >>>> improves the readability of the code a lot (imho). I'd like to know >>>> at how many changes we're looking and how that impacts the code. >>> >>> That's why I provided a comprehensive patch: >>> >>> http://intertwingly.net/stories/2007/01/08/xhtml5.diff >> Instead of using string.ascii_uppercase you should use our internal >> asciiUppercase. Also, instead of using a dict for translating can't you >> just provide two strings? I'd think that would be faster. > > I don't understand the suggestion to use the internal asciiUppercase - > with my patch, this constant is no longer used. And my constant was > defined in the src/constants.py file... But you haven't removed the constant either. But as you later note it's still used somewhere else... > I also don't understand the suggestion to "just provide two strings". > That's not how Python's unicode.translate() method works. Oh, sorry about that. I vaguely recalled translate() from http://diveintopython.org/performance_tuning/dictionary_lookups.html and apparently it works slightly different from what I remembered. Should we use string.maketrans? >> The normalizeToken method should be inlined as you only want to do that >> from a single place anyway. 
And EndTag should use the translate method >> and not .lower(). > > While it is true that normalizeToken is only called from one place, this > method can't be inlined as the liberal XML parser subclass needs to > override this behavior. Hmm, not so nice. For a large page that's a lot of additional method calls. Can you redo it a bit making sure we don't make that call for all non tag tokens at least, such as characters. >> I suppose these changes also remove the need for asciiLowercase (not >> asciiLower that you introduce) as defined in constants.py. > > asciiLowercase is still used in the portion of the logic dealing with > DocTypes. But having two similarly named constants with quite different > purposes is confusing, and clearly *that* should be changed. Yeah. >> Anyway, with these nits (open for debate) I think I'm ok with doing >> this assuming you will update the tests as well (or someone else will). >> I'd like to have a liberal XML parser too one day and working on an >> experimental implementation of one can't hurt I suppose :-) > > In case you didn't notice it, here are the tests: > > http://intertwingly.net/stories/2007/01/08/tests/test_xhtml.py I noticed those, but you also had some comments on updating tests. >> If xhtml5parser.py is the only other file I would be fine with adding >> that to src/ as liberalxmlparser.py. Bit of a lengthty name, but it >> more accurately reflects what it is. > > I'm not worried about the the name. That name is fine. > > I'll look into committing this tomorrow, with your proposed module name, > with the unit tests, and with some subset of these nits addressed. I'll > add comments at the top of the module indicating that this support is > experimental and subject to change and even removal at any time. Ok, cool. 
-- Anne van Kesteren <http://annevankesteren.nl/> <http://www.opera.com/> From teeshift at gmail.com Mon Jan 8 19:59:45 2007 From: teeshift at gmail.com (tee shift) Date: Tue, 9 Jan 2007 11:59:45 +0800 Subject: [Imps] adding object namespace function on canvas (so we could use them.) Message-ID: <18b3755f0701081959va5352e1o3606f64e67413d33@mail.gmail.com> Haven't read all the spec yet. I wonder if in future canvas will have a feature like the following beginPath circle = arc(30, 50, 0, Math.PI*2, 2) endPath ........ circle.moveTo(40,20) ..... so that we could assign a name to an object we have drawn and later change its attributes. In this example the location is changed, but we could also change the radius, center or size. Thanks. Tee Shift -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.whatwg.org/pipermail/implementors-whatwg.org/attachments/20070109/8693045e/attachment.htm> From annevk at opera.com Tue Jan 9 02:25:18 2007 From: annevk at opera.com (Anne van Kesteren) Date: Tue, 09 Jan 2007 11:25:18 +0100 Subject: [Imps] adding object namespace function on canvas (so we could use them.) In-Reply-To: <18b3755f0701081959va5352e1o3606f64e67413d33@mail.gmail.com> References: <18b3755f0701081959va5352e1o3606f64e67413d33@mail.gmail.com> Message-ID: <op.tlvtwgls64w2qv@id-c0020> On Tue, 09 Jan 2007 04:59:45 +0100, tee shift <teeshift at gmail.com> wrote: > Haven't read all the spec yet. > I wonder if in future canvas will have a feature like the following > > beginPath > circle = arc(30, 50, 0, Math.PI*2, 2) > endPath > ........ > circle.moveTo(40,20) > ..... > > so that we could assign a name to an object we have drawn and later change its > attributes. > In this example the location is changed, but we could also change the radius, center or size. The idea of <canvas> is that it doesn't consist of objects, but is just a bitmap you can draw upon. If you really want the objects you want SVG. Also, feedback like this ought to go to whatwg at whatwg.org.
-- Anne van Kesteren <http://annevankesteren.nl/> <http://www.opera.com/> From rubys at intertwingly.net Wed Jan 10 02:36:40 2007 From: rubys at intertwingly.net (Sam Ruby) Date: Wed, 10 Jan 2007 05:36:40 -0500 Subject: [Imps] True DOM TreeBuilder Message-ID: <45A4C1B8.1000708@intertwingly.net> I just committed a minidom.getDOMImplementation() based TreeBuilder to html5lib. Notes: 1) I had to monkey patch minidom in order to get text nodes that are immediate children of the document node to work. 2) Based on how html5 is spec'ed, the doctypes become "HTML" instead of "html", which is what you would expect in an XML DOM representation. 3) This implementation is not namespace aware, nor are the elements placed in the XHTML namespace. Demo: http://code.google.com/p/html5lib/ is purportedly XHTML 1.0 Strict, but is served as text/html and contains such dubious constructs as "<div id=gaia>". You can obtain a cleaned up version of this page after a side trip through the DOM via: $ python parse.py -b dom -x http://code.google.com/p/html5lib/ In particular, note what the DOM's default "toxml()" method does to the script near the end of this page. - Sam Ruby From ian at hixie.ch Fri Jan 12 15:13:21 2007 From: ian at hixie.ch (Ian Hickson) Date: Fri, 12 Jan 2007 23:13:21 +0000 (UTC) Subject: [Imps] <style> processing change in the HTML5 parsing spec Message-ID: <Pine.LNX.4.62.0701122312130.4611@dhalsim.dreamhost.com> Ostensibly for compatibility with IE, the HTML5 parsing spec is going to get an experimental change to the processing of <style> elements. Instead of being moved to the <head>, they are now going to be left wherever they are found. In addition, <script> and <style> elements found during the "after head" mode will be inserted into the <head>, instead of being inserted between the <head> and the <body>. Let me know if any of this causes you problems. -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

From annevk at opera.com Sat Jan 13 02:46:31 2007
From: annevk at opera.com (Anne van Kesteren)
Date: Sat, 13 Jan 2007 11:46:31 +0100
Subject: [Imps] <style> processing change in the HTML5 parsing spec
In-Reply-To: <Pine.LNX.4.62.0701122312130.4611@dhalsim.dreamhost.com>
References: <Pine.LNX.4.62.0701122312130.4611@dhalsim.dreamhost.com>
Message-ID: <op.tl29jthb64w2qv@id-c0020>

On Sat, 13 Jan 2007 00:13:21 +0100, Ian Hickson <ian at hixie.ch> wrote:
> Ostensibly for compatibility with IE, the HTML5 parsing spec is going to
> get an experimental change to the processing of <style> elements.
>
> Instead of being moved to the <head>, they are now going to be left
> wherever they are found.
>
> In addition, <script> and <style> elements found during the "after head"
> mode will be inserted into the <head>, instead of being inserted between
> the <head> and the <body>.
>
> Let me know if any of this causes you problems.

r475 of html5lib contains this change. Thanks for the testcases by the way!

-- 
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>

From annevk at opera.com Sun Jan 14 10:05:02 2007
From: annevk at opera.com (Anne van Kesteren)
Date: Sun, 14 Jan 2007 19:05:02 +0100
Subject: [Imps] </body> and after body phase
Message-ID: <op.tl5oio0f64w2qv@id-c0020>

I need some kind of strategy that:

  <!doctype html><li></body>

doesn't cause any parse errors but that

  <!doctype html><div><li></body>

does.

Note you can't actually imply </li> on </body> because that would break
stuff.

(This doesn't seem to be covered in the specification, but is assumed in
at least one of the Google testcases (and makes sense).)
-- 
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>

From rubys at intertwingly.net Mon Jan 15 00:39:26 2007
From: rubys at intertwingly.net (Sam Ruby)
Date: Mon, 15 Jan 2007 03:39:26 -0500
Subject: [Imps] i18n message discussion
Message-ID: <45AB3DBE.2040204@intertwingly.net>

Just tossing this out for discussion...

Typical error in html5lib:

    self.parser.parseError(_("Unexpected end of file. Expected end "
      u"tag (" + self.tree.openElements[1].name + u") first."))

Typical error in feedvalidator:

    self.log(UndefinedNamedEntity({'value':name}))

Discussion: in the feedvalidator, each class of error is mapped to a
Python class, and is parameterizable. The unit tests can verify that a
specific error is (or is not) generated, and can even match on
parameters. At runtime, each error class is mapped to a sprintf type of
string, which can be language specific. The validator also uses the
class name as the name of an html page which can contain more
information and pointers to spec text or other information.

Obviously, the reason why I am bringing this up is that I prefer the
feedvalidator approach (credit due to Mark Pilgrim).

- Sam Ruby

From annevk at opera.com Mon Jan 15 02:17:04 2007
From: annevk at opera.com (Anne van Kesteren)
Date: Mon, 15 Jan 2007 11:17:04 +0100
Subject: [Imps] i18n message discussion
In-Reply-To: <45AB3DBE.2040204@intertwingly.net>
References: <45AB3DBE.2040204@intertwingly.net>
Message-ID: <op.tl6xiqcd64w2qv@id-c0020>

On Mon, 15 Jan 2007 09:39:26 +0100, Sam Ruby <rubys at intertwingly.net> wrote:
> Typical error in html5lib:
>
> self.parser.parseError(_("Unexpected end of file. Expected end "
> u"tag (" + self.tree.openElements[1].name + u") first."))
>
> Typical error in feedvalidator:
>
> self.log(UndefinedNamedEntity({'value':name}))

I agree the latter is a lot better. It's not entirely clear how to do this though.
Every time you emit a parse error per the specification it's quite
different, and currently we have the ability to make the error messages as
accurate as possible for every situation. I'd be happy with a more
concrete proposal on how to do this though.

-- 
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>

From jg307 at cam.ac.uk Mon Jan 15 03:53:47 2007
From: jg307 at cam.ac.uk (James Graham)
Date: Mon, 15 Jan 2007 11:53:47 +0000
Subject: [Imps] i18n message discussion
In-Reply-To: <op.tl6xiqcd64w2qv@id-c0020>
References: <45AB3DBE.2040204@intertwingly.net> <op.tl6xiqcd64w2qv@id-c0020>
Message-ID: <45AB6B4B.6030304@cam.ac.uk>

Anne van Kesteren wrote:
> On Mon, 15 Jan 2007 09:39:26 +0100, Sam Ruby <rubys at intertwingly.net>
> wrote:
>> Typical error in html5lib:
>>
>> self.parser.parseError(_("Unexpected end of file. Expected end "
>> u"tag (" + self.tree.openElements[1].name + u") first."))
>>
>> Typical error in feedvalidator:
>>
>> self.log(UndefinedNamedEntity({'value':name}))
>
> I agree the latter is a lot better. It's not entirely clear how to do this
> though. Every time you emit a parse error per the specification it's quite
> different, and currently we have the ability to make the error messages as
> accurate as possible for every situation. I'd be happy with a more
> concrete proposal on how to do this though.

Well, with enough parameters to the class it shouldn't be impossible.

Also, I've just set up a Google Groups group/mailing list for
html5lib-specific issues such as these: html5lib-discuss at googlegroups.com
- see http://groups.google.com/group/html5lib-discuss/topics

-- 
"Eternity's a terrible thought. I mean, where's it all going to end?"
 -- Tom Stoppard, Rosencrantz and Guildenstern are Dead

From ian at hixie.ch Tue Jan 16 14:49:59 2007
From: ian at hixie.ch (Ian Hickson)
Date: Tue, 16 Jan 2007 22:49:59 +0000 (UTC)
Subject: [Imps] </body> and after body phase
In-Reply-To: <op.tl5oio0f64w2qv@id-c0020>
References: <op.tl5oio0f64w2qv@id-c0020>
Message-ID: <Pine.LNX.4.62.0701162236190.4611@dhalsim.dreamhost.com>

On Sun, 14 Jan 2007, Anne van Kesteren wrote:
>
> I need some kind of strategy that:
>
> <!doctype html><li></body>
>
> doesn't cause any parse errors but that
>
> <!doctype html><div><li></body>
>
> does.
>
> Note you can't actually imply </li> on </body> because that would break
> stuff.
>
> (This doesn't seem to be covered in the specification, but is assumed in
> at least one of the Google testcases (and makes sense).)

Yeah, known issue.

You just want to start at the end of the stack and walk back until the
second element. If they are all elements that get closed when you generate
implied end tags, and if the second element is <body>, then you're ok;
otherwise, raise a parse error.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
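
The stack walk described in this reply can be sketched in a few lines of Python. This is only an illustration, not html5lib code: the function name is hypothetical, the stack is modelled as a plain list of tag names with the root element first, and the set of elements whose end tags may be implied (dd, dt, li, p, td, th, tr) is an assumption about the draft's "generate implied end tags" step.

```python
# Assumed set of elements closed by "generate implied end tags";
# check it against the version of the spec you are implementing.
IMPLIED_END_TAG_ELEMENTS = frozenset(["dd", "dt", "li", "p", "td", "th", "tr"])


def body_end_tag_is_clean(open_elements):
    """Return True if </body> can be processed without a parse error.

    open_elements is the stack of open elements as tag names, root first,
    e.g. ["html", "body", "li"].
    """
    # The second element on the stack must be <body>.
    if len(open_elements) < 2 or open_elements[1] != "body":
        return False
    # Everything above <body> must be closable by implied end tags.
    return all(name in IMPLIED_END_TAG_ELEMENTS for name in open_elements[2:])


# <!doctype html><li></body>      -> stack is html, body, li
li_case = body_end_tag_is_clean(["html", "body", "li"])
# <!doctype html><div><li></body> -> stack is html, body, div, li
div_case = body_end_tag_is_clean(["html", "body", "div", "li"])
```

Under these assumptions the first input passes and the second does not, because the open <div> cannot be closed by an implied end tag. A caller would raise the parse error when the function returns False. Note that the check only inspects the stack; it does not pop the <li>, in line with the remark that </li> must not actually be implied on </body>.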
From jg307 at cam.ac.uk Wed Jan 3 08:07:30 2007 From: jg307 at cam.ac.uk (James Graham) Date: Wed, 03 Jan 2007 16:07:30 +0000 Subject: [Imps] Adding "content model flags" to tokenization tests In-Reply-To: <a9699fd20612290330j6f9d5a7dkd24b472980b22449@mail.gmail.com> References: <a9699fd20612290330j6f9d5a7dkd24b472980b22449@mail.gmail.com> Message-ID: <459BD4C2.3060100@cam.ac.uk> Thomas Broyer wrote: > 2006/12/28, Thomas Broyer: >> 2006/12/23, James Graham: >> > >> > [1] An example of something that, at present can only be checked >> > through a parser test is the proper tokenizing of a fragment like >> > <plaintext><head>&body; >> >> How about adding a new "parameter" to tests to set the initial >> "content model flag" (defaulting to "PCDATA" if not present)? > > I've finally created some test cases (attached) with a > "contentModelFlags" property whose value is a list of "content model > flag"s. The test case is then run successively with the same input and > expected output but initialized with a different "content model flag". > If the property is not given, it defaults to ["PCDATA"] (a list with a > single value "PCDATA"). That's; I've added these to the html5lib svn repository and updated our test framework to run the new tests. -- "The universe doesn't care what you believe. The wonderful thing about science is that it doesn't ask for your faith, it just asks for your eyes" --- http://xkcd.com/c154.html From ian at hixie.ch Wed Jan 3 15:07:20 2007 From: ian at hixie.ch (Ian Hickson) Date: Wed, 3 Jan 2007 23:07:20 +0000 (UTC) Subject: [Imps] Reasonable limits on buffered values In-Reply-To: <BAY109-F26C90CA07674EF007E2364B4C60@phx.gbl> References: <BAY109-F26C90CA07674EF007E2364B4C60@phx.gbl> Message-ID: <Pine.LNX.4.62.0701032249260.4611@dhalsim.dreamhost.com> On Fri, 29 Dec 2006, Simon Pieters wrote: > > From: Henri Sivonen <hsivonen at iki.fi> > >I'm wondering if there's a best practice here. 
Is there data on how > >long non-malicious attribute values legitimately appear on the Web? I'll see if I can get some data. (No ETA.) > Additionally, .NET applications can have long attribute values too. See > "Figure 3. Simple page LessViewState.aspx with DataGrid1" at > > http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnaspnet/html/asp11222001.asp > > That's 3.05 KiB, but can get a lot longer depending on the number of > form controls, I think. I myself have written pages with significantly longer href="" attributes, e.g. when using long data: URIs of big images. The problem is that whatever limit you set, you'll always find a legitimate document that's bigger. It sounds stupid but the best practice really is to not have explicit limits, but instead to have algorithms that can handle any volume of input without exploding. It might be best, in fact, to limit CPU and memory usage, rather than attempting to limit input buffers. ("This page would take too many resources to handle.") That actually handles the billion laughs problem without having to special case anything to do with it. -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' From rubys at intertwingly.net Mon Jan 8 07:34:55 2007 From: rubys at intertwingly.net (Sam Ruby) Date: Mon, 08 Jan 2007 10:34:55 -0500 Subject: [Imps] Liberal XML parsing Message-ID: <45A2649F.2020308@intertwingly.net> I've posted a note on how the code in html5lib could serve as an excellent foundation for a number of "liberal" XML parsing tasks: http://www.intertwingly.net/blog/2007/01/08/Xhtml5lib Personally, I'm not overly interested in hearing more opinions as to whether or not there is a valid demand for liberal XML parsing. If you don't want to use it, don't. 
What I WOULD be interested in hearing opinions on is what would be the best way to maintain this code going forward: could it live as a separate module within html5lib repository? Should it be a separate repository? If separate, are there some changes to the tokenizer in particular that could be made that would either directly enable this usage or would make it easier to monkey-patch for usage by xhtml5lib? - Sam Ruby From annevk at opera.com Mon Jan 8 08:29:56 2007 From: annevk at opera.com (Anne van Kesteren) Date: Mon, 08 Jan 2007 17:29:56 +0100 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A2649F.2020308@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> Message-ID: <op.tluf36by64w2qv@id-c0020> On Mon, 08 Jan 2007 16:34:55 +0100, Sam Ruby <rubys at intertwingly.net> wrote: > I've posted a note on how the code in html5lib could serve as an > excellent foundation for a number of "liberal" XML parsing tasks: > > http://www.intertwingly.net/blog/2007/01/08/Xhtml5lib > > Personally, I'm not overly interested in hearing more opinions as to > whether or not there is a valid demand for liberal XML parsing. If you > don't want to use it, don't. I've nothing against liberal XML parsing and I would actually like it to be formalized somewhere, but I do think that calling it an XHTML5 parser is wrong given that XHTML5 as it stands now is supposed to be parsed by an XML parser. > What I WOULD be interested in hearing opinions on is what would be the > best way to maintain this code going forward: could it live as a > separate module within html5lib repository? Should it be a separate > repository? If separate, are there some changes to the tokenizer in > particular that could be made that would either directly enable this > usage or would make it easier to monkey-patch for usage by xhtml5lib? Can't you subclass the tokenizer? (I don't mind it being in the same repository as html5lib by the way. Not sure what the best location is.) 
-- Anne van Kesteren <http://annevankesteren.nl/> <http://www.opera.com/> From rubys at intertwingly.net Mon Jan 8 08:42:49 2007 From: rubys at intertwingly.net (Sam Ruby) Date: Mon, 08 Jan 2007 11:42:49 -0500 Subject: [Imps] Liberal XML parsing In-Reply-To: <op.tluf36by64w2qv@id-c0020> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> Message-ID: <45A27489.3090504@intertwingly.net> Anne van Kesteren wrote: > >> What I WOULD be interested in hearing opinions on is what would be the >> best way to maintain this code going forward: could it live as a >> separate module within html5lib repository? Should it be a separate >> repository? If separate, are there some changes to the tokenizer in >> particular that could be made that would either directly enable this >> usage or would make it easier to monkey-patch for usage by xhtml5lib? > > Can't you subclass the tokenizer? (I don't mind it being in the same > repository as html5lib by the way. Not sure what the best location is.) The current tokenizer has ".lower()" sprinkled throughout and doesn't expose in any meaningful way the difference between empty and start tags. For the tokenizer to be meaningfully subclassed (and by that, I mean without requiring wholesale duplication of a number of methods), these behaviors would need to be factored out into separate methods that could be overridden. - Sam Ruby From annevk at opera.com Mon Jan 8 08:48:12 2007 From: annevk at opera.com (Anne van Kesteren) Date: Mon, 08 Jan 2007 17:48:12 +0100 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A27489.3090504@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> Message-ID: <op.tlugyms364w2qv@id-c0020> On Mon, 08 Jan 2007 17:42:49 +0100, Sam Ruby <rubys at intertwingly.net> wrote: > The current tokenizer has ".lower()" sprinkled throughout and doesn't > expose in any meaningful way the difference between empty and start tags. 
Because there is no difference between them. See the HTML5 specification. > For the tokenizer to be meaningfully subclassed (and by that, I mean > without requiring wholesale duplication of a number of methods), these > behaviors would need to be factored out into separate methods that could > be overridden. You could subclass it and change processSolidusInTag. Instead of throwing an atheist parse error you would change the type of token to be "empty" or something. Not sure how to do the .lower() stuff. I kind of guessed the reason you wanted to change that was because of a project like this :-) -- Anne van Kesteren <http://annevankesteren.nl/> <http://www.opera.com/> From rubys at intertwingly.net Mon Jan 8 09:23:40 2007 From: rubys at intertwingly.net (Sam Ruby) Date: Mon, 08 Jan 2007 12:23:40 -0500 Subject: [Imps] Liberal XML parsing In-Reply-To: <op.tlugyms364w2qv@id-c0020> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> Message-ID: <45A27E1C.7000601@intertwingly.net> Anne van Kesteren wrote: > On Mon, 08 Jan 2007 17:42:49 +0100, Sam Ruby <rubys at intertwingly.net> > wrote: >> The current tokenizer has ".lower()" sprinkled throughout and doesn't >> expose in any meaningful way the difference between empty and start tags. > > Because there is no difference between them. See the HTML5 specification. My point is that by "baking in" that behavior into the tokenizer, it essentially limits that tokenizer to just supporting HTML5. By providing one extra "bit" of information, the potential for reuse is increased. Of course, the html5parser will need to ignore this extra bit, and my patch includes that change. >> For the tokenizer to be meaningfully subclassed (and by that, I mean >> without requiring wholesale duplication of a number of methods), these >> behaviors would need to be factored out into separate methods that >> could be overridden. 
> > You could subclass it and change processSolidusInTag. Instead of > throwing an atheist parse error you would change the type of token to be > "empty" or something. From a maintenance point of view, that is suboptimal. As processSolidusInTag changes, that maintenance would need to occur in two places. > Not sure how to do the .lower() stuff. I kind of guessed the reason you > wanted to change that was because of a project like this :-) I've provided one way: by refactoring it so that all the lowercasing of element names is done in exactly one place, and that the lowercasing of attribute names is also done in exactly one place. That class can be subclassed to provide a different behavior. - - - It is no secret that my interest in the WHATWG started with a dissatisfaction with Python's sgmllib, particularly when used as a foundation for parsing HTML, XHTML, or as a fallback parser for XML. What I see in html5lib is a *much* better foundation. I'm in no particular rush, but if after a few days it turns out that people are OK with something *like* this going into the html5lib repository, I'd love to put it in there -- at which point it would be free to evolve, be renamed, refactored, and enhanced. One thing I would love to work on is a true DOM builder (at which point, I could throw away my XMLDocument, XMLElement, and XMLComment classes), but I would need changes to TreeBuilder so that I could provide my own Text class (for example). Needless to say, such a treebuilder could also be used with HTML5. Once this stabilized, I would them plan to look at having the UFP take advantage of this library, if it is installed/available. I'd also modify Venus, but such support would not need to be conditional there: Venus could simply include html5lib. 
- Sam Ruby From jg307 at cam.ac.uk Mon Jan 8 09:27:09 2007 From: jg307 at cam.ac.uk (James Graham) Date: Mon, 08 Jan 2007 17:27:09 +0000 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A2649F.2020308@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> Message-ID: <45A27EED.7000508@cam.ac.uk> Sam Ruby wrote: > What I WOULD be interested in hearing opinions on is what would be the > best way to maintain this code going forward: could it live as a > separate module within html5lib repository? Should it be a separate > repository? I'm open to hosting it in the same repository; the only issue that I see is that people may conflate the two parts and be put off downloading html5lib because they think it is a liberal XML parser or xhtml5lib because they think it is a html-only project. > If separate, are there some changes to the tokenizer in > particular that could be made that would either directly enable this > usage or would make it easier to monkey-patch for usage by xhtml5lib? Assuming the patches needed don't cause severe regressions in the code readability or performance of html5lib I think the existing tokenizer would be the right place to apply them. -- "Eternity's a terrible thought. I mean, where's it all going to end?" -- Tom Stoppard, Rosencrantz and Guildenstern are Dead From jg307 at cam.ac.uk Mon Jan 8 09:41:41 2007 From: jg307 at cam.ac.uk (James Graham) Date: Mon, 08 Jan 2007 17:41:41 +0000 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A27E1C.7000601@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> <45A27E1C.7000601@intertwingly.net> Message-ID: <45A28255.2050509@cam.ac.uk> Sam Ruby wrote: > I've provided one way: by refactoring it so that all the lowercasing of > element names is done in exactly one place, and that the lowercasing of > attribute names is also done in exactly one place. 
That class can be > subclassed to provide a different behavior. That sounds fine to me. We need to add some unicode tests though to be sure we're not lowercasing where we shouldn't be. > I'm in no particular rush, but if after a few days it turns out that > people are OK with something *like* this going into the html5lib > repository, I'd love to put it in there -- at which point it would be > free to evolve, be renamed, refactored, and enhanced. One thing I would > love to work on is a true DOM builder (at which point, I could throw > away my XMLDocument, XMLElement, and XMLComment classes), but I would > need changes to TreeBuilder so that I could provide my own Text class > (for example). FWIW I consider supporting one of the python DOM implementations a priority for the 0.3 release of html5lib (of course we need to release 0.2 first -- at this point that is basically a case of uploading the source archive). Using the current treebuilder interface it should be possible to support DOM-like text nodes without any changes but it's non-trivial so maybe the current interface is in need of improvement (the problem is that we aslo need to support ElementTree which regards text as attributes). -- "Eternity's a terrible thought. I mean, where's it all going to end?" -- Tom Stoppard, Rosencrantz and Guildenstern are Dead From annevk at opera.com Mon Jan 8 10:28:22 2007 From: annevk at opera.com (Anne van Kesteren) Date: Mon, 08 Jan 2007 19:28:22 +0100 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A27E1C.7000601@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> <45A27E1C.7000601@intertwingly.net> Message-ID: <op.tlullkuh64w2qv@id-c0020> On Mon, 08 Jan 2007 18:23:40 +0100, Sam Ruby <rubys at intertwingly.net> wrote: >> Because there is no difference between them. See the HTML5 >> specification. 
> > My point is that by "baking in" that behavior into the tokenizer, it > essentially limits that tokenizer to just supporting HTML5. By > providing one extra "bit" of information, the potential for reuse is > increased. Well, the next "bit" would probably be processing instructions. That's why it would be nice to have some formalization / standardization first to see how many changes are required exactly. Currently html5lib maps rather well to the specificaction which improves the readability of the code a lot (imho). I'd like to know at how many changes we're looking and how that impacts the code. > From a maintenance point of view, that is suboptimal. As > processSolidusInTag changes, that maintenance would need to occur in two > places. Well, the method isn't that big :-) >> Not sure how to do the .lower() stuff. I kind of guessed the reason you >> wanted to change that was because of a project like this :-) > > I've provided one way: by refactoring it so that all the lowercasing of > element names is done in exactly one place, and that the lowercasing of > attribute names is also done in exactly one place. That class can be > subclassed to provide a different behavior. Do you this as a standalone patch somewhere? As mentioned before, I'd like to see how it deals with non-ASCII characters. > Once this stabilized, I would them plan to look at having the UFP take > advantage of this library, if it is installed/available. I'd also > modify Venus, but such support would not need to be conditional there: > Venus could simply include html5lib. That'd be cool! I read today that actual usage and support is important if you want your library to be included in the default distribution. 
-- Anne van Kesteren <http://annevankesteren.nl/> <http://www.opera.com/> From rubys at intertwingly.net Mon Jan 8 10:46:27 2007 From: rubys at intertwingly.net (Sam Ruby) Date: Mon, 08 Jan 2007 13:46:27 -0500 Subject: [Imps] Liberal XML parsing In-Reply-To: <op.tlullkuh64w2qv@id-c0020> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> <45A27E1C.7000601@intertwingly.net> <op.tlullkuh64w2qv@id-c0020> Message-ID: <45A29183.6080403@intertwingly.net> Anne van Kesteren wrote: > On Mon, 08 Jan 2007 18:23:40 +0100, Sam Ruby <rubys at intertwingly.net> > wrote: >>> Because there is no difference between them. See the HTML5 >>> specification. >> >> My point is that by "baking in" that behavior into the tokenizer, it >> essentially limits that tokenizer to just supporting HTML5. By >> providing one extra "bit" of information, the potential for reuse is >> increased. > > Well, the next "bit" would probably be processing instructions. That's > why it would be nice to have some formalization / standardization first > to see how many changes are required exactly. I have no interest in XML processing instructions at this time. > Currently html5lib maps rather well to the specificaction which improves > the readability of the code a lot (imho). I'd like to know at how many > changes we're looking and how that impacts the code. That's why I provided a comprehensive patch: http://intertwingly.net/stories/2007/01/08/xhtml5.diff >>> Not sure how to do the .lower() stuff. I kind of guessed the reason >>> you wanted to change that was because of a project like this :-) >> >> I've provided one way: by refactoring it so that all the lowercasing >> of element names is done in exactly one place, and that the >> lowercasing of attribute names is also done in exactly one place. >> That class can be subclassed to provide a different behavior. > > Do you this as a standalone patch somewhere? 
As mentioned before, I'd > like to see how it deals with non-ASCII characters. The patch isn't all that big. The relevant portions are: asciiLower = dict([(ord(c),ord(c.lower())) for c in string.ascii_uppercase]) token["name"] = token["name"].translate(asciiLower) token["data"] = dict([(attr.translate(asciiLower), value) for attr,value in token["data"][::-1]]) - Sam Ruby From annevk at opera.com Mon Jan 8 14:44:40 2007 From: annevk at opera.com (Anne van Kesteren) Date: Mon, 08 Jan 2007 23:44:40 +0100 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A29183.6080403@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> <45A27E1C.7000601@intertwingly.net> <op.tlullkuh64w2qv@id-c0020> <45A29183.6080403@intertwingly.net> Message-ID: <op.tluxgqrl64w2qv@id-c0020> On Mon, 08 Jan 2007 19:46:27 +0100, Sam Ruby <rubys at intertwingly.net> wrote: >> Well, the next "bit" would probably be processing instructions. That's >> why it would be nice to have some formalization / standardization first >> to see how many changes are required exactly. > > I have no interest in XML processing instructions at this time. Fair enough. But if this is becoming the foundation of an (experimental) liberal XML parser we'll have interest in due course I reckon. If only for <?xbl?> and <?xml-stylesheet?>. >> Currently html5lib maps rather well to the specificaction which >> improves the readability of the code a lot (imho). I'd like to know at >> how many changes we're looking and how that impacts the code. > > That's why I provided a comprehensive patch: > > http://intertwingly.net/stories/2007/01/08/xhtml5.diff Instead of using string.ascii_uppercase you should use our internal asciiUppercase. Also, instead of using a dict for translating can't you just provide two strings? I'd think that would be faster. 
The normalizeToken method should be inlined as you only want to do that from a single place anyway. And EndTag should use the translate method and not .lower(). I suppose these changes also remove the need for asciiLowercase (not asciiLower that you introduce) as defined in constants.py. Anyway, with these nits (open for debate) I think I'm ok with doing this assuming you will update the tests as well (or someone else will). I'd like to have a liberal XML parser too one day and working on an experimental implementation of one can't hurt I suppose :-) If xhtml5parser.py is the only other file I would be fine with adding that to src/ as liberalxmlparser.py. Bit of a lengthty name, but it more accurately reflects what it is. -- Anne van Kesteren <http://annevankesteren.nl/> <http://www.opera.com/> From rubys at intertwingly.net Mon Jan 8 17:15:38 2007 From: rubys at intertwingly.net (Sam Ruby) Date: Mon, 08 Jan 2007 20:15:38 -0500 Subject: [Imps] Liberal XML parsing In-Reply-To: <op.tluxgqrl64w2qv@id-c0020> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> <45A27E1C.7000601@intertwingly.net> <op.tlullkuh64w2qv@id-c0020> <45A29183.6080403@intertwingly.net> <op.tluxgqrl64w2qv@id-c0020> Message-ID: <45A2ECBA.4040607@intertwingly.net> Anne van Kesteren wrote: > >>> Currently html5lib maps rather well to the specificaction which >>> improves the readability of the code a lot (imho). I'd like to know >>> at how many changes we're looking and how that impacts the code. >> >> That's why I provided a comprehensive patch: >> >> http://intertwingly.net/stories/2007/01/08/xhtml5.diff > > Instead of using string.ascii_uppercase you should use our internal > asciiUppercase. Also, instead of using a dict for translating can't you > just provide two strings? I'd think that would be faster. 
I don't understand the suggestion to use the internal asciiUppercase - with my patch, this constant is no longer used. And my constant was defined in the src/constants.py file... I also don't understand the suggestion to "just provide two strings". That's not how Python's unicode.translate() method works. > The normalizeToken method should be inlined as you only want to do that > from a single place anyway. And EndTag should use the translate method > and not .lower(). While it is true that normalizeToken is only called from one place, this method can't be inlined as the liberal XML parser subclass needs to override this behavior. > I suppose these changes also remove the need for asciiLowercase (not > asciiLower that you introduce) as defined in constants.py. asciiLowercase is still used in the portion of the logic dealing with DocTypes. But having two similarly named constants with quite different purposes is confusing, and clearly *that* should be changed. > Anyway, with these nits (open for debate) I think I'm ok with doing this > assuming you will update the tests as well (or someone else will). I'd > like to have a liberal XML parser too one day and working on an > experimental implementation of one can't hurt I suppose :-) In case you didn't notice it, here are the tests: http://intertwingly.net/stories/2007/01/08/tests/test_xhtml.py > If xhtml5parser.py is the only other file I would be fine with adding > that to src/ as liberalxmlparser.py. Bit of a lengthty name, but it more > accurately reflects what it is. I'm not worried about the the name. That name is fine. I'll look into committing this tomorrow, with your proposed module name, with the unit tests, and with some subset of these nits addressed. I'll add comments at the top of the module indicating that this support is experimental and subject to change and even removal at any time. 
- Sam Ruby From annevk at opera.com Mon Jan 8 17:27:49 2007 From: annevk at opera.com (Anne van Kesteren) Date: Tue, 09 Jan 2007 02:27:49 +0100 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A2ECBA.4040607@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> <45A27E1C.7000601@intertwingly.net> <op.tlullkuh64w2qv@id-c0020> <45A29183.6080403@intertwingly.net> <op.tluxgqrl64w2qv@id-c0020> <45A2ECBA.4040607@intertwingly.net> Message-ID: <op.tlu40nal64w2qv@id-c0020> On Tue, 09 Jan 2007 02:15:38 +0100, Sam Ruby <rubys at intertwingly.net> wrote: >>>> Currently html5lib maps rather well to the specificaction which >>>> improves the readability of the code a lot (imho). I'd like to know >>>> at how many changes we're looking and how that impacts the code. >>> >>> That's why I provided a comprehensive patch: >>> >>> http://intertwingly.net/stories/2007/01/08/xhtml5.diff >> Instead of using string.ascii_uppercase you should use our internal >> asciiUppercase. Also, instead of using a dict for translating can't you >> just provide two strings? I'd think that would be faster. > > I don't understand the suggestion to use the internal asciiUppercase - > with my patch, this constant is no longer used. And my constant was > defined in the src/constants.py file... But you haven't removed the constant either. But as you later note it's still used somewhere else... > I also don't understand the suggestion to "just provide two strings". > That's not how Python's unicode.translate() method works. Oh, sorry about that. I vaguely recalled translate() from http://diveintopython.org/performance_tuning/dictionary_lookups.html and apparently it works slightly different from what I remembered. Should we use string.maketrans? >> The normalizeToken method should be inlined as you only want to do that >> from a single place anyway. 
And EndTag should use the translate method >> and not .lower(). > > While it is true that normalizeToken is only called from one place, this > method can't be inlined as the liberal XML parser subclass needs to > override this behavior. Hmm, not so nice. For a large page that's a lot of additional method calls. Can you redo it a bit, making sure we at least don't make that call for non-tag tokens such as characters? >> I suppose these changes also remove the need for asciiLowercase (not >> asciiLower that you introduce) as defined in constants.py. > > asciiLowercase is still used in the portion of the logic dealing with > DocTypes. But having two similarly named constants with quite different > purposes is confusing, and clearly *that* should be changed. Yeah. >> Anyway, with these nits (open for debate) I think I'm ok with doing >> this assuming you will update the tests as well (or someone else will). >> I'd like to have a liberal XML parser too one day and working on an >> experimental implementation of one can't hurt I suppose :-) > > In case you didn't notice it, here are the tests: > > http://intertwingly.net/stories/2007/01/08/tests/test_xhtml.py I noticed those, but you also had some comments on updating tests. >> If xhtml5parser.py is the only other file I would be fine with adding >> that to src/ as liberalxmlparser.py. Bit of a lengthy name, but it >> more accurately reflects what it is. > > I'm not worried about the name. That name is fine. > > I'll look into committing this tomorrow, with your proposed module name, > with the unit tests, and with some subset of these nits addressed. I'll > add comments at the top of the module indicating that this support is > experimental and subject to change and even removal at any time. Ok, cool.
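For reference, the dict-versus-two-strings question can be illustrated with a small sketch. It assumes Python 3, where str.translate() takes a mapping of code points and str.maketrans() builds exactly such a mapping from two equal-length strings; in the Python 2 discussed in the thread, string.maketrans produced a byte table for str only, which is why unicode.translate() needed the dict. The names ascii_lower_dict, ascii_lower_table and ascii_lowercase are made up for the example.

```python
import string

# Hand-built mapping of code points, in the style of the patch.
ascii_lower_dict = {ord(c): ord(c.lower()) for c in string.ascii_uppercase}

# The "two strings" form: in Python 3 this returns the same kind of dict.
ascii_lower_table = str.maketrans(string.ascii_uppercase,
                                  string.ascii_lowercase)


def ascii_lowercase(text):
    """Lowercase only A-Z, leaving every non-ASCII character alone."""
    return text.translate(ascii_lower_dict)
```

The difference from .lower() shows up on non-ASCII input: ascii_lowercase("\u00c9CLAIR") keeps the leading capital E-acute, whereas "\u00c9CLAIR".lower() folds it as well.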
-- Anne van Kesteren <http://annevankesteren.nl/> <http://www.opera.com/> From teeshift at gmail.com Mon Jan 8 19:59:45 2007 From: teeshift at gmail.com (tee shift) Date: Tue, 9 Jan 2007 11:59:45 +0800 Subject: [Imps] adding object namespace function on canvas (so we could use them.) Message-ID: <18b3755f0701081959va5352e1o3606f64e67413d33@mail.gmail.com> Haven't read all of the spec yet. I wonder if in future canvas will have a feature like the following: beginPath circle = arc(30, 50, 0, Math.PI*2, 2) endPath ........ circle.moveTo(40,20) ..... We could assign a name to an object we have drawn so we could later change its attributes. In this example, the location, but we could also change the radius, center or size. Thanks. Tee Shift From annevk at opera.com Tue Jan 9 02:25:18 2007 From: annevk at opera.com (Anne van Kesteren) Date: Tue, 09 Jan 2007 11:25:18 +0100 Subject: [Imps] adding object namespace function on canvas (so we could use them.) In-Reply-To: <18b3755f0701081959va5352e1o3606f64e67413d33@mail.gmail.com> References: <18b3755f0701081959va5352e1o3606f64e67413d33@mail.gmail.com> Message-ID: <op.tlvtwgls64w2qv@id-c0020> On Tue, 09 Jan 2007 04:59:45 +0100, tee shift <teeshift at gmail.com> wrote: > Haven't read all of the spec yet. > I wonder if in future canvas will have a feature like the following: > > beginPath > circle = arc(30, 50, 0, Math.PI*2, 2) > endPath > ........ > circle.moveTo(40,20) > ..... > > We could assign a name to an object we have drawn so we could later change its > attributes. > In this example, the location, but we could also change the radius, center or size. The idea of <canvas> is that it doesn't consist of objects, but is just a bitmap you can draw upon. If you really want the objects you want SVG. Also, feedback like this ought to go to whatwg at whatwg.org.
-- Anne van Kesteren <http://annevankesteren.nl/> <http://www.opera.com/> From rubys at intertwingly.net Wed Jan 10 02:36:40 2007 From: rubys at intertwingly.net (Sam Ruby) Date: Wed, 10 Jan 2007 05:36:40 -0500 Subject: [Imps] True DOM TreeBuilder Message-ID: <45A4C1B8.1000708@intertwingly.net> I just committed a minidom.getDOMImplementation() based TreeBuilder to html5lib. Notes: 1) I had to monkey patch minidom in order to get text nodes that are immediate children of the document node to work. 2) Based on how html5 is spec'ed, the doctypes become "HTML" instead of "html", which is what you would expect in an XML DOM representation. 3) This implementation is not namespace aware, nor are the elements placed in the XHTML namespace. Demo: http://code.google.com/p/html5lib/ is purportedly XHTML 1.0 Strict, but is served as text/html and contains such dubious constructs as "<div id=gaia>". You can obtain a cleaned-up version of this page after a side trip through the DOM via: $ python parse.py -b dom -x http://code.google.com/p/html5lib/ In particular, note what the DOM's default "toxml()" method does to the script near the end of this page. - Sam Ruby From ian at hixie.ch Fri Jan 12 15:13:21 2007 From: ian at hixie.ch (Ian Hickson) Date: Fri, 12 Jan 2007 23:13:21 +0000 (UTC) Subject: [Imps] <style> processing change in the HTML5 parsing spec Message-ID: <Pine.LNX.4.62.0701122312130.4611@dhalsim.dreamhost.com> Ostensibly for compatibility with IE, the HTML5 parsing spec is going to get an experimental change to the processing of <style> elements. Instead of being moved to the <head>, they are now going to be left wherever they are found. In addition, <script> and <style> elements found during the "after head" mode will be inserted into the <head>, instead of being inserted between the <head> and the <body>. Let me know if any of this causes you problems. -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' From annevk at opera.com Sat Jan 13 02:46:31 2007 From: annevk at opera.com (Anne van Kesteren) Date: Sat, 13 Jan 2007 11:46:31 +0100 Subject: [Imps] <style> processing change in the HTML5 parsing spec In-Reply-To: <Pine.LNX.4.62.0701122312130.4611@dhalsim.dreamhost.com> References: <Pine.LNX.4.62.0701122312130.4611@dhalsim.dreamhost.com> Message-ID: <op.tl29jthb64w2qv@id-c0020> On Sat, 13 Jan 2007 00:13:21 +0100, Ian Hickson <ian at hixie.ch> wrote: > Ostensibly for compatibility with IE, the HTML5 parsing spec is going to > get an experimental change to the processing of <style> elements. > > Instead of being moved to the <head>, they are now going to be left > wherever they are found. > > In addition, <script> and <style> elements found during the "after head" > mode will be inserted into the <head>, instead of being inserted between > the <head> and the <body>. > > Let me know if any of this causes you problems. r475 of html5lib contains this change. Thanks for the testcases by the way! -- Anne van Kesteren <http://annevankesteren.nl/> <http://www.opera.com/> From annevk at opera.com Sun Jan 14 10:05:02 2007 From: annevk at opera.com (Anne van Kesteren) Date: Sun, 14 Jan 2007 19:05:02 +0100 Subject: [Imps] </body> and after body phase Message-ID: <op.tl5oio0f64w2qv@id-c0020> I need some kind of strategy that: <!doctype html><li></body> doesn't cause any parse errors but that <!doctype html><div><li></body> does. Note you can't actually imply </li> on </body> because that would break stuff. (This doesn't seem to be covered in the specification, but is assumed in at least one of the Google testcases (and makes sense).)
-- Anne van Kesteren <http://annevankesteren.nl/> <http://www.opera.com/> From rubys at intertwingly.net Mon Jan 15 00:39:26 2007 From: rubys at intertwingly.net (Sam Ruby) Date: Mon, 15 Jan 2007 03:39:26 -0500 Subject: [Imps] i18n message discussion Message-ID: <45AB3DBE.2040204@intertwingly.net> Just tossing this out for discussion... Typical error in html5lib: self.parser.parseError(_("Unexpected end of file. Expected end " u"tag (" + self.tree.openElements[1].name + u") first.")) Typical error in feedvalidator: self.log(UndefinedNamedEntity({'value':name})) Discussion: in the feedvalidator, each class of error is mapped to a Python class, and is parameterizable. The unit tests can verify that a specific error is (or is not) generated, and can even match on parameters. At runtime, each error class is mapped to a sprintf type of string, which can be language specific. The validator also uses the class name as the name of an html page which can contain more information and pointers to spec text or other information. Obviously, the reason why I am bringing this up is that I prefer the feedvalidator approach (credit due to Mark Pilgrim). - Sam Ruby From annevk at opera.com Mon Jan 15 02:17:04 2007 From: annevk at opera.com (Anne van Kesteren) Date: Mon, 15 Jan 2007 11:17:04 +0100 Subject: [Imps] i18n message discussion In-Reply-To: <45AB3DBE.2040204@intertwingly.net> References: <45AB3DBE.2040204@intertwingly.net> Message-ID: <op.tl6xiqcd64w2qv@id-c0020> On Mon, 15 Jan 2007 09:39:26 +0100, Sam Ruby <rubys at intertwingly.net> wrote: > Typical error in html5lib: > > self.parser.parseError(_("Unexpected end of file. Expected end " > u"tag (" + self.tree.openElements[1].name + u") first.")) > > Typical error in feedvalidator: > > self.log(UndefinedNamedEntity({'value':name})) I agree the latter is a lot better. It's not entirely clear how to do this though.
Every time you emit a parse error per the specification it's quite different and currently we have the ability to make the error messages as accurate as possible for every situation. I'd be happy with a more concrete proposal on how to do this though. -- Anne van Kesteren <http://annevankesteren.nl/> <http://www.opera.com/> From jg307 at cam.ac.uk Mon Jan 15 03:53:47 2007 From: jg307 at cam.ac.uk (James Graham) Date: Mon, 15 Jan 2007 11:53:47 +0000 Subject: [Imps] i18n message discussion In-Reply-To: <op.tl6xiqcd64w2qv@id-c0020> References: <45AB3DBE.2040204@intertwingly.net> <op.tl6xiqcd64w2qv@id-c0020> Message-ID: <45AB6B4B.6030304@cam.ac.uk> Anne van Kesteren wrote: > On Mon, 15 Jan 2007 09:39:26 +0100, Sam Ruby <rubys at intertwingly.net> > wrote: >> Typical error in html5lib: >> >> self.parser.parseError(_("Unexpected end of file. Expected end " >> u"tag (" + self.tree.openElements[1].name + u") first.")) >> >> Typical error in feedvalidator: >> >> self.log(UndefinedNamedEntity({'value':name})) > > I agree the latter is a lot better. It's not entirely clear how to do this > though. Every time you emit a parse error per the specification it's quite > different and currently we have the ability to make the error messages as > accurate as possible for every situation. I'd be happy with a more > concrete proposal on how to do this though. Well, with enough parameters to the class it shouldn't be impossible. Also, I've just set up a google groups group/mailing list for html5lib-specific issues such as these; html5lib-discuss at googlegroups.com - see http://groups.google.com/group/html5lib-discuss/topics -- "Eternity's a terrible thought. I mean, where's it all going to end?"
-- Tom Stoppard, Rosencrantz and Guildenstern are Dead From ian at hixie.ch Tue Jan 16 14:49:59 2007 From: ian at hixie.ch (Ian Hickson) Date: Tue, 16 Jan 2007 22:49:59 +0000 (UTC) Subject: [Imps] </body> and after body phase In-Reply-To: <op.tl5oio0f64w2qv@id-c0020> References: <op.tl5oio0f64w2qv@id-c0020> Message-ID: <Pine.LNX.4.62.0701162236190.4611@dhalsim.dreamhost.com> On Sun, 14 Jan 2007, Anne van Kesteren wrote: > > I need some kind of strategy that: > > <!doctype html><li></body> > > doesn't cause any parse errors but that > > <!doctype html><div><li></body> > > does. > > Note you can't actually imply </li> on </body> because that would break > stuff. > > (This doesn't seem to be covered in the specification, but is assumed in > at least one of the Google testcases (and makes sense).) Yeah, known issue. You just want to start at the end of the stack and walk back until the second element. If they are all elements that get closed when you generate implied end tags, and if the second element is <body>, then you're ok, otherwise, raise a parse error. -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
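The stack walk suggested above for </body> might be sketched as below. This is an illustration under assumptions, not spec text or html5lib code: the function name is made up, the stack of open elements is modelled as a plain list of tag names with the root first, and the set of elements closed by "generate implied end tags" is a small illustrative subset rather than the specification's actual list.

```python
# Assumption: an illustrative subset of the elements that are closed when
# implied end tags are generated; consult the spec for the real list.
IMPLIED_END_TAG_ELEMENTS = frozenset(["dd", "dt", "li", "p"])


def body_end_tag_is_clean(open_elements):
    """Return True if </body> should NOT raise a parse error.

    open_elements lists tag names from the root ("html") to the current
    node, mirroring the stack of open elements.
    """
    # The second element on the stack has to be <body>.
    if len(open_elements) < 2 or open_elements[1] != "body":
        return False
    # Everything above <body> must be closable by implied end tags.
    return all(name in IMPLIED_END_TAG_ELEMENTS
               for name in open_elements[2:])
```

For the two inputs in the thread, the stack at </body> would be html, body, li in the first case (no parse error) and html, body, div, li in the second (parse error, because of the div).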
From jg307 at cam.ac.uk Wed Jan 3 08:07:30 2007 From: jg307 at cam.ac.uk (James Graham) Date: Wed, 03 Jan 2007 16:07:30 +0000 Subject: [Imps] Adding "content model flags" to tokenization tests In-Reply-To: <a9699fd20612290330j6f9d5a7dkd24b472980b22449@mail.gmail.com> References: <a9699fd20612290330j6f9d5a7dkd24b472980b22449@mail.gmail.com> Message-ID: <459BD4C2.3060100@cam.ac.uk> Thomas Broyer wrote: > 2006/12/28, Thomas Broyer: >> 2006/12/23, James Graham: >> > >> > [1] An example of something that, at present can only be checked >> > through a parser test is the proper tokenizing of a fragment like >> > <plaintext><head>&body; >> >> How about adding a new "parameter" to tests to set the initial >> "content model flag" (defaulting to "PCDATA" if not present)? > > I've finally created some test cases (attached) with a > "contentModelFlags" property whose value is a list of "content model > flag"s. The test case is then run successively with the same input and > expected output but initialized with a different "content model flag". > If the property is not given, it defaults to ["PCDATA"] (a list with a > single value "PCDATA"). That's; I've added these to the html5lib svn repository and updated our test framework to run the new tests. -- "The universe doesn't care what you believe. The wonderful thing about science is that it doesn't ask for your faith, it just asks for your eyes" --- http://xkcd.com/c154.html From ian at hixie.ch Wed Jan 3 15:07:20 2007 From: ian at hixie.ch (Ian Hickson) Date: Wed, 3 Jan 2007 23:07:20 +0000 (UTC) Subject: [Imps] Reasonable limits on buffered values In-Reply-To: <BAY109-F26C90CA07674EF007E2364B4C60@phx.gbl> References: <BAY109-F26C90CA07674EF007E2364B4C60@phx.gbl> Message-ID: <Pine.LNX.4.62.0701032249260.4611@dhalsim.dreamhost.com> On Fri, 29 Dec 2006, Simon Pieters wrote: > > From: Henri Sivonen <hsivonen at iki.fi> > >I'm wondering if there's a best practice here. 
Is there data on how > >long non-malicious attribute values legitimately appear on the Web? I'll see if I can get some data. (No ETA.) > Additionally, .NET applications can have long attribute values too. See > "Figure 3. Simple page LessViewState.aspx with DataGrid1" at > > http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnaspnet/html/asp11222001.asp > > That's 3.05 KiB, but can get a lot longer depending on the number of > form controls, I think. I myself have written pages with significantly longer href="" attributes, e.g. when using long data: URIs of big images. The problem is that whatever limit you set, you'll always find a legitimate document that's bigger. It sounds stupid but the best practice really is to not have explicit limits, but instead to have algorithms that can handle any volume of input without exploding. It might be best, in fact, to limit CPU and memory usage, rather than attempting to limit input buffers. ("This page would take too many resources to handle.") That actually handles the billion laughs problem without having to special case anything to do with it. -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' From rubys at intertwingly.net Mon Jan 8 07:34:55 2007 From: rubys at intertwingly.net (Sam Ruby) Date: Mon, 08 Jan 2007 10:34:55 -0500 Subject: [Imps] Liberal XML parsing Message-ID: <45A2649F.2020308@intertwingly.net> I've posted a note on how the code in html5lib could serve as an excellent foundation for a number of "liberal" XML parsing tasks: http://www.intertwingly.net/blog/2007/01/08/Xhtml5lib Personally, I'm not overly interested in hearing more opinions as to whether or not there is a valid demand for liberal XML parsing. If you don't want to use it, don't. 
What I WOULD be interested in hearing opinions on is what would be the best way to maintain this code going forward: could it live as a separate module within html5lib repository? Should it be a separate repository? If separate, are there some changes to the tokenizer in particular that could be made that would either directly enable this usage or would make it easier to monkey-patch for usage by xhtml5lib? - Sam Ruby From annevk at opera.com Mon Jan 8 08:29:56 2007 From: annevk at opera.com (Anne van Kesteren) Date: Mon, 08 Jan 2007 17:29:56 +0100 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A2649F.2020308@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> Message-ID: <op.tluf36by64w2qv@id-c0020> On Mon, 08 Jan 2007 16:34:55 +0100, Sam Ruby <rubys at intertwingly.net> wrote: > I've posted a note on how the code in html5lib could serve as an > excellent foundation for a number of "liberal" XML parsing tasks: > > http://www.intertwingly.net/blog/2007/01/08/Xhtml5lib > > Personally, I'm not overly interested in hearing more opinions as to > whether or not there is a valid demand for liberal XML parsing. If you > don't want to use it, don't. I've nothing against liberal XML parsing and I would actually like it to be formalized somewhere, but I do think that calling it an XHTML5 parser is wrong given that XHTML5 as it stands now is supposed to be parsed by an XML parser. > What I WOULD be interested in hearing opinions on is what would be the > best way to maintain this code going forward: could it live as a > separate module within html5lib repository? Should it be a separate > repository? If separate, are there some changes to the tokenizer in > particular that could be made that would either directly enable this > usage or would make it easier to monkey-patch for usage by xhtml5lib? Can't you subclass the tokenizer? (I don't mind it being in the same repository as html5lib by the way. Not sure what the best location is.) 
-- Anne van Kesteren <http://annevankesteren.nl/> <http://www.opera.com/> From rubys at intertwingly.net Mon Jan 8 08:42:49 2007 From: rubys at intertwingly.net (Sam Ruby) Date: Mon, 08 Jan 2007 11:42:49 -0500 Subject: [Imps] Liberal XML parsing In-Reply-To: <op.tluf36by64w2qv@id-c0020> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> Message-ID: <45A27489.3090504@intertwingly.net> Anne van Kesteren wrote: > >> What I WOULD be interested in hearing opinions on is what would be the >> best way to maintain this code going forward: could it live as a >> separate module within html5lib repository? Should it be a separate >> repository? If separate, are there some changes to the tokenizer in >> particular that could be made that would either directly enable this >> usage or would make it easier to monkey-patch for usage by xhtml5lib? > > Can't you subclass the tokenizer? (I don't mind it being in the same > repository as html5lib by the way. Not sure what the best location is.) The current tokenizer has ".lower()" sprinkled throughout and doesn't expose in any meaningful way the difference between empty and start tags. For the tokenizer to be meaningfully subclassed (and by that, I mean without requiring wholesale duplication of a number of methods), these behaviors would need to be factored out into separate methods that could be overridden. - Sam Ruby From annevk at opera.com Mon Jan 8 08:48:12 2007 From: annevk at opera.com (Anne van Kesteren) Date: Mon, 08 Jan 2007 17:48:12 +0100 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A27489.3090504@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> Message-ID: <op.tlugyms364w2qv@id-c0020> On Mon, 08 Jan 2007 17:42:49 +0100, Sam Ruby <rubys at intertwingly.net> wrote: > The current tokenizer has ".lower()" sprinkled throughout and doesn't > expose in any meaningful way the difference between empty and start tags. 
Because there is no difference between them. See the HTML5 specification. > For the tokenizer to be meaningfully subclassed (and by that, I mean > without requiring wholesale duplication of a number of methods), these > behaviors would need to be factored out into separate methods that could > be overridden. You could subclass it and change processSolidusInTag. Instead of throwing an atheist parse error you would change the type of token to be "empty" or something. Not sure how to do the .lower() stuff. I kind of guessed the reason you wanted to change that was because of a project like this :-) -- Anne van Kesteren <http://annevankesteren.nl/> <http://www.opera.com/> From rubys at intertwingly.net Mon Jan 8 09:23:40 2007 From: rubys at intertwingly.net (Sam Ruby) Date: Mon, 08 Jan 2007 12:23:40 -0500 Subject: [Imps] Liberal XML parsing In-Reply-To: <op.tlugyms364w2qv@id-c0020> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> Message-ID: <45A27E1C.7000601@intertwingly.net> Anne van Kesteren wrote: > On Mon, 08 Jan 2007 17:42:49 +0100, Sam Ruby <rubys at intertwingly.net> > wrote: >> The current tokenizer has ".lower()" sprinkled throughout and doesn't >> expose in any meaningful way the difference between empty and start tags. > > Because there is no difference between them. See the HTML5 specification. My point is that by "baking in" that behavior into the tokenizer, it essentially limits that tokenizer to just supporting HTML5. By providing one extra "bit" of information, the potential for reuse is increased. Of course, the html5parser will need to ignore this extra bit, and my patch includes that change. >> For the tokenizer to be meaningfully subclassed (and by that, I mean >> without requiring wholesale duplication of a number of methods), these >> behaviors would need to be factored out into separate methods that >> could be overridden. 
> > You could subclass it and change processSolidusInTag. Instead of > throwing an atheist parse error you would change the type of token to be > "empty" or something. From a maintenance point of view, that is suboptimal. As processSolidusInTag changes, that maintenance would need to occur in two places. > Not sure how to do the .lower() stuff. I kind of guessed the reason you > wanted to change that was because of a project like this :-) I've provided one way: by refactoring it so that all the lowercasing of element names is done in exactly one place, and that the lowercasing of attribute names is also done in exactly one place. That class can be subclassed to provide a different behavior. - - - It is no secret that my interest in the WHATWG started with a dissatisfaction with Python's sgmllib, particularly when used as a foundation for parsing HTML, XHTML, or as a fallback parser for XML. What I see in html5lib is a *much* better foundation. I'm in no particular rush, but if after a few days it turns out that people are OK with something *like* this going into the html5lib repository, I'd love to put it in there -- at which point it would be free to evolve, be renamed, refactored, and enhanced. One thing I would love to work on is a true DOM builder (at which point, I could throw away my XMLDocument, XMLElement, and XMLComment classes), but I would need changes to TreeBuilder so that I could provide my own Text class (for example). Needless to say, such a treebuilder could also be used with HTML5. Once this stabilized, I would them plan to look at having the UFP take advantage of this library, if it is installed/available. I'd also modify Venus, but such support would not need to be conditional there: Venus could simply include html5lib. 
- Sam Ruby From jg307 at cam.ac.uk Mon Jan 8 09:27:09 2007 From: jg307 at cam.ac.uk (James Graham) Date: Mon, 08 Jan 2007 17:27:09 +0000 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A2649F.2020308@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> Message-ID: <45A27EED.7000508@cam.ac.uk> Sam Ruby wrote: > What I WOULD be interested in hearing opinions on is what would be the > best way to maintain this code going forward: could it live as a > separate module within html5lib repository? Should it be a separate > repository? I'm open to hosting it in the same repository; the only issue that I see is that people may conflate the two parts and be put off downloading html5lib because they think it is a liberal XML parser or xhtml5lib because they think it is a html-only project. > If separate, are there some changes to the tokenizer in > particular that could be made that would either directly enable this > usage or would make it easier to monkey-patch for usage by xhtml5lib? Assuming the patches needed don't cause severe regressions in the code readability or performance of html5lib I think the existing tokenizer would be the right place to apply them. -- "Eternity's a terrible thought. I mean, where's it all going to end?" -- Tom Stoppard, Rosencrantz and Guildenstern are Dead From jg307 at cam.ac.uk Mon Jan 8 09:41:41 2007 From: jg307 at cam.ac.uk (James Graham) Date: Mon, 08 Jan 2007 17:41:41 +0000 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A27E1C.7000601@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> <45A27E1C.7000601@intertwingly.net> Message-ID: <45A28255.2050509@cam.ac.uk> Sam Ruby wrote: > I've provided one way: by refactoring it so that all the lowercasing of > element names is done in exactly one place, and that the lowercasing of > attribute names is also done in exactly one place. 
That class can be > subclassed to provide a different behavior. That sounds fine to me. We need to add some unicode tests though to be sure we're not lowercasing where we shouldn't be. > I'm in no particular rush, but if after a few days it turns out that > people are OK with something *like* this going into the html5lib > repository, I'd love to put it in there -- at which point it would be > free to evolve, be renamed, refactored, and enhanced. One thing I would > love to work on is a true DOM builder (at which point, I could throw > away my XMLDocument, XMLElement, and XMLComment classes), but I would > need changes to TreeBuilder so that I could provide my own Text class > (for example). FWIW I consider supporting one of the python DOM implementations a priority for the 0.3 release of html5lib (of course we need to release 0.2 first -- at this point that is basically a case of uploading the source archive). Using the current treebuilder interface it should be possible to support DOM-like text nodes without any changes but it's non-trivial so maybe the current interface is in need of improvement (the problem is that we aslo need to support ElementTree which regards text as attributes). -- "Eternity's a terrible thought. I mean, where's it all going to end?" -- Tom Stoppard, Rosencrantz and Guildenstern are Dead From annevk at opera.com Mon Jan 8 10:28:22 2007 From: annevk at opera.com (Anne van Kesteren) Date: Mon, 08 Jan 2007 19:28:22 +0100 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A27E1C.7000601@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> <45A27E1C.7000601@intertwingly.net> Message-ID: <op.tlullkuh64w2qv@id-c0020> On Mon, 08 Jan 2007 18:23:40 +0100, Sam Ruby <rubys at intertwingly.net> wrote: >> Because there is no difference between them. See the HTML5 >> specification. 
> > My point is that by "baking in" that behavior into the tokenizer, it > essentially limits that tokenizer to just supporting HTML5. By > providing one extra "bit" of information, the potential for reuse is > increased. Well, the next "bit" would probably be processing instructions. That's why it would be nice to have some formalization / standardization first to see how many changes are required exactly. Currently html5lib maps rather well to the specificaction which improves the readability of the code a lot (imho). I'd like to know at how many changes we're looking and how that impacts the code. > From a maintenance point of view, that is suboptimal. As > processSolidusInTag changes, that maintenance would need to occur in two > places. Well, the method isn't that big :-) >> Not sure how to do the .lower() stuff. I kind of guessed the reason you >> wanted to change that was because of a project like this :-) > > I've provided one way: by refactoring it so that all the lowercasing of > element names is done in exactly one place, and that the lowercasing of > attribute names is also done in exactly one place. That class can be > subclassed to provide a different behavior. Do you this as a standalone patch somewhere? As mentioned before, I'd like to see how it deals with non-ASCII characters. > Once this stabilized, I would them plan to look at having the UFP take > advantage of this library, if it is installed/available. I'd also > modify Venus, but such support would not need to be conditional there: > Venus could simply include html5lib. That'd be cool! I read today that actual usage and support is important if you want your library to be included in the default distribution. 
-- Anne van Kesteren <http://annevankesteren.nl/> <http://www.opera.com/> From rubys at intertwingly.net Mon Jan 8 10:46:27 2007 From: rubys at intertwingly.net (Sam Ruby) Date: Mon, 08 Jan 2007 13:46:27 -0500 Subject: [Imps] Liberal XML parsing In-Reply-To: <op.tlullkuh64w2qv@id-c0020> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> <45A27E1C.7000601@intertwingly.net> <op.tlullkuh64w2qv@id-c0020> Message-ID: <45A29183.6080403@intertwingly.net> Anne van Kesteren wrote: > On Mon, 08 Jan 2007 18:23:40 +0100, Sam Ruby <rubys at intertwingly.net> > wrote: >>> Because there is no difference between them. See the HTML5 >>> specification. >> >> My point is that by "baking in" that behavior into the tokenizer, it >> essentially limits that tokenizer to just supporting HTML5. By >> providing one extra "bit" of information, the potential for reuse is >> increased. > > Well, the next "bit" would probably be processing instructions. That's > why it would be nice to have some formalization / standardization first > to see how many changes are required exactly. I have no interest in XML processing instructions at this time. > Currently html5lib maps rather well to the specificaction which improves > the readability of the code a lot (imho). I'd like to know at how many > changes we're looking and how that impacts the code. That's why I provided a comprehensive patch: http://intertwingly.net/stories/2007/01/08/xhtml5.diff >>> Not sure how to do the .lower() stuff. I kind of guessed the reason >>> you wanted to change that was because of a project like this :-) >> >> I've provided one way: by refactoring it so that all the lowercasing >> of element names is done in exactly one place, and that the >> lowercasing of attribute names is also done in exactly one place. >> That class can be subclassed to provide a different behavior. > > Do you this as a standalone patch somewhere? 
As mentioned before, I'd > like to see how it deals with non-ASCII characters. The patch isn't all that big. The relevant portions are: asciiLower = dict([(ord(c),ord(c.lower())) for c in string.ascii_uppercase]) token["name"] = token["name"].translate(asciiLower) token["data"] = dict([(attr.translate(asciiLower), value) for attr,value in token["data"][::-1]]) - Sam Ruby From annevk at opera.com Mon Jan 8 14:44:40 2007 From: annevk at opera.com (Anne van Kesteren) Date: Mon, 08 Jan 2007 23:44:40 +0100 Subject: [Imps] Liberal XML parsing In-Reply-To: <45A29183.6080403@intertwingly.net> References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> <45A27E1C.7000601@intertwingly.net> <op.tlullkuh64w2qv@id-c0020> <45A29183.6080403@intertwingly.net> Message-ID: <op.tluxgqrl64w2qv@id-c0020> On Mon, 08 Jan 2007 19:46:27 +0100, Sam Ruby <rubys at intertwingly.net> wrote: >> Well, the next "bit" would probably be processing instructions. That's >> why it would be nice to have some formalization / standardization first >> to see how many changes are required exactly. > > I have no interest in XML processing instructions at this time. Fair enough. But if this is becoming the foundation of an (experimental) liberal XML parser we'll have interest in due course I reckon. If only for <?xbl?> and <?xml-stylesheet?>. >> Currently html5lib maps rather well to the specificaction which >> improves the readability of the code a lot (imho). I'd like to know at >> how many changes we're looking and how that impacts the code. > > That's why I provided a comprehensive patch: > > http://intertwingly.net/stories/2007/01/08/xhtml5.diff Instead of using string.ascii_uppercase you should use our internal asciiUppercase. Also, instead of using a dict for translating can't you just provide two strings? I'd think that would be faster. 
The normalizeToken method should be inlined as you only want to do that
from a single place anyway. And EndTag should use the translate method
and not .lower().

I suppose these changes also remove the need for asciiLowercase (not
asciiLower that you introduce) as defined in constants.py.

Anyway, with these nits (open for debate) I think I'm ok with doing this
assuming you will update the tests as well (or someone else will). I'd
like to have a liberal XML parser too one day and working on an
experimental implementation of one can't hurt I suppose :-)

If xhtml5parser.py is the only other file I would be fine with adding
that to src/ as liberalxmlparser.py. Bit of a lengthy name, but it more
accurately reflects what it is.

-- 
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>

From rubys at intertwingly.net Mon Jan 8 17:15:38 2007
From: rubys at intertwingly.net (Sam Ruby)
Date: Mon, 08 Jan 2007 20:15:38 -0500
Subject: [Imps] Liberal XML parsing
In-Reply-To: <op.tluxgqrl64w2qv@id-c0020>
References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> <45A27E1C.7000601@intertwingly.net> <op.tlullkuh64w2qv@id-c0020> <45A29183.6080403@intertwingly.net> <op.tluxgqrl64w2qv@id-c0020>
Message-ID: <45A2ECBA.4040607@intertwingly.net>

Anne van Kesteren wrote:
>
>>> Currently html5lib maps rather well to the specification which
>>> improves the readability of the code a lot (imho). I'd like to know
>>> at how many changes we're looking and how that impacts the code.
>>
>> That's why I provided a comprehensive patch:
>>
>> http://intertwingly.net/stories/2007/01/08/xhtml5.diff
>
> Instead of using string.ascii_uppercase you should use our internal
> asciiUppercase. Also, instead of using a dict for translating can't you
> just provide two strings? I'd think that would be faster.
I don't understand the suggestion to use the internal asciiUppercase -
with my patch, this constant is no longer used. And my constant was
defined in the src/constants.py file...

I also don't understand the suggestion to "just provide two strings".
That's not how Python's unicode.translate() method works.

> The normalizeToken method should be inlined as you only want to do that
> from a single place anyway. And EndTag should use the translate method
> and not .lower().

While it is true that normalizeToken is only called from one place, this
method can't be inlined as the liberal XML parser subclass needs to
override this behavior.

> I suppose these changes also remove the need for asciiLowercase (not
> asciiLower that you introduce) as defined in constants.py.

asciiLowercase is still used in the portion of the logic dealing with
DocTypes. But having two similarly named constants with quite different
purposes is confusing, and clearly *that* should be changed.

> Anyway, with these nits (open for debate) I think I'm ok with doing this
> assuming you will update the tests as well (or someone else will). I'd
> like to have a liberal XML parser too one day and working on an
> experimental implementation of one can't hurt I suppose :-)

In case you didn't notice it, here are the tests:

http://intertwingly.net/stories/2007/01/08/tests/test_xhtml.py

> If xhtml5parser.py is the only other file I would be fine with adding
> that to src/ as liberalxmlparser.py. Bit of a lengthy name, but it more
> accurately reflects what it is.

I'm not worried about the name. That name is fine.

I'll look into committing this tomorrow, with your proposed module name,
with the unit tests, and with some subset of these nits addressed. I'll
add comments at the top of the module indicating that this support is
experimental and subject to change and even removal at any time.
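The reason normalizeToken stays a separate method can be sketched like this. It is a rough illustration in Python 3, not the code from the patch: the class names Tokenizer and LiberalXMLTokenizer, the method name normalize_token, and the "StartTag"/"EndTag" token type strings are assumptions made for the example.

```python
import string

# Hypothetical ASCII-only lowercasing table (code point -> code point).
ASCII_LOWER = {ord(c): ord(c.lower()) for c in string.ascii_uppercase}


class Tokenizer:
    """Stand-in for an HTML tokenizer; not the real html5lib class."""

    def normalize_token(self, token):
        # HTML behavior: element and attribute names are ASCII-lowercased
        # in exactly one place.
        if token["type"] in ("StartTag", "EndTag"):
            token["name"] = token["name"].translate(ASCII_LOWER)
        if token["type"] == "StartTag":
            # Reversing the (name, value) list before building the dict
            # means the first occurrence of a duplicate attribute wins.
            token["data"] = {attr.translate(ASCII_LOWER): value
                             for attr, value in token["data"][::-1]}
        return token


class LiberalXMLTokenizer(Tokenizer):
    """Subclass for XML-ish input, where names are case-sensitive."""

    def normalize_token(self, token):
        # Override: keep the case of names, but still collapse attributes.
        if token["type"] == "StartTag":
            token["data"] = dict(token["data"][::-1])
        return token


if __name__ == "__main__":
    html = Tokenizer().normalize_token(
        {"type": "StartTag", "name": "DIV", "data": [("ID", "a")]})
    xml = LiberalXMLTokenizer().normalize_token(
        {"type": "StartTag", "name": "linearGradient",
         "data": [("viewBox", "0 0 1 1")]})
    print(html["name"], html["data"])
    print(xml["name"], xml["data"])
```

If the lowercasing were inlined at the call site, the subclass would have no single hook to replace, which is the trade-off being argued over here.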
- Sam Ruby

From annevk at opera.com Mon Jan 8 17:27:49 2007
From: annevk at opera.com (Anne van Kesteren)
Date: Tue, 09 Jan 2007 02:27:49 +0100
Subject: [Imps] Liberal XML parsing
In-Reply-To: <45A2ECBA.4040607@intertwingly.net>
References: <45A2649F.2020308@intertwingly.net> <op.tluf36by64w2qv@id-c0020> <45A27489.3090504@intertwingly.net> <op.tlugyms364w2qv@id-c0020> <45A27E1C.7000601@intertwingly.net> <op.tlullkuh64w2qv@id-c0020> <45A29183.6080403@intertwingly.net> <op.tluxgqrl64w2qv@id-c0020> <45A2ECBA.4040607@intertwingly.net>
Message-ID: <op.tlu40nal64w2qv@id-c0020>

On Tue, 09 Jan 2007 02:15:38 +0100, Sam Ruby <rubys at intertwingly.net>
wrote:
>>>> Currently html5lib maps rather well to the specification which
>>>> improves the readability of the code a lot (imho). I'd like to know
>>>> at how many changes we're looking and how that impacts the code.
>>>
>>> That's why I provided a comprehensive patch:
>>>
>>> http://intertwingly.net/stories/2007/01/08/xhtml5.diff
>> Instead of using string.ascii_uppercase you should use our internal
>> asciiUppercase. Also, instead of using a dict for translating can't you
>> just provide two strings? I'd think that would be faster.
>
> I don't understand the suggestion to use the internal asciiUppercase -
> with my patch, this constant is no longer used. And my constant was
> defined in the src/constants.py file...

But you haven't removed the constant either. But as you later note it's
still used somewhere else...

> I also don't understand the suggestion to "just provide two strings".
> That's not how Python's unicode.translate() method works.

Oh, sorry about that. I vaguely recalled translate() from
http://diveintopython.org/performance_tuning/dictionary_lookups.html and
apparently it works slightly differently from what I remembered. Should
we use string.maketrans?

>> The normalizeToken method should be inlined as you only want to do that
>> from a single place anyway.
>> And EndTag should use the translate method and not .lower().
>
> While it is true that normalizeToken is only called from one place, this
> method can't be inlined as the liberal XML parser subclass needs to
> override this behavior.

Hmm, not so nice. For a large page that's a lot of additional method
calls. Can you redo it a bit so that we at least don't make that call
for non-tag tokens, such as characters?

>> I suppose these changes also remove the need for asciiLowercase (not
>> asciiLower that you introduce) as defined in constants.py.
>
> asciiLowercase is still used in the portion of the logic dealing with
> DocTypes. But having two similarly named constants with quite different
> purposes is confusing, and clearly *that* should be changed.

Yeah.

>> Anyway, with these nits (open for debate) I think I'm ok with doing
>> this assuming you will update the tests as well (or someone else will).
>> I'd like to have a liberal XML parser too one day and working on an
>> experimental implementation of one can't hurt I suppose :-)
>
> In case you didn't notice it, here are the tests:
>
> http://intertwingly.net/stories/2007/01/08/tests/test_xhtml.py

I noticed those, but you also had some comments on updating tests.

>> If xhtml5parser.py is the only other file I would be fine with adding
>> that to src/ as liberalxmlparser.py. Bit of a lengthy name, but it
>> more accurately reflects what it is.
>
> I'm not worried about the name. That name is fine.
>
> I'll look into committing this tomorrow, with your proposed module name,
> with the unit tests, and with some subset of these nits addressed. I'll
> add comments at the top of the module indicating that this support is
> experimental and subject to change and even removal at any time.

Ok, cool.
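Two of the points above, building the table from two strings and skipping the normalization call for non-tag tokens, can be sketched together. This assumes present-day Python 3, where str.maketrans() accepts two equal-length strings; in the Python 2 of this thread, string.maketrans() produced byte-string tables and unicode.translate() needed a mapping. The names TAG_TOKENS, emit and lowercase_name, and the token type strings, are invented for the example.

```python
import string

# Python 3 only: build the translation table from two equal-length strings.
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

# Hypothetical token type names; only these carry a name to normalize.
TAG_TOKENS = frozenset(["StartTag", "EndTag"])


def lowercase_name(token):
    """Normalizer used for tag tokens: ASCII-lowercase the name."""
    token["name"] = token["name"].translate(ASCII_LOWER)
    return token


def emit(tokens, normalize):
    """Yield tokens, calling normalize() only for tag tokens.

    Character, comment and other tokens skip the extra call entirely,
    which is the saving asked for on large pages.
    """
    for token in tokens:
        if token["type"] in TAG_TOKENS:
            token = normalize(token)
        yield token


if __name__ == "__main__":
    stream = [
        {"type": "StartTag", "name": "BODY", "data": []},
        {"type": "Characters", "data": "Hello"},
        {"type": "EndTag", "name": "BODY"},
    ]
    for tok in emit(stream, lowercase_name):
        print(tok)
```

Passing the normalizer in as a callable keeps the override point that a subclass or alternative parser needs, while the type check keeps the per-token cost down to one set lookup.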
-- 
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>

From teeshift at gmail.com Mon Jan 8 19:59:45 2007
From: teeshift at gmail.com (tee shift)
Date: Tue, 9 Jan 2007 11:59:45 +0800
Subject: [Imps] adding object namespace function on canvas (so we could use them.)
Message-ID: <18b3755f0701081959va5352e1o3606f64e67413d33@mail.gmail.com>

Haven't read all of the spec yet.
I wonder if in future canvas will have a feature like the following:

beginPath
circle = arc(30, 50, 0, Math.PI*2, 2)
endPath
........
circle.moveTo(40,20)
.....

We could assign a name to an object we have drawn so that we could later
change its attributes.
In this example the location changes, but we could also change the
radius, center or size.

Thanks.
Tee Shift
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.whatwg.org/pipermail/implementors-whatwg.org/attachments/20070109/8693045e/attachment-0002.htm>

From annevk at opera.com Tue Jan 9 02:25:18 2007
From: annevk at opera.com (Anne van Kesteren)
Date: Tue, 09 Jan 2007 11:25:18 +0100
Subject: [Imps] adding object namespace function on canvas (so we could use them.)
In-Reply-To: <18b3755f0701081959va5352e1o3606f64e67413d33@mail.gmail.com>
References: <18b3755f0701081959va5352e1o3606f64e67413d33@mail.gmail.com>
Message-ID: <op.tlvtwgls64w2qv@id-c0020>

On Tue, 09 Jan 2007 04:59:45 +0100, tee shift <teeshift at gmail.com>
wrote:
> Haven't read all of the spec yet.
> I wonder if in future canvas will have a feature like the following:
>
> beginPath
> circle = arc(30, 50, 0, Math.PI*2, 2)
> endPath
> ........
> circle.moveTo(40,20)
> .....
>
> We could assign a name to an object we have drawn so that we could later
> change its attributes.
> In this example the location changes, but we could also change the
> radius, center or size.

The idea of <canvas> is that it doesn't consist of objects, but is just a
bitmap you can draw upon. If you really want the objects you want SVG.

Also, feedback like this ought to go to whatwg at whatwg.org.
-- 
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>

From rubys at intertwingly.net Wed Jan 10 02:36:40 2007
From: rubys at intertwingly.net (Sam Ruby)
Date: Wed, 10 Jan 2007 05:36:40 -0500
Subject: [Imps] True DOM TreeBuilder
Message-ID: <45A4C1B8.1000708@intertwingly.net>

I just committed a minidom.getDOMImplementation() based TreeBuilder to
html5lib.

Notes:

1) I had to monkey patch minidom in order to get text nodes that are
   immediate children of the document node to work.

2) Based on how html5 is spec'ed, the doctypes become "HTML" instead of
   "html", which is what you would expect in an XML DOM representation.

3) This implementation is not namespace aware, nor are the elements
   placed in the XHTML namespace.

Demo:

http://code.google.com/p/html5lib/ is purportedly XHTML 1.0 Strict, but
is served as text/html and contains such dubious constructs as
"<div id=gaia>". You can obtain a cleaned-up version of this page after
a side trip through the DOM via:

  $ python parse.py -b dom -x http://code.google.com/p/html5lib/

In particular, note what the DOM's default "toxml()" method does to the
script near the end of this page.

- Sam Ruby

From ian at hixie.ch Fri Jan 12 15:13:21 2007
From: ian at hixie.ch (Ian Hickson)
Date: Fri, 12 Jan 2007 23:13:21 +0000 (UTC)
Subject: [Imps] <style> processing change in the HTML5 parsing spec
Message-ID: <Pine.LNX.4.62.0701122312130.4611@dhalsim.dreamhost.com>

Ostensibly for compatibility with IE, the HTML5 parsing spec is going to
get an experimental change to the processing of <style> elements.

Instead of being moved to the <head>, they are now going to be left
wherever they are found.

In addition, <script> and <style> elements found during the "after head"
mode will be inserted into the <head>, instead of being inserted between
the <head> and the <body>.

Let me know if any of this causes you problems.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

From annevk at opera.com Sat Jan 13 02:46:31 2007
From: annevk at opera.com (Anne van Kesteren)
Date: Sat, 13 Jan 2007 11:46:31 +0100
Subject: [Imps] <style> processing change in the HTML5 parsing spec
In-Reply-To: <Pine.LNX.4.62.0701122312130.4611@dhalsim.dreamhost.com>
References: <Pine.LNX.4.62.0701122312130.4611@dhalsim.dreamhost.com>
Message-ID: <op.tl29jthb64w2qv@id-c0020>

On Sat, 13 Jan 2007 00:13:21 +0100, Ian Hickson <ian at hixie.ch> wrote:
> Ostensibly for compatibility with IE, the HTML5 parsing spec is going to
> get an experimental change to the processing of <style> elements.
>
> Instead of being moved to the <head>, they are now going to be left
> wherever they are found.
>
> In addition, <script> and <style> elements found during the "after head"
> mode will be inserted into the <head>, instead of being inserted between
> the <head> and the <body>.
>
> Let me know if any of this causes you problems.

r475 of html5lib contains this change. Thanks for the testcases by the
way!

-- 
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>

From annevk at opera.com Sun Jan 14 10:05:02 2007
From: annevk at opera.com (Anne van Kesteren)
Date: Sun, 14 Jan 2007 19:05:02 +0100
Subject: [Imps] </body> and after body phase
Message-ID: <op.tl5oio0f64w2qv@id-c0020>

I need some kind of strategy so that:

  <!doctype html><li></body>

doesn't cause any parse errors but that

  <!doctype html><div><li></body>

does.

Note you can't actually imply </li> on </body> because that would break
stuff.

(This doesn't seem to be covered in the specification, but is assumed in
at least one of the Google testcases (and makes sense).)
-- 
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>

From rubys at intertwingly.net Mon Jan 15 00:39:26 2007
From: rubys at intertwingly.net (Sam Ruby)
Date: Mon, 15 Jan 2007 03:39:26 -0500
Subject: [Imps] i18n message discussion
Message-ID: <45AB3DBE.2040204@intertwingly.net>

Just tossing this out for discussion...

Typical error in html5lib:

  self.parser.parseError(_("Unexpected end of file. Expected end "
    u"tag (" + self.tree.openElements[1].name + u") first."))

Typical error in feedvalidator:

  self.log(UndefinedNamedEntity({'value':name}))

Discussion: in the feedvalidator, each class of error is mapped to a
Python class, and is parameterizable. The unit tests can verify that a
specific error is (or is not) generated, and can even match on
parameters. At runtime, each error class is mapped to a sprintf type of
string, which can be language specific. The validator also uses the
class name as the name of an html page which can contain more
information and pointers to spec text or other information.

Obviously, the reason why I am bringing this up is that I prefer the
feedvalidator approach (credit due to Mark Pilgrim).

- Sam Ruby

From annevk at opera.com Mon Jan 15 02:17:04 2007
From: annevk at opera.com (Anne van Kesteren)
Date: Mon, 15 Jan 2007 11:17:04 +0100
Subject: [Imps] i18n message discussion
In-Reply-To: <45AB3DBE.2040204@intertwingly.net>
References: <45AB3DBE.2040204@intertwingly.net>
Message-ID: <op.tl6xiqcd64w2qv@id-c0020>

On Mon, 15 Jan 2007 09:39:26 +0100, Sam Ruby <rubys at intertwingly.net>
wrote:
> Typical error in html5lib:
>
> self.parser.parseError(_("Unexpected end of file. Expected end "
> u"tag (" + self.tree.openElements[1].name + u") first."))
>
> Typical error in feedvalidator:
>
> self.log(UndefinedNamedEntity({'value':name}))

I agree the latter is a lot better. It's not entirely clear how to do
this though.
Every time you emit a parse error per the specification it's quite
different and currently we have the ability to make the error messages
as accurate as possible for every situation. I'd be happy with a more
concrete proposal on how to do this though.

-- 
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>

From jg307 at cam.ac.uk Mon Jan 15 03:53:47 2007
From: jg307 at cam.ac.uk (James Graham)
Date: Mon, 15 Jan 2007 11:53:47 +0000
Subject: [Imps] i18n message discussion
In-Reply-To: <op.tl6xiqcd64w2qv@id-c0020>
References: <45AB3DBE.2040204@intertwingly.net> <op.tl6xiqcd64w2qv@id-c0020>
Message-ID: <45AB6B4B.6030304@cam.ac.uk>

Anne van Kesteren wrote:
> On Mon, 15 Jan 2007 09:39:26 +0100, Sam Ruby <rubys at intertwingly.net>
> wrote:
>> Typical error in html5lib:
>>
>> self.parser.parseError(_("Unexpected end of file. Expected end "
>> u"tag (" + self.tree.openElements[1].name + u") first."))
>>
>> Typical error in feedvalidator:
>>
>> self.log(UndefinedNamedEntity({'value':name}))
>
> I agree the latter is a lot better. It's not entirely clear how to do this
> though. Every time you emit a parse error per the specification it's quite
> different and currently we have the ability to make the error messages as
> accurate as possible for every situation. I'd be happy with a more
> concrete proposal on how to do this though.

Well, with enough parameters to the class it shouldn't be impossible.

Also, I've just set up a Google Groups group/mailing list for
html5lib-specific issues such as these: html5lib-discuss at
googlegroups.com - see
http://groups.google.com/group/html5lib-discuss/topics

-- 
"Eternity's a terrible thought. I mean, where's it all going to end?"
-- Tom Stoppard, Rosencrantz and Guildenstern are Dead

From ian at hixie.ch Tue Jan 16 14:49:59 2007
From: ian at hixie.ch (Ian Hickson)
Date: Tue, 16 Jan 2007 22:49:59 +0000 (UTC)
Subject: [Imps] </body> and after body phase
In-Reply-To: <op.tl5oio0f64w2qv@id-c0020>
References: <op.tl5oio0f64w2qv@id-c0020>
Message-ID: <Pine.LNX.4.62.0701162236190.4611@dhalsim.dreamhost.com>

On Sun, 14 Jan 2007, Anne van Kesteren wrote:
>
> I need some kind of strategy so that:
>
> <!doctype html><li></body>
>
> doesn't cause any parse errors but that
>
> <!doctype html><div><li></body>
>
> does.
>
> Note you can't actually imply </li> on </body> because that would break
> stuff.
>
> (This doesn't seem to be covered in the specification, but is assumed in
> at least one of the Google testcases (and makes sense).)

Yeah, known issue. You just want to start at the end of the stack and
walk back until the second element. If they are all elements that get
closed when you generate implied end tags, and if the second element is
<body>, then you're ok, otherwise, raise a parse error.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
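
The stack walk described in the message above can be sketched roughly as follows. The function name and the representation of the stack of open elements as a list of tag names are made up for the illustration, and the set of elements closed by "generate implied end tags" is only an illustrative subset, so check it against the parsing spec before relying on it.

```python
# Illustrative subset of elements closed by "generate implied end tags";
# an assumption for this sketch, not a copy of the spec's list.
IMPLIED_END_TAG_ELEMENTS = frozenset(["dd", "dt", "li", "p"])


def body_end_tag_is_clean(open_elements):
    """Return True if </body> can be handled without a parse error.

    open_elements is a hypothetical stack of tag names, bottom first,
    e.g. ["html", "body", "li"].
    """
    # Walk from the end of the stack back towards the second element.
    for name in reversed(open_elements[2:]):
        if name not in IMPLIED_END_TAG_ELEMENTS:
            return False
    # The second element itself has to be <body>.
    return len(open_elements) >= 2 and open_elements[1] == "body"


if __name__ == "__main__":
    # <!doctype html><li></body>
    print(body_end_tag_is_clean(["html", "body", "li"]))
    # <!doctype html><div><li></body>
    print(body_end_tag_is_clean(["html", "body", "div", "li"]))
```

Note that the check only inspects the stack; nothing is popped, which matches the remark in the original question that </li> cannot actually be implied on </body>.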