From t.broyer at gmail.com  Fri Jun  8 11:38:29 2007
From: t.broyer at gmail.com (Thomas Broyer)
Date: Fri, 8 Jun 2007 20:38:29 +0200
Subject: [Imps] [whatwg] HTML5Lib Test Suite vs. PHP 5 HTML Parser
In-Reply-To: <2A1705BE-95FD-4A9C-9DFC-E25786F0AF4B@michelf.com>
References: <2A1705BE-95FD-4A9C-9DFC-E25786F0AF4B@michelf.com>
Message-ID:

2007/6/8, Michel Fortin:
> Perhaps someone will find this raw data interesting. I've made a
> script to run the HTML5Lib test cases against the built-in HTML
> parser in PHP 5. And here's the result:
>
>

Have you tried PH5P (pure PHP HTML5 parser)?
http://jero.net/lab/ph5p/

[CC'd Implementors list, please follow-up there]

-- 
Thomas Broyer

From michel.fortin at michelf.com  Fri Jun  8 12:12:56 2007
From: michel.fortin at michelf.com (Michel Fortin)
Date: Fri, 08 Jun 2007 15:12:56 -0400
Subject: [Imps] [whatwg] HTML5Lib Test Suite vs. PHP 5 HTML Parser
In-Reply-To:
References: <2A1705BE-95FD-4A9C-9DFC-E25786F0AF4B@michelf.com>
Message-ID: <06747511-EFAC-4563-875E-FB206B89B0F3@michelf.com>

On 2007-06-08 at 14:38, Thomas Broyer wrote:

> Have you tried PH5P (pure PHP HTML5 parser)?
>
> http://jero.net/lab/ph5p/

Just did, but unfortunately at test case 29 of the first file it falls
into this piece of code:

    exit('

select not yet supported.

    ');

which ends the process. So the result isn't very interesting. If I
remove it -- and also remove the two other `exit` statements found in
the code -- it falls into an infinite loop at test2.dat, case 4 (
test
). Here's the result anyway:

Note that there is nothing right now to tell apart failing and passing
tests.

Michel Fortin
michel.fortin at michelf.com
http://www.michelf.com/

From hsivonen at iki.fi  Sun Jun 10 13:42:38 2007
From: hsivonen at iki.fi (Henri Sivonen)
Date: Sun, 10 Jun 2007 23:42:38 +0300
Subject: [Imps] Percentages of HTML tags with a given number of attributes
Message-ID: <9599AA61-4391-4B8B-9B50-9C7CA6D09809@iki.fi>

Has anyone researched what percentage of HTML tags out there has <= 1
attributes, <= 2 attributes, etc.? If yes, are the numbers public
somewhere?

This kind of data is of interest when choosing a memory allocation
policy for attributes.

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

From hsivonen at iki.fi  Fri Jun 15 06:32:19 2007
From: hsivonen at iki.fi (Henri Sivonen)
Date: Fri, 15 Jun 2007 16:32:19 +0300
Subject: [Imps] Character token coalescing in tokenizer tests
Message-ID: <310EA4C7-D4DB-440F-98EC-2B3608D867ED@iki.fi>

> {"description":"Ampersand, number sign",
> "input":"&#",
> "output":["ParseError", ["Character", "&"], ["Character", "#"]]},
>
> {"description":"Unfinished numeric entity",
> "input":"&#x",
> "output":["ParseError", ["Character", "&#x"]]},

Would it work for html5lib if consistent coalescing was used in the
test format? That is, could the first of these two be changed to

{"description":"Ampersand, number sign",
"input":"&#",
"output":["ParseError", ["Character", "&#"]]},

?
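A minimal sketch of what consistent coalescing could look like on the
test-harness side, assuming the expected output is a list whose items
are either the string "ParseError" or a ["Character", data] pair as in
the test cases quoted above. The function name coalesce_characters is
hypothetical and is not taken from html5lib. Only directly adjacent
Character tokens are merged, so a "ParseError" between two Character
tokens keeps them apart.

```python
def coalesce_characters(tokens):
    """Merge runs of adjacent ["Character", data] tokens into one token.

    Hypothetical helper: assumes each token is either a plain string
    (such as "ParseError") or a list whose first item is the token type.
    """
    merged = []
    for token in tokens:
        is_char = isinstance(token, list) and token[0] == "Character"
        prev_is_char = (
            bool(merged)
            and isinstance(merged[-1], list)
            and merged[-1][0] == "Character"
        )
        if is_char and prev_is_char:
            # Extend the previous Character token instead of adding a new one.
            merged[-1] = ["Character", merged[-1][1] + token[1]]
        else:
            # Copy list tokens so the caller's input is never mutated.
            merged.append(list(token) if isinstance(token, list) else token)
    return merged


if __name__ == "__main__":
    expected = ["ParseError", ["Character", "&"], ["Character", "#"]]
    print(coalesce_characters(expected))
```

Applying the same normalisation to both the expected output and the
tokenizer's actual output before comparing them would make a test
independent of how either side happens to split character data.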

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

From jg307 at cam.ac.uk  Fri Jun 15 06:47:49 2007
From: jg307 at cam.ac.uk (James Graham)
Date: Fri, 15 Jun 2007 14:47:49 +0100
Subject: [Imps] Character token coalescing in tokenizer tests
In-Reply-To: <310EA4C7-D4DB-440F-98EC-2B3608D867ED@iki.fi>
References: <310EA4C7-D4DB-440F-98EC-2B3608D867ED@iki.fi>
Message-ID: <46729885.20706@cam.ac.uk>

Henri Sivonen wrote:
>> {"description":"Ampersand, number sign",
>> "input":"&#",
>> "output":["ParseError", ["Character", "&"], ["Character", "#"]]},
>>
>> {"description":"Unfinished numeric entity",
>> "input":"&#x",
>> "output":["ParseError", ["Character", "&#x"]]},
>
> Would it work for html5lib if consistent coalescing was used in the
> test format? That is, could the first of these two be changed to
> {"description":"Ampersand, number sign",
> "input":"&#",
> "output":["ParseError", ["Character", "&#"]]},
> ?
>

Yes.

-- 
"Eternity's a terrible thought. I mean, where's it all going to end?"
  -- Tom Stoppard, Rosencrantz and Guildenstern are Dead

From hsivonen at iki.fi  Sun Jun 17 09:21:56 2007
From: hsivonen at iki.fi (Henri Sivonen)
Date: Sun, 17 Jun 2007 19:21:56 +0300
Subject: [Imps] Character token coalescing in tokenizer tests
In-Reply-To: <46729885.20706@cam.ac.uk>
References: <310EA4C7-D4DB-440F-98EC-2B3608D867ED@iki.fi>
 <46729885.20706@cam.ac.uk>
Message-ID:

On Jun 15, 2007, at 16:47, James Graham wrote:

> Yes.

Great.
Filed bug with test patch:
http://code.google.com/p/html5lib/issues/detail?id=45

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

From t.broyer at gmail.com  Wed Jun 20 00:38:31 2007
From: t.broyer at gmail.com (Thomas Broyer)
Date: Wed, 20 Jun 2007 09:38:31 +0200
Subject: [Imps] [whatwg] html5 parsing/tokenizing
In-Reply-To: <8ad71be30706191620t43a6ab88v51037e1bf6c49f6@mail.gmail.com>
References: <8ad71be30706191620t43a6ab88v51037e1bf6c49f6@mail.gmail.com>
Message-ID:

> When the tokenization state machine is defined, every state first
> "consumes" and then potentially "emits". Some of the states transfer to
> another state with an order to "re-consume the character in the next
> state". This means that what you do in the new state is dependent on
> what you did in the last state and that the "consume" is necessarily an
> inconsistent operation. A much better wording would be "look at the next
> character" and on state transition "consume and emit" or just "emit
> without consumption" making it clear when the input cursor moves.

I did the same in Twintsam with PeekChar/PeekChars and EatChar/EatChars
methods.
http://twintsam.googlecode.com/svn/trunk/Twintsam/Html/HtmlReader.StreamHandling.cs
(beware, Twintsam hasn't been updated since January so it's not in sync
with the spec as it is now)

Though actually, you could just use a character queue into which you
push back characters that need to be "re-consumed" (i.e. you "un-read"
the character and then you switch to the other state). This is what
html5lib does:
http://html5lib.googlecode.com/svn/trunk/python/src/tokenizer.py
(search for self.stream.queue; this needs to be refactored with an
unread() method on the HTMLInputStream)

That is to say, I don't think the spec should be changed at all. It's
just a matter of how you implement it. You just have to know that the
"queue" won't ever be larger than 9 characters as there are tweaks for
0-prefixed numeric entities and/or numeric entities greater than
1114111.
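The push-back queue described above can be sketched roughly as follows.
This is a toy illustration under stated assumptions, not html5lib's
actual code: the class PushbackStream, its char() and unread() methods,
and the two-state tokenize() function are all hypothetical names, and
the state machine only separates digit runs from other characters
rather than implementing any part of the HTML5 tokenizer.

```python
from collections import deque


class PushbackStream:
    """Character stream with an un-read queue (hypothetical sketch)."""

    def __init__(self, text):
        self._text = text
        self._pos = 0
        self._queue = deque()  # characters waiting to be re-consumed

    def char(self):
        """Return the next character, or None at end of input."""
        if self._queue:
            return self._queue.popleft()
        if self._pos >= len(self._text):
            return None
        c = self._text[self._pos]
        self._pos += 1
        return c

    def unread(self, chars):
        """Push characters back so they are read again, in order."""
        self._queue.extendleft(reversed(chars))


def tokenize(text):
    """Toy two-state machine: 'data' emits characters, 'number' collects
    digits. A character that does not belong to the current state is
    un-read and then re-consumed in the other state."""
    stream = PushbackStream(text)
    state, digits, out = "data", "", []
    while True:
        c = stream.char()
        if state == "data":
            if c is None:
                break
            if c.isdigit():
                stream.unread(c)      # re-consume in the 'number' state
                state = "number"
            else:
                out.append(("Character", c))
        else:  # state == "number"
            if c is not None and c.isdigit():
                digits += c
            else:
                out.append(("Number", digits))
                digits = ""
                if c is not None:
                    stream.unread(c)  # re-consume in the 'data' state
                state = "data"
    return out


if __name__ == "__main__":
    print(tokenize("a12b"))
```

Every state still begins by consuming one character, so the state code
reads the same way as the spec prose; only the stream needs to know
that some characters are delivered twice. In this toy the queue never
holds more than one character.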

> It would be nice if all tags (except comments) were considered
> "declarations" instead of bogus comments. Then DOCTYPE wouldn't need
> special handling by the tokenizer, just special handling by the parser.
> (Too much of the parser seems to have gotten into the tokenizer; with
> CDATA and RCDATA, this is a necessary evil. With it
> isn't.)

I can't see the problem here; plus DOCTYPE parsing is special because
we need the DOCTYPE name. Moreover, the spec has changed recently so
that DOCTYPE parsing takes care of PUBLIC and SYSTEM identifiers.

-- 
Thomas Broyer

From hsivonen at iki.fi  Thu Jun 28 05:53:33 2007
From: hsivonen at iki.fi (Henri Sivonen)
Date: Thu, 28 Jun 2007 15:53:33 +0300
Subject: [Imps] HTML 4.01 compatibility modes for an HTML5 parser
Message-ID: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi>

Recognizing that there are people who want to treat the HTML 4.01
doctypes as non-errors for the time being, my old prototype parser had
four modes for dealing with the HTML 4.01 legacy as interesting to
users today. To avoid regressing on functionality with my current
replacement parser project, I've been thinking that I should retain
those four modes.

The modes are now drafted as follows:

    /**
     * Be a pure HTML5 parser.
     */
    HTML5,

    /**
     * Require the HTML 4.01 Transitional public id. Turn on HTML4-specific
     * additional errors regardless of doctype.
     */
    HTML401_TRANSITIONAL,

    /**
     * Require the HTML 4.01 Transitional public id and a system id. Turn on
     * HTML4-specific additional errors regardless of doctype.
     */
    HTML401_STRICT,

    /**
     * Treat the HTML5 doctype, doctypes with the HTML 4.01 Strict public id and
     * doctypes with the HTML 4.01 Transitional public id and a system id as
     * non-errors. Turn on HTML4-specific additional errors if the public id is
     * the HTML 4.01 Strict or Transitional public id.
     */
    AUTO

Does this seem reasonable? Are there additional modes that would be
such low-hanging fruit that I should offer more modes?
On the other hand, is there something wrong with offering these modes?

Note that not providing modes for Appendix C checking is a deliberate
choice to better manage how I use my time.

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

From hsivonen at iki.fi  Thu Jun 28 07:05:44 2007
From: hsivonen at iki.fi (Henri Sivonen)
Date: Thu, 28 Jun 2007 17:05:44 +0300
Subject: [Imps] HTML 4.01 compatibility modes for an HTML5 parser
In-Reply-To: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi>
References: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi>
Message-ID: <0C42402A-72DB-4855-B350-A9AE3A365D41@iki.fi>

On Jun 28, 2007, at 15:53, Henri Sivonen wrote:

> Does this seem reasonable? Are there additional modes that would be
> such low-hanging fruit that I should offer more modes? On the other
> hand, is there something wrong with offering these modes?

Revised per IRC discussion with Anne:

    /**
     * Be a pure HTML5 parser.
     */
    HTML,

    /**
     * Require the HTML 4.01 Transitional public id. Turn on HTML4-specific
     * additional errors regardless of doctype.
     */
    HTML401_TRANSITIONAL,

    /**
     * Require the HTML 4.01 Transitional public id and a system id. Turn on
     * HTML4-specific additional errors regardless of doctype.
     */
    HTML401_STRICT,

    /**
     * Treat the doctype required by HTML 5, doctypes with the HTML 4.01 Strict
     * public id and doctypes with the HTML 4.01 Transitional public id and a
     * system id as non-errors. Turn on HTML4-specific additional errors if the
     * public id is the HTML 4.01 Strict or Transitional public id.
     */
    AUTO,

    /**
     * Never enable HTML4-specific error checks. Never report any doctype
     * condition as an error. (Doctype tokens in wrong places will be
     * reported as errors, though.) The application may decide what to log
     * in response to calls to DocumentModeHandler. This mode
     * is meant for doing surveys on existing content.
     */
    NO_DOCTYPE_ERRORS

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

From annevk at opera.com  Thu Jun 28 07:08:53 2007
From: annevk at opera.com (Anne van Kesteren)
Date: Thu, 28 Jun 2007 16:08:53 +0200
Subject: [Imps] HTML 4.01 compatibility modes for an HTML5 parser
In-Reply-To: <0C42402A-72DB-4855-B350-A9AE3A365D41@iki.fi>
References: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi>
 <0C42402A-72DB-4855-B350-A9AE3A365D41@iki.fi>
Message-ID:

On Thu, 28 Jun 2007 16:05:44 +0200, Henri Sivonen wrote:
> NO_DOCTYPE_ERRORS

I think you do want every DOCTYPE that triggers quirks mode or almost
quirks mode (or whatever they're called these days) to trigger an error
in this mode.

-- 
Anne van Kesteren

From hsivonen at iki.fi  Thu Jun 28 07:11:58 2007
From: hsivonen at iki.fi (Henri Sivonen)
Date: Thu, 28 Jun 2007 17:11:58 +0300
Subject: [Imps] HTML 4.01 compatibility modes for an HTML5 parser
In-Reply-To:
References: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi>
 <0C42402A-72DB-4855-B350-A9AE3A365D41@iki.fi>
Message-ID:

On Jun 28, 2007, at 17:08, Anne van Kesteren wrote:

> On Thu, 28 Jun 2007 16:05:44 +0200, Henri Sivonen wrote:
>> NO_DOCTYPE_ERRORS
>
> I think you do want every DOCTYPE that triggers quirks mode or
> almost quirks mode (or whatever they're called these days) to
> trigger an error in this mode.

The definition would delegate actions on a quirky doctype to the app so
that survey spiders could note the situation specifically. Do you
expect people to do bad things with this mode if I leave even
quirkiness reporting to the app?
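The two positions in this exchange can be put side by side in a small
sketch. Everything here is hypothetical and written in Python purely
for illustration: DocumentMode, SurveyHandler, ToyParser and the
report_quirks_as_error flag are assumed names, not the parser's real
API, and the caller supplies the document mode directly because real
quirks detection from public and system ids is out of scope.

```python
from collections import Counter
from enum import Enum


class DocumentMode(Enum):
    STANDARDS = "standards"
    ALMOST_STANDARDS = "almost standards"
    QUIRKS = "quirks"


class SurveyHandler:
    """App-side callback: a survey spider tallies the modes it sees."""

    def __init__(self):
        self.counts = Counter()

    def document_mode(self, mode, public_id, system_id):
        self.counts[mode] += 1


class ToyParser:
    """Stand-in for a parser running in a NO_DOCTYPE_ERRORS-like mode."""

    def __init__(self, handler, report_quirks_as_error=False):
        self.handler = handler
        # False: leave quirkiness reporting entirely to the app.
        # True: also log an error for any non-standards doctype.
        self.report_quirks_as_error = report_quirks_as_error
        self.errors = []

    def doctype_seen(self, mode, public_id=None, system_id=None):
        if self.report_quirks_as_error and mode is not DocumentMode.STANDARDS:
            self.errors.append(f"doctype triggers {mode.value} mode")
        # In both variants the app is told which mode was chosen.
        self.handler.document_mode(mode, public_id, system_id)


if __name__ == "__main__":
    handler = SurveyHandler()
    parser = ToyParser(handler)
    parser.doctype_seen(DocumentMode.QUIRKS, "-//W3C//DTD HTML 3.2 Final//EN")
    print(parser.errors, dict(handler.counts))
```

With the flag off, the error list stays empty and the handler alone
decides what to record; with it on, a survey gets the error without
adding it manually.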

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

From annevk at opera.com  Thu Jun 28 07:19:00 2007
From: annevk at opera.com (Anne van Kesteren)
Date: Thu, 28 Jun 2007 16:19:00 +0200
Subject: [Imps] HTML 4.01 compatibility modes for an HTML5 parser
In-Reply-To:
References: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi>
 <0C42402A-72DB-4855-B350-A9AE3A365D41@iki.fi>
Message-ID:

On Thu, 28 Jun 2007 16:11:58 +0200, Henri Sivonen wrote:
> On Jun 28, 2007, at 17:08, Anne van Kesteren wrote:
>> On Thu, 28 Jun 2007 16:05:44 +0200, Henri Sivonen wrote:
>>> NO_DOCTYPE_ERRORS
>>
>> I think you do want every DOCTYPE that triggers quirks mode or almost
>> quirks mode (or whatever they're called these days) to trigger an error
>> in this mode.
>
> The definition would delegate actions on a quirky doctype to the app so
> that survey spiders could note the situation specifically. Do you expect
> people to do bad things with this mode if I leave even quirkiness
> reporting to the app?

My thought was that it would be more useful for surveys if you don't
have to manually add such errors. However, I suppose you can argue
either way and I don't feel strongly about it.

-- 
Anne van Kesteren

From hsivonen at iki.fi  Thu Jun 28 07:21:29 2007
From: hsivonen at iki.fi (Henri Sivonen)
Date: Thu, 28 Jun 2007 17:21:29 +0300
Subject: [Imps] HTML 4.01 compatibility modes for an HTML5 parser
In-Reply-To:
References: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi>
 <0C42402A-72DB-4855-B350-A9AE3A365D41@iki.fi>
Message-ID: <67E54C85-4D41-41F3-AE44-686670097849@iki.fi>

On Jun 28, 2007, at 17:19, Anne van Kesteren wrote:

> On Thu, 28 Jun 2007 16:11:58 +0200, Henri Sivonen wrote:
>> On Jun 28, 2007, at 17:08, Anne van Kesteren wrote:
>>> On Thu, 28 Jun 2007 16:05:44 +0200, Henri Sivonen wrote:
>>>> NO_DOCTYPE_ERRORS
>>>
>>> I think you do want every DOCTYPE that triggers quirks mode or
>>> almost quirks mode (or whatever they're called these days) to
>>> trigger an error in this mode.
>> >> The definition would delegate actions on a quirky doctype to the >> app so that survey spiders could note the situation specifically. >> Do you expect people to do bad things with this mode if I leave >> even quirkiness reporting to the app? > > My thought was that it would be more useful for surveys if you > don't have to manually add such errors. However, I suppose you can > argue either way and I don't feel strongly about it. There's also the option of adding yet another mode. Make it a pref. ;-) ALLOW_ANY_STANDARDS_MODE -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From t.broyer at gmail.com Fri Jun 8 11:38:29 2007 From: t.broyer at gmail.com (Thomas Broyer) Date: Fri, 8 Jun 2007 20:38:29 +0200 Subject: [Imps] [whatwg] HTML5Lib Test Suite vs. PHP 5 HTML Parser In-Reply-To: <2A1705BE-95FD-4A9C-9DFC-E25786F0AF4B@michelf.com> References: <2A1705BE-95FD-4A9C-9DFC-E25786F0AF4B@michelf.com> Message-ID: 2007/6/8, Michel Fortin: > Perhaps someone will find this raw data interesting. I've made a > script to run the HTML5Lib test cases against the built-in HTML > parser in PHP 5. And here's the result: > > Have you tried PH5P (pure PHP HTML5 parser)? http://jero.net/lab/ph5p/ [CC'd Implementors list, please follow-up there] -- Thomas Broyer From michel.fortin at michelf.com Fri Jun 8 12:12:56 2007 From: michel.fortin at michelf.com (Michel Fortin) Date: Fri, 08 Jun 2007 15:12:56 -0400 Subject: [Imps] [whatwg] HTML5Lib Test Suite vs. PHP 5 HTML Parser In-Reply-To: References: <2A1705BE-95FD-4A9C-9DFC-E25786F0AF4B@michelf.com> Message-ID: <06747511-EFAC-4563-875E-FB206B89B0F3@michelf.com> Le 2007-06-08 ? 14:38, Thomas Broyer a ?crit : > Have you tried PH5P (pure PHP HTML5 parser)? > > http://jero.net/lab/ph5p/ Just did, but unfortunately at test case 29 of the first file it falls into this piece of code: exit('

select not yet supported.

'); which ends the process. So the result isn't much interesting. If I remove it -- and I also remove the two others `exit` statements found in the code -- it falls into an infinite loop at test2.dat, case 4 (
test
). Here's the result anyway: Note that there is nothing right now to tell appart failing and passing tests. Michel Fortin michel.fortin at michelf.com http://www.michelf.com/ From hsivonen at iki.fi Sun Jun 10 13:42:38 2007 From: hsivonen at iki.fi (Henri Sivonen) Date: Sun, 10 Jun 2007 23:42:38 +0300 Subject: [Imps] Percentages of HTML tags with a given number of attributes Message-ID: <9599AA61-4391-4B8B-9B50-9C7CA6D09809@iki.fi> Has anyone researched what percentage of HTML tags out there has <= 1 attributes, <= 2 attributes, etc.? If yes, are the numbers public somewhere? This kind of data is of interest when choosing a memory allocation policy for attributes. -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From hsivonen at iki.fi Fri Jun 15 06:32:19 2007 From: hsivonen at iki.fi (Henri Sivonen) Date: Fri, 15 Jun 2007 16:32:19 +0300 Subject: [Imps] Character token coalescing in tokenizer tests Message-ID: <310EA4C7-D4DB-440F-98EC-2B3608D867ED@iki.fi> > {"description":"Ampersand, number sign", > "input":"&#", > "output":["ParseError", ["Character", "&"], ["Character", "#"]]}, > > {"description":"Unfinished numeric entity", > "input":"&#x", > "output":["ParseError", ["Character", "&#x"]]}, Would it work for html5lib if consistent coalescing was used in the test format? That is, could the first of these two changed to {"description":"Ampersand, number sign", "input":"&#", "output":["ParseError", ["Character", "&#"]]}, ? 
-- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From jg307 at cam.ac.uk Fri Jun 15 06:47:49 2007 From: jg307 at cam.ac.uk (James Graham) Date: Fri, 15 Jun 2007 14:47:49 +0100 Subject: [Imps] Character token coalescing in tokenizer tests In-Reply-To: <310EA4C7-D4DB-440F-98EC-2B3608D867ED@iki.fi> References: <310EA4C7-D4DB-440F-98EC-2B3608D867ED@iki.fi> Message-ID: <46729885.20706@cam.ac.uk> Henri Sivonen wrote: >> {"description":"Ampersand, number sign", >> "input":"&#", >> "output":["ParseError", ["Character", "&"], ["Character", "#"]]}, >> >> {"description":"Unfinished numeric entity", >> "input":"&#x", >> "output":["ParseError", ["Character", "&#x"]]}, > > Would it work for html5lib if consistent coalescing was used in the > test format? That is, could the first of these two changed to > {"description":"Ampersand, number sign", > "input":"&#", > "output":["ParseError", ["Character", "&#"]]}, > ? > Yes. -- "Eternity's a terrible thought. I mean, where's it all going to end?" -- Tom Stoppard, Rosencrantz and Guildenstern are Dead From hsivonen at iki.fi Sun Jun 17 09:21:56 2007 From: hsivonen at iki.fi (Henri Sivonen) Date: Sun, 17 Jun 2007 19:21:56 +0300 Subject: [Imps] Character token coalescing in tokenizer tests In-Reply-To: <46729885.20706@cam.ac.uk> References: <310EA4C7-D4DB-440F-98EC-2B3608D867ED@iki.fi> <46729885.20706@cam.ac.uk> Message-ID: On Jun 15, 2007, at 16:47, James Graham wrote: > Yes. Great. 
Filed bug with test patch: http://code.google.com/p/html5lib/issues/detail?id=45 -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From t.broyer at gmail.com Wed Jun 20 00:38:31 2007 From: t.broyer at gmail.com (Thomas Broyer) Date: Wed, 20 Jun 2007 09:38:31 +0200 Subject: [Imps] [whatwg] html5 parsing/tokenizing In-Reply-To: <8ad71be30706191620t43a6ab88v51037e1bf6c49f6@mail.gmail.com> References: <8ad71be30706191620t43a6ab88v51037e1bf6c49f6@mail.gmail.com> Message-ID: > When the tokenization state machine is defined, every state first > "consumes" and then potentially "emits". Some of the states transfer to > another state with an order to "re-consume the character in the next > state". This means that what you do in the new state is dependant on > what you did in the last state and that the "comsume" is necessarily an > inconsistent operation. A much better wording would be "look at the next > character" and on state transition "consume and emit" or just "emit > without consumption" making it clear when the input cursor moves. I did the same in Twintsam with PeekChar/PeekChars and EatChar/EatChars methods. http://twintsam.googlecode.com/svn/trunk/Twintsam/Html/HtmlReader.StreamHandling.cs (beware, Twintsam hasn't been updated since January so it's not in sync with the spec as it is now) though actually you could just use a character queue into which you push back characters that needs to be "re-consumed" (i.e. you "un-read" the character and then you switch to the other state). This is what html5lib does: http://html5lib.googlecode.com/svn/trunk/python/src/tokenizer.py (search for self.stream.queue; this needs to be refactored with an unread() method on the HTMLInputStream) That is to say, I don't think the spec should be changed at all. It's just a matter of how you implement it. You just have to know that the "queue" won't ever be larger than 9 characters as there are tweaks for 0-prefixed numeric entities and/or numeric entities greater 1114111. 
> It would be nice if all tags (except comments) were considered > "declarations" instead of bogus comments. Then DOCTYPE wouldn't need > special handling by the tokenizer, just special handling by the parser. > (Too much of the parser seems to have gotten into the tokenizer; with > CDATA and RCDATA, this is a necessary evil. With it > isn't.) I can't see the problem here; plus DOCTYPE parsing is special because we need the DOCTYPE name. Moreover, the spec has changed recently so that DOCTYPE parsing takes care of PUBLIC and SYSTEM identifiers. -- Thomas Broyer From hsivonen at iki.fi Thu Jun 28 05:53:33 2007 From: hsivonen at iki.fi (Henri Sivonen) Date: Thu, 28 Jun 2007 15:53:33 +0300 Subject: [Imps] HTML 4.01 compatibility modes for an HTML5 parser Message-ID: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi> Recognizing that there are people who want to treat the HTML 4.01 doctypes as non-errors for the time being, my old prototype parser had four modes for dealing with the HTML 4.01 legacy as interesting to users today. To avoid regressing on functionality with my current replacement parser project, I've been thinking that I should retain those four modes. The modes are now drafted as follows: /** * Be a pure HTML5 parser. */ HTML5, /** * Require the HTML 4.01 Transitional public id. Turn on HTML4- specific * additional errors regardless of doctype. */ HTML401_TRANSITIONAL, /** * Require the HTML 4.01 Transitional public id and a system id. Turn on * HTML4-specific additional errors regardless of doctype. */ HTML401_STRICT, /** * Treat the HTML5 doctype, doctypes with the HTML 4.01 Strict public id and * doctypes with the HTML 4.01 Transitional public id and a system id as * non-errors. Turn of HTML4-specific additional errors if the public id is * the HTML 4.01 Strict or Transitional public id. */ AUTO Does this seem reasonable? Are there additional modes that would be such low-hanging fruit that I should offer more modes? 
On the other hand, is there something wrong with offering these modes? Note that not providing modes for Appendix C checking is a deliberate choice to better manage how I use my time. -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From hsivonen at iki.fi Thu Jun 28 07:05:44 2007 From: hsivonen at iki.fi (Henri Sivonen) Date: Thu, 28 Jun 2007 17:05:44 +0300 Subject: [Imps] HTML 4.01 compatibility modes for an HTML5 parser In-Reply-To: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi> References: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi> Message-ID: <0C42402A-72DB-4855-B350-A9AE3A365D41@iki.fi> On Jun 28, 2007, at 15:53, Henri Sivonen wrote: > Does this seem reasonable? Are there additional modes that would be > such low-hanging fruit that I should offer more modes? On the other > hand, is there something wrong with offering these modes? Revised per IRC discussion with Anne: /** * Be a pure HTML5 parser. */ HTML, /** * Require the HTML 4.01 Transitional public id. Turn on HTML4- specific * additional errors regardless of doctype. */ HTML401_TRANSITIONAL, /** * Require the HTML 4.01 Transitional public id and a system id. Turn on * HTML4-specific additional errors regardless of doctype. */ HTML401_STRICT, /** * Treat the doctype required by HTML 5, doctypes with the HTML 4.01 Strict * public id and doctypes with the HTML 4.01 Transitional public id and a * system id as non-errors. Turn on HTML4-specific additional errors if the * public id is the HTML 4.01 Strict or Transitional public id. */ AUTO, /** * Never enable HTML4-specific error checks. Never report any doctype * condition as an error. (Doctype tokens in wrong places will be * reported as errors, though.) The application may decide what to log * in response to calls to DocumentModeHanler. This mode * in meant for doing surveys on existing content. 
*/ NO_DOCTYPE_ERRORS -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From annevk at opera.com Thu Jun 28 07:08:53 2007 From: annevk at opera.com (Anne van Kesteren) Date: Thu, 28 Jun 2007 16:08:53 +0200 Subject: [Imps] HTML 4.01 compatibility modes for an HTML5 parser In-Reply-To: <0C42402A-72DB-4855-B350-A9AE3A365D41@iki.fi> References: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi> <0C42402A-72DB-4855-B350-A9AE3A365D41@iki.fi> Message-ID: On Thu, 28 Jun 2007 16:05:44 +0200, Henri Sivonen wrote: > NO_DOCTYPE_ERRORS I think you do want every DOCTYPE that triggers quirks mode or almost quirks mode (or whatever they're called these days) to trigger an error in this mode. -- Anne van Kesteren From hsivonen at iki.fi Thu Jun 28 07:11:58 2007 From: hsivonen at iki.fi (Henri Sivonen) Date: Thu, 28 Jun 2007 17:11:58 +0300 Subject: [Imps] HTML 4.01 compatibility modes for an HTML5 parser In-Reply-To: References: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi> <0C42402A-72DB-4855-B350-A9AE3A365D41@iki.fi> Message-ID: On Jun 28, 2007, at 17:08, Anne van Kesteren wrote: > On Thu, 28 Jun 2007 16:05:44 +0200, Henri Sivonen > wrote: >> NO_DOCTYPE_ERRORS > > I think you do want every DOCTYPE that triggers quirks mode or > almost quirks mode (or whatever they're called these days) to > trigger an error in this mode. The definition would delegate actions on a quirky doctype to the app so that survey spiders could note the situation specifically. Do you expect people to do bad things with this mode if I leave even quirkiness reporting to the app? 
-- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From annevk at opera.com Thu Jun 28 07:19:00 2007 From: annevk at opera.com (Anne van Kesteren) Date: Thu, 28 Jun 2007 16:19:00 +0200 Subject: [Imps] HTML 4.01 compatibility modes for an HTML5 parser In-Reply-To: References: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi> <0C42402A-72DB-4855-B350-A9AE3A365D41@iki.fi> Message-ID: On Thu, 28 Jun 2007 16:11:58 +0200, Henri Sivonen wrote: > On Jun 28, 2007, at 17:08, Anne van Kesteren wrote: >> On Thu, 28 Jun 2007 16:05:44 +0200, Henri Sivonen >> wrote: >>> NO_DOCTYPE_ERRORS >> >> I think you do want every DOCTYPE that triggers quirks mode or almost >> quirks mode (or whatever they're called these days) to trigger an error >> in this mode. > > The definition would delegate actions on a quirky doctype to the app so > that survey spiders could note the situation specifically. Do you expect > people to do bad things with this mode if I leave even quirkiness > reporting to the app? My thought was that it would be more useful for surveys if you don't have to manually add such errors. However, I suppose you can argue either way and I don't feel strongly about it. -- Anne van Kesteren From hsivonen at iki.fi Thu Jun 28 07:21:29 2007 From: hsivonen at iki.fi (Henri Sivonen) Date: Thu, 28 Jun 2007 17:21:29 +0300 Subject: [Imps] HTML 4.01 compatibility modes for an HTML5 parser In-Reply-To: References: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi> <0C42402A-72DB-4855-B350-A9AE3A365D41@iki.fi> Message-ID: <67E54C85-4D41-41F3-AE44-686670097849@iki.fi> On Jun 28, 2007, at 17:19, Anne van Kesteren wrote: > On Thu, 28 Jun 2007 16:11:58 +0200, Henri Sivonen > wrote: >> On Jun 28, 2007, at 17:08, Anne van Kesteren wrote: >>> On Thu, 28 Jun 2007 16:05:44 +0200, Henri Sivonen >>> wrote: >>>> NO_DOCTYPE_ERRORS >>> >>> I think you do want every DOCTYPE that triggers quirks mode or >>> almost quirks mode (or whatever they're called these days) to >>> trigger an error in this mode. 
>> >> The definition would delegate actions on a quirky doctype to the >> app so that survey spiders could note the situation specifically. >> Do you expect people to do bad things with this mode if I leave >> even quirkiness reporting to the app? > > My thought was that it would be more useful for surveys if you > don't have to manually add such errors. However, I suppose you can > argue either way and I don't feel strongly about it. There's also the option of adding yet another mode. Make it a pref. ;-) ALLOW_ANY_STANDARDS_MODE -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From t.broyer at gmail.com Fri Jun 8 11:38:29 2007 From: t.broyer at gmail.com (Thomas Broyer) Date: Fri, 8 Jun 2007 20:38:29 +0200 Subject: [Imps] [whatwg] HTML5Lib Test Suite vs. PHP 5 HTML Parser In-Reply-To: <2A1705BE-95FD-4A9C-9DFC-E25786F0AF4B@michelf.com> References: <2A1705BE-95FD-4A9C-9DFC-E25786F0AF4B@michelf.com> Message-ID: 2007/6/8, Michel Fortin: > Perhaps someone will find this raw data interesting. I've made a > script to run the HTML5Lib test cases against the built-in HTML > parser in PHP 5. And here's the result: > > Have you tried PH5P (pure PHP HTML5 parser)? http://jero.net/lab/ph5p/ [CC'd Implementors list, please follow-up there] -- Thomas Broyer From michel.fortin at michelf.com Fri Jun 8 12:12:56 2007 From: michel.fortin at michelf.com (Michel Fortin) Date: Fri, 08 Jun 2007 15:12:56 -0400 Subject: [Imps] [whatwg] HTML5Lib Test Suite vs. PHP 5 HTML Parser In-Reply-To: References: <2A1705BE-95FD-4A9C-9DFC-E25786F0AF4B@michelf.com> Message-ID: <06747511-EFAC-4563-875E-FB206B89B0F3@michelf.com> Le 2007-06-08 ? 14:38, Thomas Broyer a ?crit : > Have you tried PH5P (pure PHP HTML5 parser)? > > http://jero.net/lab/ph5p/ Just did, but unfortunately at test case 29 of the first file it falls into this piece of code: exit('

select not yet supported.

'); which ends the process. So the result isn't much interesting. If I remove it -- and I also remove the two others `exit` statements found in the code -- it falls into an infinite loop at test2.dat, case 4 (
test
). Here's the result anyway: Note that there is nothing right now to tell appart failing and passing tests. Michel Fortin michel.fortin at michelf.com http://www.michelf.com/ From hsivonen at iki.fi Sun Jun 10 13:42:38 2007 From: hsivonen at iki.fi (Henri Sivonen) Date: Sun, 10 Jun 2007 23:42:38 +0300 Subject: [Imps] Percentages of HTML tags with a given number of attributes Message-ID: <9599AA61-4391-4B8B-9B50-9C7CA6D09809@iki.fi> Has anyone researched what percentage of HTML tags out there has <= 1 attributes, <= 2 attributes, etc.? If yes, are the numbers public somewhere? This kind of data is of interest when choosing a memory allocation policy for attributes. -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From hsivonen at iki.fi Fri Jun 15 06:32:19 2007 From: hsivonen at iki.fi (Henri Sivonen) Date: Fri, 15 Jun 2007 16:32:19 +0300 Subject: [Imps] Character token coalescing in tokenizer tests Message-ID: <310EA4C7-D4DB-440F-98EC-2B3608D867ED@iki.fi> > {"description":"Ampersand, number sign", > "input":"&#", > "output":["ParseError", ["Character", "&"], ["Character", "#"]]}, > > {"description":"Unfinished numeric entity", > "input":"&#x", > "output":["ParseError", ["Character", "&#x"]]}, Would it work for html5lib if consistent coalescing was used in the test format? That is, could the first of these two changed to {"description":"Ampersand, number sign", "input":"&#", "output":["ParseError", ["Character", "&#"]]}, ? 
-- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From jg307 at cam.ac.uk Fri Jun 15 06:47:49 2007 From: jg307 at cam.ac.uk (James Graham) Date: Fri, 15 Jun 2007 14:47:49 +0100 Subject: [Imps] Character token coalescing in tokenizer tests In-Reply-To: <310EA4C7-D4DB-440F-98EC-2B3608D867ED@iki.fi> References: <310EA4C7-D4DB-440F-98EC-2B3608D867ED@iki.fi> Message-ID: <46729885.20706@cam.ac.uk> Henri Sivonen wrote: >> {"description":"Ampersand, number sign", >> "input":"&#", >> "output":["ParseError", ["Character", "&"], ["Character", "#"]]}, >> >> {"description":"Unfinished numeric entity", >> "input":"&#x", >> "output":["ParseError", ["Character", "&#x"]]}, > > Would it work for html5lib if consistent coalescing was used in the > test format? That is, could the first of these two changed to > {"description":"Ampersand, number sign", > "input":"&#", > "output":["ParseError", ["Character", "&#"]]}, > ? > Yes. -- "Eternity's a terrible thought. I mean, where's it all going to end?" -- Tom Stoppard, Rosencrantz and Guildenstern are Dead From hsivonen at iki.fi Sun Jun 17 09:21:56 2007 From: hsivonen at iki.fi (Henri Sivonen) Date: Sun, 17 Jun 2007 19:21:56 +0300 Subject: [Imps] Character token coalescing in tokenizer tests In-Reply-To: <46729885.20706@cam.ac.uk> References: <310EA4C7-D4DB-440F-98EC-2B3608D867ED@iki.fi> <46729885.20706@cam.ac.uk> Message-ID: On Jun 15, 2007, at 16:47, James Graham wrote: > Yes. Great. 
Filed bug with test patch:
http://code.google.com/p/html5lib/issues/detail?id=45

--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

From t.broyer at gmail.com Wed Jun 20 00:38:31 2007
From: t.broyer at gmail.com (Thomas Broyer)
Date: Wed, 20 Jun 2007 09:38:31 +0200
Subject: [Imps] [whatwg] html5 parsing/tokenizing
In-Reply-To: <8ad71be30706191620t43a6ab88v51037e1bf6c49f6@mail.gmail.com>
References: <8ad71be30706191620t43a6ab88v51037e1bf6c49f6@mail.gmail.com>
Message-ID:

> When the tokenization state machine is defined, every state first
> "consumes" and then potentially "emits". Some of the states transfer to
> another state with an order to "re-consume the character in the next
> state". This means that what you do in the new state is dependent on
> what you did in the last state and that the "consume" is necessarily an
> inconsistent operation. A much better wording would be "look at the next
> character" and on state transition "consume and emit" or just "emit
> without consumption" making it clear when the input cursor moves.

I did the same in Twintsam with PeekChar/PeekChars and EatChar/EatChars methods.
http://twintsam.googlecode.com/svn/trunk/Twintsam/Html/HtmlReader.StreamHandling.cs
(beware, Twintsam hasn't been updated since January so it's not in sync with the spec as it is now)

Though actually you could just use a character queue into which you push back characters that need to be "re-consumed" (i.e. you "un-read" the character and then you switch to the other state). This is what html5lib does:
http://html5lib.googlecode.com/svn/trunk/python/src/tokenizer.py
(search for self.stream.queue; this needs to be refactored with an unread() method on the HTMLInputStream)

That is to say, I don't think the spec should be changed at all. It's just a matter of how you implement it. You just have to know that the "queue" won't ever be larger than 9 characters as there are tweaks for 0-prefixed numeric entities and/or numeric entities greater than 1114111.
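
The push-back queue described above can be sketched roughly as follows. This is an illustration only, not html5lib's actual HTMLInputStream: the class name PushbackStream and the method char() are assumptions, and unread() is modelled on the refactoring suggested above.

```python
from collections import deque


class PushbackStream:
    """Character stream with a small queue for "re-consumed" characters."""

    def __init__(self, text):
        self._text = text
        self._pos = 0
        self._queue = deque()  # characters that were un-read, oldest first

    def char(self):
        """Consume one character; un-read characters come back first."""
        if self._queue:
            return self._queue.popleft()
        if self._pos < len(self._text):
            c = self._text[self._pos]
            self._pos += 1
            return c
        return None  # end of input

    def unread(self, chars):
        """Push characters back so they are re-consumed in their original order."""
        if chars:
            self._queue.extendleft(reversed(chars))


# A state that consumed "&#x" and found no hex digit can hand "#x" back,
# then switch to another state that re-consumes those characters.
stream = PushbackStream("&#x;")
consumed = [stream.char() for _ in range(3)]
stream.unread("#x")
rest = []
while True:
    c = stream.char()
    if c is None:
        break
    rest.append(c)
print(consumed, rest)
```

With this shape the tokenizer states only ever "consume"; re-consumption is just an unread() followed by a state switch, so the spec wording can stay as it is.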

> It would be nice if all tags (except comments) were considered
> "declarations" instead of bogus comments. Then DOCTYPE wouldn't need
> special handling by the tokenizer, just special handling by the parser.
> (Too much of the parser seems to have gotten into the tokenizer; with
> CDATA and RCDATA, this is a necessary evil. With it
> isn't.)

I can't see the problem here; plus DOCTYPE parsing is special because we need the DOCTYPE name. Moreover, the spec has changed recently so that DOCTYPE parsing takes care of PUBLIC and SYSTEM identifiers.

--
Thomas Broyer

From hsivonen at iki.fi Thu Jun 28 05:53:33 2007
From: hsivonen at iki.fi (Henri Sivonen)
Date: Thu, 28 Jun 2007 15:53:33 +0300
Subject: [Imps] HTML 4.01 compatibility modes for an HTML5 parser
Message-ID: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi>

Recognizing that there are people who want to treat the HTML 4.01 doctypes as non-errors for the time being, my old prototype parser had four modes for dealing with the HTML 4.01 legacy as interesting to users today. To avoid regressing on functionality with my current replacement parser project, I've been thinking that I should retain those four modes. The modes are now drafted as follows:

    /**
     * Be a pure HTML5 parser.
     */
    HTML5,

    /**
     * Require the HTML 4.01 Transitional public id. Turn on HTML4-specific
     * additional errors regardless of doctype.
     */
    HTML401_TRANSITIONAL,

    /**
     * Require the HTML 4.01 Transitional public id and a system id. Turn on
     * HTML4-specific additional errors regardless of doctype.
     */
    HTML401_STRICT,

    /**
     * Treat the HTML5 doctype, doctypes with the HTML 4.01 Strict public id and
     * doctypes with the HTML 4.01 Transitional public id and a system id as
     * non-errors. Turn on HTML4-specific additional errors if the public id is
     * the HTML 4.01 Strict or Transitional public id.
     */
    AUTO

Does this seem reasonable? Are there additional modes that would be such low-hanging fruit that I should offer more modes?
On the other hand, is there something wrong with offering these modes?

Note that not providing modes for Appendix C checking is a deliberate choice to better manage how I use my time.

--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

From hsivonen at iki.fi Thu Jun 28 07:05:44 2007
From: hsivonen at iki.fi (Henri Sivonen)
Date: Thu, 28 Jun 2007 17:05:44 +0300
Subject: [Imps] HTML 4.01 compatibility modes for an HTML5 parser
In-Reply-To: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi>
References: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi>
Message-ID: <0C42402A-72DB-4855-B350-A9AE3A365D41@iki.fi>

On Jun 28, 2007, at 15:53, Henri Sivonen wrote:
> Does this seem reasonable? Are there additional modes that would be
> such low-hanging fruit that I should offer more modes? On the other
> hand, is there something wrong with offering these modes?

Revised per IRC discussion with Anne:

    /**
     * Be a pure HTML5 parser.
     */
    HTML,

    /**
     * Require the HTML 4.01 Transitional public id. Turn on HTML4-specific
     * additional errors regardless of doctype.
     */
    HTML401_TRANSITIONAL,

    /**
     * Require the HTML 4.01 Transitional public id and a system id. Turn on
     * HTML4-specific additional errors regardless of doctype.
     */
    HTML401_STRICT,

    /**
     * Treat the doctype required by HTML 5, doctypes with the HTML 4.01 Strict
     * public id and doctypes with the HTML 4.01 Transitional public id and a
     * system id as non-errors. Turn on HTML4-specific additional errors if the
     * public id is the HTML 4.01 Strict or Transitional public id.
     */
    AUTO,

    /**
     * Never enable HTML4-specific error checks. Never report any doctype
     * condition as an error. (Doctype tokens in wrong places will be
     * reported as errors, though.) The application may decide what to log
     * in response to calls to DocumentModeHandler. This mode
     * is meant for doing surveys on existing content.
     */
    NO_DOCTYPE_ERRORS

--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

From annevk at opera.com Thu Jun 28 07:08:53 2007
From: annevk at opera.com (Anne van Kesteren)
Date: Thu, 28 Jun 2007 16:08:53 +0200
Subject: [Imps] HTML 4.01 compatibility modes for an HTML5 parser
In-Reply-To: <0C42402A-72DB-4855-B350-A9AE3A365D41@iki.fi>
References: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi> <0C42402A-72DB-4855-B350-A9AE3A365D41@iki.fi>
Message-ID:

On Thu, 28 Jun 2007 16:05:44 +0200, Henri Sivonen wrote:
> NO_DOCTYPE_ERRORS

I think you do want every DOCTYPE that triggers quirks mode or almost quirks mode (or whatever they're called these days) to trigger an error in this mode.

--
Anne van Kesteren

From hsivonen at iki.fi Thu Jun 28 07:11:58 2007
From: hsivonen at iki.fi (Henri Sivonen)
Date: Thu, 28 Jun 2007 17:11:58 +0300
Subject: [Imps] HTML 4.01 compatibility modes for an HTML5 parser
In-Reply-To:
References: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi> <0C42402A-72DB-4855-B350-A9AE3A365D41@iki.fi>
Message-ID:

On Jun 28, 2007, at 17:08, Anne van Kesteren wrote:
> On Thu, 28 Jun 2007 16:05:44 +0200, Henri Sivonen wrote:
>> NO_DOCTYPE_ERRORS
>
> I think you do want every DOCTYPE that triggers quirks mode or
> almost quirks mode (or whatever they're called these days) to
> trigger an error in this mode.

The definition would delegate actions on a quirky doctype to the app so that survey spiders could note the situation specifically. Do you expect people to do bad things with this mode if I leave even quirkiness reporting to the app?

--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

From annevk at opera.com Thu Jun 28 07:19:00 2007
From: annevk at opera.com (Anne van Kesteren)
Date: Thu, 28 Jun 2007 16:19:00 +0200
Subject: [Imps] HTML 4.01 compatibility modes for an HTML5 parser
In-Reply-To:
References: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi> <0C42402A-72DB-4855-B350-A9AE3A365D41@iki.fi>
Message-ID:

On Thu, 28 Jun 2007 16:11:58 +0200, Henri Sivonen wrote:
> On Jun 28, 2007, at 17:08, Anne van Kesteren wrote:
>> On Thu, 28 Jun 2007 16:05:44 +0200, Henri Sivonen wrote:
>>> NO_DOCTYPE_ERRORS
>>
>> I think you do want every DOCTYPE that triggers quirks mode or almost
>> quirks mode (or whatever they're called these days) to trigger an error
>> in this mode.
>
> The definition would delegate actions on a quirky doctype to the app so
> that survey spiders could note the situation specifically. Do you expect
> people to do bad things with this mode if I leave even quirkiness
> reporting to the app?

My thought was that it would be more useful for surveys if you don't have to manually add such errors. However, I suppose you can argue either way and I don't feel strongly about it.

--
Anne van Kesteren

From hsivonen at iki.fi Thu Jun 28 07:21:29 2007
From: hsivonen at iki.fi (Henri Sivonen)
Date: Thu, 28 Jun 2007 17:21:29 +0300
Subject: [Imps] HTML 4.01 compatibility modes for an HTML5 parser
In-Reply-To:
References: <494E0389-B04D-403B-ABBD-9063C288F239@iki.fi> <0C42402A-72DB-4855-B350-A9AE3A365D41@iki.fi>
Message-ID: <67E54C85-4D41-41F3-AE44-686670097849@iki.fi>

On Jun 28, 2007, at 17:19, Anne van Kesteren wrote:
> On Thu, 28 Jun 2007 16:11:58 +0200, Henri Sivonen wrote:
>> On Jun 28, 2007, at 17:08, Anne van Kesteren wrote:
>>> On Thu, 28 Jun 2007 16:05:44 +0200, Henri Sivonen wrote:
>>>> NO_DOCTYPE_ERRORS
>>>
>>> I think you do want every DOCTYPE that triggers quirks mode or
>>> almost quirks mode (or whatever they're called these days) to
>>> trigger an error in this mode.
>>
>> The definition would delegate actions on a quirky doctype to the
>> app so that survey spiders could note the situation specifically.
>> Do you expect people to do bad things with this mode if I leave
>> even quirkiness reporting to the app?
>
> My thought was that it would be more useful for surveys if you
> don't have to manually add such errors. However, I suppose you can
> argue either way and I don't feel strongly about it.

There's also the option of adding yet another mode. Make it a pref. ;-)

ALLOW_ANY_STANDARDS_MODE

--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
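
As a rough illustration of the doctype modes discussed in this thread, the sketch below models three of them (HTML, AUTO and NO_DOCTYPE_ERRORS) in Python. It is not the parser's actual code: the names DoctypeMode and check_doctype are hypothetical, the doctype name is assumed to be "html" already, and the doctype required by HTML 5 is assumed to be one with neither a public id nor a system id. The two HTML401_* modes and ALLOW_ANY_STANDARDS_MODE are left out.

```python
from enum import Enum

# Public identifiers defined by the HTML 4.01 specification.
HTML401_STRICT_PUBLIC_ID = "-//W3C//DTD HTML 4.01//EN"
HTML401_TRANSITIONAL_PUBLIC_ID = "-//W3C//DTD HTML 4.01 Transitional//EN"


class DoctypeMode(Enum):
    HTML = "html"
    AUTO = "auto"
    NO_DOCTYPE_ERRORS = "no-doctype-errors"


def check_doctype(mode, public_id=None, system_id=None):
    """Return (doctype_is_error, html4_checks_enabled) for a doctype token."""
    is_html5 = public_id is None and system_id is None

    if mode is DoctypeMode.NO_DOCTYPE_ERRORS:
        # Survey mode: never an error, never HTML4-specific checks.
        return (False, False)

    if mode is DoctypeMode.HTML:
        # Pure HTML5: only the HTML 5 doctype is a non-error.
        return (not is_html5, False)

    if mode is DoctypeMode.AUTO:
        is_strict = public_id == HTML401_STRICT_PUBLIC_ID
        is_transitional = public_id == HTML401_TRANSITIONAL_PUBLIC_ID
        # Transitional is only accepted together with a system id.
        acceptable = (is_html5 or is_strict
                      or (is_transitional and system_id is not None))
        return (not acceptable, is_strict or is_transitional)

    raise ValueError("unsupported mode: %r" % (mode,))


print(check_doctype(DoctypeMode.AUTO, HTML401_TRANSITIONAL_PUBLIC_ID))
```

Returning the two decisions separately mirrors the draft comments, where whether a doctype is an error and whether HTML4-specific checks are turned on are described independently.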