From hsivonen at iki.fi Wed Apr 2 07:58:06 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Wed, 2 Apr 2008 17:58:06 +0300 Subject: [imps] td fragments Message-ID: <011D8708-C1F1-46D8-BED4-D4F0A037297D@iki.fi> Can someone, please, explain to me, what part of the spec makes the insertion mode "in body" when the context is "td" or "th" in the fragment case? Specifically, this test case doesn't pass when I implement "reset the insertion mode" rigorously per spec as far as I can see. 1048 ryansking #data 1048 ryansking
1048 ryansking #errors 1062 ryansking Line: 1 Col: 8 Unexpected end tag (table). Ignored. 1062 ryansking Line: 1 Col: 16 Unexpected end tag (tbody). Ignored. 1062 ryansking Line: 1 Col: 24 Unexpected end tag (tfoot). Ignored. 1062 ryansking Line: 1 Col: 32 Unexpected end tag (thead). Ignored. 1062 ryansking Line: 1 Col: 37 Unexpected end tag (tr). Ignored. 1062 ryansking Line: 1 Col: 42 Expected closing tag. Unexpected end of file. 1114 hsivonen #document-fragment 1114 hsivonen td 1052 ryansking #document 1052 ryansking |
-- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From hsivonen at iki.fi Wed Apr 2 09:18:30 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Wed, 2 Apr 2008 19:18:30 +0300 Subject: [imps]
Message-ID: <46EDA2B4-781F-4FA2-AF1F-6E20644B4E0B@iki.fi> Data:
Expected: | | | | |
|
|
Got: | | | | | |
|
Expected errors:
Line: 1 Col: 33 End tag (form) seen too early. Ignored.
Line: 1 Col: 38 Expected closing tag. Unexpected end of file.

Actual errors:
33: End tag "form" seen but there were unclosed elements.
39: End of file seen and there were open elements.

Can someone, please, explain to me why the test case ignores the </form> tag?
is not scoping per spec, so there is a in scope to close. -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From t.broyer at gmail.com Wed Apr 2 10:49:29 2008 From: t.broyer at gmail.com (Thomas Broyer) Date: Wed, 2 Apr 2008 19:49:29 +0200 Subject: [imps]
In-Reply-To: <46EDA2B4-781F-4FA2-AF1F-6E20644B4E0B@iki.fi> References: <46EDA2B4-781F-4FA2-AF1F-6E20644B4E0B@iki.fi> Message-ID: On Wed, Apr 2, 2008 at 6:18 PM, Henri Sivonen wrote: > Data: >
> Expected: > | > | > | > | > |
> |
> |
> Got: > | > | > | > | > | > |
> |
> Expected errors:
> Line: 1 Col: 33 End tag (form) seen too early. Ignored.
> Line: 1 Col: 38 Expected closing tag. Unexpected end of file.
> Actual errors:
> 33: End tag "form" seen but there were unclosed elements.
> 39: End of file seen and there were open elements.
>
> Can someone, please, explain to me why the test case ignores the </form> tag?

Because it hasn't been updated and the spec changed since it was written?
It might be this one (judging from the commit log):
http://html5.org/tools/web-apps-tracker?from=1319&to=1320

Instead of fixing tests from html5lib's repository, I suggest removing them and contributing new tests in html5's repository:
http://html5.googlecode.com/svn/trunk/tests/

-- Thomas Broyer

From hsivonen at iki.fi Wed Apr 2 11:58:21 2008
From: hsivonen at iki.fi (Henri Sivonen)
Date: Wed, 2 Apr 2008 21:58:21 +0300
Subject: [imps]
In-Reply-To: References: <46EDA2B4-781F-4FA2-AF1F-6E20644B4E0B@iki.fi> Message-ID: <137E8D25-9227-4620-ABC4-C70884F9DC23@iki.fi> On Apr 2, 2008, at 20:49, Thomas Broyer wrote: > Because it hasn't been updated and the spec changed since it has > been written ? > It might be this one (judging from the commit log) > http://html5.org/tools/web-apps-tracker?from=1319&to=1320 OK. I'll fix the test. > Instead of fixing tests from html5lib's repository, I suggest removing > them and contributing new tests in html5's repository > http://html5.googlecode.com/svn/trunk/tests/ That project has a different license and all. Can't we just keep the tests in the html5lib repo under the current license? (Especially since we even got Dan Connolly to escalate and get the MIT license OKed from the W3C point of view. Also, I'd hate to have to think about relicensing of the corporate contribution written by me.) -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From ian at hixie.ch Wed Apr 2 17:50:23 2008 From: ian at hixie.ch (Ian Hickson) Date: Thu, 3 Apr 2008 00:50:23 +0000 (UTC) Subject: [imps] td fragments In-Reply-To: <011D8708-C1F1-46D8-BED4-D4F0A037297D@iki.fi> References: <011D8708-C1F1-46D8-BED4-D4F0A037297D@iki.fi> Message-ID: On Wed, 2 Apr 2008, Henri Sivonen wrote: > > Can someone, please, explain to me, what part of the spec makes the > insertion mode "in body" when the context is "td" or "th" in the > fragment case? None. It's "in cell". However, the way that the "in cell" case is defined for the fragment case, it turns out (I just noticed) that it is indistinguishable from being "in body". > Specifically, this test case doesn't pass when I implement "reset the > insertion mode" rigorously per spec as far as I can see. > > 1048 ryansking #data > 1048 ryansking
> 1048 ryansking #errors > 1062 ryansking Line: 1 Col: 8 Unexpected end tag (table). Ignored. > 1062 ryansking Line: 1 Col: 16 Unexpected end tag (tbody). Ignored. > 1062 ryansking Line: 1 Col: 24 Unexpected end tag (tfoot). Ignored. > 1062 ryansking Line: 1 Col: 32 Unexpected end tag (thead). Ignored. > 1062 ryansking Line: 1 Col: 37 Unexpected end tag (tr). Ignored. > 1062 ryansking Line: 1 Col: 42 Expected closing tag. Unexpected > end of file. > 1114 hsivonen #document-fragment > 1114 hsivonen td > 1052 ryansking #document > 1052 ryansking |
What part doesn't pass? (i.e. what do you get?) -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' From ian at hixie.ch Wed Apr 2 17:52:31 2008 From: ian at hixie.ch (Ian Hickson) Date: Thu, 3 Apr 2008 00:52:31 +0000 (UTC) Subject: [imps]
In-Reply-To: <46EDA2B4-781F-4FA2-AF1F-6E20644B4E0B@iki.fi> References: <46EDA2B4-781F-4FA2-AF1F-6E20644B4E0B@iki.fi> Message-ID: On Wed, 2 Apr 2008, Henri Sivonen wrote: > > Data: >
> Expected: > | > | > | > | > |
> |
> |
> Got: > | > | > | > | > | > |
> |
> Expected errors:
> Line: 1 Col: 33 End tag (form) seen too early. Ignored.
> Line: 1 Col: 38 Expected closing tag. Unexpected end of file.
> Actual errors:
> 33: End tag "form" seen but there were unclosed elements.
> 39: End of file seen and there were open elements.
>
> Can someone, please, explain to me why the test case ignores the </form> tag?
is not scoping per spec, so there is a in scope > to close. This part of the spec changed recently, IIRC. Maybe the test wasn't updated? -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' From hsivonen at iki.fi Thu Apr 3 07:32:06 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Thu, 3 Apr 2008 17:32:06 +0300 Subject: [imps] td fragments In-Reply-To: References: <011D8708-C1F1-46D8-BED4-D4F0A037297D@iki.fi> Message-ID: <7366C75D-473C-4646-AFCF-DB134999A0FE@iki.fi> On Apr 3, 2008, at 03:50, Ian Hickson wrote: > On Wed, 2 Apr 2008, Henri Sivonen wrote: >> >> Can someone, please, explain to me, what part of the spec makes the >> insertion mode "in body" when the context is "td" or "th" in the >> fragment case? > > None. It's "in cell". Why? That is, based on what normative statements in the spec? "3 If node is the first node in the stack of open elements, then set last to true; if, in addition, the context element of the HTML fragment parsing algorithm is neither a td element nor a th element, then set node to the context element. (fragment case)" Now /node/ is "html"--not "td". "5 If node is a td or th element, then switch the insertion mode to "in cell" and abort these steps. " This doesn't match, since it is "html"--not "td". >> Specifically, this test case doesn't pass when I implement "reset the >> insertion mode" rigorously per spec as far as I can see. >> >> 1048 ryansking #data >> 1048 ryansking
>> 1048 ryansking #errors >> 1062 ryansking Line: 1 Col: 8 Unexpected end tag (table). Ignored. >> 1062 ryansking Line: 1 Col: 16 Unexpected end tag (tbody). >> Ignored. >> 1062 ryansking Line: 1 Col: 24 Unexpected end tag (tfoot). >> Ignored. >> 1062 ryansking Line: 1 Col: 32 Unexpected end tag (thead). >> Ignored. >> 1062 ryansking Line: 1 Col: 37 Unexpected end tag (tr). Ignored. >> 1062 ryansking Line: 1 Col: 42 Expected closing tag. Unexpected >> end of file. >> 1114 hsivonen #document-fragment >> 1114 hsivonen td >> 1052 ryansking #document >> 1052 ryansking |
> > What part doesn't pass? (i.e. what do you get?) | | |
-- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From ian at hixie.ch Thu Apr 3 11:34:41 2008 From: ian at hixie.ch (Ian Hickson) Date: Thu, 3 Apr 2008 18:34:41 +0000 (UTC) Subject: [imps] td fragments In-Reply-To: <7366C75D-473C-4646-AFCF-DB134999A0FE@iki.fi> References: <011D8708-C1F1-46D8-BED4-D4F0A037297D@iki.fi> <7366C75D-473C-4646-AFCF-DB134999A0FE@iki.fi> Message-ID: On Thu, 3 Apr 2008, Henri Sivonen wrote: > > > > > > Can someone, please, explain to me, what part of the spec makes the > > > insertion mode "in body" when the context is "td" or "th" in the > > > fragment case? > > > > None. It's "in cell". > > Why? That is, based on what normative statements in the spec? > > "3 If node is the first node in the stack of open elements, then set > last to true; if, in addition, the context element of the HTML fragment > parsing algorithm is neither a td element nor a th element, then set > node to the context element. (fragment case)" > > Now /node/ is "html"--not "td". > > "5 If node is a td or th element, then switch the insertion mode to "in > cell" and abort these steps. " > > This doesn't match, since it is "html"--not "td". Hm, I wonder what that line is doing there. It seems we should remove the "if, in addition, the context element of the HTML fragment parsing algorithm is neither a td element nor a th element" condition, or make that particular situation trigger "in body" rather than "before head". -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' 
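To make the mismatch discussed in this thread concrete, here is a minimal Python sketch of only the two quoted steps (3 and 5) of "reset the insertion mode", applied to the fragment case where the stack of open elements holds just the root html element. Element names stand in for nodes. The function name, the keep_td_th_exception flag, and the final fallback to "in body" are assumptions made for illustration; the html-to-"before head" branch is taken from Ian's remark rather than from spec text quoted in the thread. It is not the full algorithm.

```python
def reset_insertion_mode(stack, context, keep_td_th_exception=True):
    """Sketch of quoted steps 3 and 5 only, for a fragment-case stack.

    stack   -- list of element names, root first (hypothetical representation)
    context -- name of the fragment's context element
    keep_td_th_exception -- True models step 3 as quoted; False models Ian's
                            suggestion of dropping the td/th condition
    """
    if len(stack) != 1:
        raise NotImplementedError("sketch covers only a lone root element")

    node = stack[0]

    # Step 3: node is the first node in the stack, so (in the fragment case)
    # it is replaced by the context element, except when the context element
    # is td or th and the quoted exception is kept.
    if not (keep_td_th_exception and context in ("td", "th")):
        node = context

    # Step 5: td or th switches to "in cell".
    if node in ("td", "th"):
        return "in cell"

    # Assumption, based on Ian's "rather than 'before head'" remark:
    # an html node ends up in "before head".
    if node == "html":
        return "before head"

    # Assumption: placeholder for all the steps this sketch omits.
    return "in body"


print(reset_insertion_mode(["html"], "td"))                              # before head
print(reset_insertion_mode(["html"], "td", keep_td_th_exception=False))  # in cell
```

With the exception kept, a td context leaves node as html, so step 5 never matches; with the exception dropped, node becomes td and the mode is "in cell", which is the outcome Ian says was intended.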
From hsivonen at iki.fi Fri Apr 4 03:53:29 2008
From: hsivonen at iki.fi (Henri Sivonen)
Date: Fri, 4 Apr 2008 13:53:29 +0300
Subject: [imps] Emulating the HTML DOM when actually parsed from XML
Message-ID: <90924D0E-F13D-42C5-ABD1-82A1ABD62D74@iki.fi>

I was thinking that especially with the MathML and SVG additions, it would be great to be able to test the effect of the HTML5 parsing algorithm in current browsers. Since hooking up a parser written in Java or Python into a browser written in C++ is in itself non-trivial, I started considering an HTTP proxy that intercepted text/html and converted it into application/xhtml+xml. (Jetty, Validator.nu parser, Commons HttpClient.)

This approach might even work for static pages, as many people already write their selectors in lower case. However, a big part of Web compat is script compat, and the proxy would make browsers put the DOM in the XML mode. Would it be possible to monkeypatch the features listed at http://wiki.whatwg.org/wiki/HtmlVsXhtml#Scripts using JS prototypes if the proxy injected a script into each document? Except for document.write(), of course. Might someone already have done this?

Then there's the form pointer issue. With Opera, setting the WF2 form attribute would work, but what about Gecko and WebKit?

And then there's the issue that some behavior depends on the character encoding and might break if the document is promoted to UTF-8. Would setting accept-charset on <form> work around this sufficiently?

Any ideas if quirks mode CSS and document.write() would make the whole exercise futile?
-- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From hsivonen at iki.fi Fri Apr 4 06:53:59 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Fri, 4 Apr 2008 16:53:59 +0300 Subject: [imps] Tree construction test in undocumented format Message-ID: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> > 1117 jgraham.html #data > 1117 jgraham.html EN"> > 1117 jgraham.html #errors > 1117 jgraham.html doctype-error > 1117 jgraham.html #document > 1117 jgraham.html | EN" ""> > 1117 jgraham.html | > 1117 jgraham.html | > 1117 jgraham.html | The expected output doesn't follow the documented format: http://wiki.whatwg.org/wiki/Parser_tests I grepped around a bit in the Python code unsuccessfully. It would be nice to have documentation on the wiki. (Particularly around the null, "" and SYSTEM cases.) -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From jg307 at cam.ac.uk Fri Apr 4 08:58:50 2008 From: jg307 at cam.ac.uk (James Graham) Date: Fri, 04 Apr 2008 16:58:50 +0100 Subject: [imps] Tree construction test in undocumented format In-Reply-To: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> Message-ID: <47F6503A.4040309@cam.ac.uk> Henri Sivonen wrote: >> 1117 jgraham.html #data >> 1117 jgraham.html > EN"> >> 1117 jgraham.html #errors >> 1117 jgraham.html doctype-error >> 1117 jgraham.html #document >> 1117 jgraham.html | > EN" ""> >> 1117 jgraham.html | >> 1117 jgraham.html | >> 1117 jgraham.html | Ah, this is indeed my fault > > The expected output doesn't follow the documented format: > http://wiki.whatwg.org/wiki/Parser_tests I don't have time to update the wiki right now (maybe later), but IIRC the format is (using %foo to represent the variable foo) if there is neither a system id or a public ID otherwise This may not be the most sane format ever. -- "Eternity's a terrible thought. I mean, where's it all going to end?" 
-- Tom Stoppard, Rosencrantz and Guildenstern are Dead From annevk at opera.com Fri Apr 4 09:07:46 2008 From: annevk at opera.com (Anne van Kesteren) Date: Fri, 04 Apr 2008 18:07:46 +0200 Subject: [imps] Tree construction test in undocumented format In-Reply-To: <47F6503A.4040309@cam.ac.uk> References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> Message-ID: On Fri, 04 Apr 2008 17:58:50 +0200, James Graham wrote: > if there is neither a system id or a public ID > > otherwise > > This may not be the most sane format ever. Here is my proposal: no public or system ID: public ID, no system ID: no public ID, system ID: public and system ID: (We need to cover all these cases as either the public ID or system ID can be null ("missing").) -- Anne van Kesteren From annevk at opera.com Fri Apr 4 09:11:15 2008 From: annevk at opera.com (Anne van Kesteren) Date: Fri, 04 Apr 2008 18:11:15 +0200 Subject: [imps] Tree construction test in undocumented format In-Reply-To: References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> Message-ID: On Fri, 04 Apr 2008 18:07:46 +0200, Anne van Kesteren wrote: > no public or system ID: > public ID, no system ID: > no public ID, system ID: > public and system ID: Instead of fooling around with spaces alternatively we could put a P or S before the "..." in the case of either a public ID or system ID. Whether or not the document is in quirks mode should probably be something else. Maybe: #document-mode nq|q|lq or something like that. 
-- Anne van Kesteren From hsivonen at iki.fi Fri Apr 4 09:14:13 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Fri, 4 Apr 2008 19:14:13 +0300 Subject: [imps] Tree construction test in undocumented format In-Reply-To: References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> Message-ID: <74FF9010-9885-4905-8FB8-0477E097BDC5@iki.fi> On Apr 4, 2008, at 19:11, Anne van Kesteren wrote: > On Fri, 04 Apr 2008 18:07:46 +0200, Anne van Kesteren > wrote: >> no public or system ID: >> public ID, no system ID: >> no public ID, system ID: >> public and system ID: > > Instead of fooling around with spaces alternatively we could put a P > or S before the "..." in the case of either a public ID or system > ID. Whether or not the document is in quirks mode should probably be > something else. Maybe: > > #document-mode > nq|q|lq > > or something like that. How about where %foo_id is either a double-quoted string or the string 'null' without quotes. But if both are null, use -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From annevk at opera.com Fri Apr 4 09:16:08 2008 From: annevk at opera.com (Anne van Kesteren) Date: Fri, 04 Apr 2008 18:16:08 +0200 Subject: [imps] Tree construction test in undocumented format In-Reply-To: <74FF9010-9885-4905-8FB8-0477E097BDC5@iki.fi> References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> <74FF9010-9885-4905-8FB8-0477E097BDC5@iki.fi> Message-ID: On Fri, 04 Apr 2008 18:14:13 +0200, Henri Sivonen wrote: > How about where %foo_id is > either a double-quoted string or the string 'null' without quotes. > > But if both are null, use > Sure, lets do that. 
-- Anne van Kesteren From jg307 at cam.ac.uk Fri Apr 4 09:44:40 2008 From: jg307 at cam.ac.uk (James Graham) Date: Fri, 04 Apr 2008 17:44:40 +0100 Subject: [imps] Tree construction test in undocumented format In-Reply-To: References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> <74FF9010-9885-4905-8FB8-0477E097BDC5@iki.fi> Message-ID: <47F65AF8.80306@cam.ac.uk> Anne van Kesteren wrote: > On Fri, 04 Apr 2008 18:14:13 +0200, Henri Sivonen wrote: >> How about where %foo_id is >> either a double-quoted string or the string 'null' without quotes. >> >> But if both are null, use >> > > Sure, lets do that. I would prefer not to use the unmatched quote, since some editors insert double quotes automatically. Is there a problem with just doing: If we need to distinguish the null case from the empty string case we could just, as suggested, replace the whole quoted thing with unquoted null in those cases. (the no-public/system case should still be ) -- "Eternity's a terrible thought. I mean, where's it all going to end?" -- Tom Stoppard, Rosencrantz and Guildenstern are Dead From hsivonen at iki.fi Fri Apr 4 11:32:40 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Fri, 4 Apr 2008 21:32:40 +0300 Subject: [imps] Tree construction test in undocumented format In-Reply-To: <47F65AF8.80306@cam.ac.uk> References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> <74FF9010-9885-4905-8FB8-0477E097BDC5@iki.fi> <47F65AF8.80306@cam.ac.uk> Message-ID: <119BD728-9651-49EB-98B1-20456CA4CAF4@iki.fi> On Apr 4, 2008, at 19:44, James Graham wrote: > Is there a problem with just doing: > > > If we need to distinguish the null case from the empty string case > we could just, as suggested, replace the whole quoted thing with > unquoted null in those cases. Oops. That's what I meant. Too many typos lately. 
-- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From hsivonen at iki.fi Fri Apr 4 11:45:26 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Fri, 4 Apr 2008 21:45:26 +0300 Subject: [imps] Tree construction test in undocumented format In-Reply-To: <119BD728-9651-49EB-98B1-20456CA4CAF4@iki.fi> References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> <74FF9010-9885-4905-8FB8-0477E097BDC5@iki.fi> <47F65AF8.80306@cam.ac.uk> <119BD728-9651-49EB-98B1-20456CA4CAF4@iki.fi> Message-ID: <52FE5396-925C-4581-9860-043CB148E003@iki.fi> On Apr 4, 2008, at 21:32, Henri Sivonen wrote: > On Apr 4, 2008, at 19:44, James Graham wrote: >> Is there a problem with just doing: >> >> >> If we need to distinguish the null case from the empty string case >> we could just, as suggested, replace the whole quoted thing with >> unquoted null in those cases. > > Oops. That's what I meant. Too many typos lately. Even more oops: Neither string can be null per the current spec, so the rule should be unless both are the empty string in which case -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From t.broyer at gmail.com Fri Apr 4 17:24:48 2008 From: t.broyer at gmail.com (Thomas Broyer) Date: Sat, 5 Apr 2008 02:24:48 +0200 Subject: [imps] Tree construction test in undocumented format In-Reply-To: References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> Message-ID: On Fri, Apr 4, 2008 at 6:11 PM, Anne van Kesteren wrote: > On Fri, 04 Apr 2008 18:07:46 +0200, Anne van Kesteren > wrote: > > no public or system ID: > > public ID, no system ID: > > no public ID, system ID: > > public and system ID: > > Instead of fooling around with spaces alternatively we could put a P or S > before the "..." in the case of either a public ID or system ID. Whether > or not the document is in quirks mode should probably be something else. > Maybe: > > #document-mode > nq|q|lq > > or something like that. 
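The DOCTYPE line format that the "Tree construction test in undocumented format" thread converges on can be sketched in a few lines of Python. The literal format strings did not survive in this archive, so the exact shape below (the name, then the public and system IDs in double quotes, with both omitted when both are the empty string) is an assumption based on Henri's final wording. serialize_doctype is a hypothetical helper, not html5lib code.

```python
def serialize_doctype(name, public_id="", system_id=""):
    """Return the expected-tree line for a DOCTYPE node (assumed format).

    Neither ID is ever null here; a missing ID is the empty string,
    following the last message in the thread.
    """
    if public_id == "" and system_id == "":
        # Both IDs empty: name only.
        return "| <!DOCTYPE %s>" % name
    # Otherwise always emit both IDs, quoted, even if one is empty.
    return '| <!DOCTYPE %s "%s" "%s">' % (name, public_id, system_id)


print(serialize_doctype("html"))
print(serialize_doctype("html", "-//W3C//DTD HTML 4.01//EN", ""))
```

Always quoting both IDs avoids the unmatched-quote problem James raised and keeps an empty system ID distinguishable from an absent pair of IDs.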
In http://html5.googlecode.com/svn/trunk/tests/tree-construction an XML-like/SGML-like DOCTYPE serialization:
- - - -

And in http://html5.googlecode.com/svn/trunk/tests/tree-construction/compatibility-mode.dat:
#compatibility-mode no quirks
or
#compatibility-mode quirks
or
#compatibility-mode limited quirks

Note that I've also used for comments instead of the current (note the spaces).

-- Thomas Broyer

From edwardzyang at thewritingpot.com Fri Apr 4 21:06:42 2008
From: edwardzyang at thewritingpot.com (Edward Z. Yang)
Date: Sat, 05 Apr 2008 00:06:42 -0400
Subject: [imps] HTML5 and libxml2
Message-ID: <47F6FAD2.8010105@thewritingpot.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

As per the W3C 5 April 2008 working draft, elements not recognized by HTML5 in body are still added to the DOM using the "A start tag token not covered by the previous entries" rule. HTML5 does not specify any validation mechanism by which to ensure the element has the form stipulated by tag name, i.e. [A-Za-z-]+

Unfortunately, certain tag names cause libxml2 to choke, and HTML5 doesn't specify any way to:

1. Munge the name into something libxml2 finds acceptable
2. Ignore the tag as invalid

Without modifying the algorithms, (2) is not tenable, so I've been looking at (1). However, HTML5's tag name stipulations appear to be too restrictive: they do not allow digits as seen in

and friends, and aren't even a subset of the allowed XML tag names (XML specifies that a hyphen cannot lead in a tag name, and allows a greater variety of punctuation and international characters). So, in short, due to underlying library limitations I can't put arbitrary characters in a tag (which is what Firefox actually seems to do), and I don't know exactly what characters I need to get rid of. Advice? [1] http://www.w3.org/html/wg/html5/#tag-name [2] http://www.w3.org/TR/REC-xml/#NT-Name - -- Edward Z. Yang GnuPG: 0x869C48DA HTML Purifier Anti-XSS Filter [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH9vrSqTO+fYacSNoRAvt/AJ494M4fINnrRUAf/GbJgvvjoP6XqgCdGE4a /1CeZKB6aFjfU+CEBzhukXA= =nwJ1 -----END PGP SIGNATURE----- From ian at hixie.ch Fri Apr 4 22:34:24 2008 From: ian at hixie.ch (Ian Hickson) Date: Sat, 5 Apr 2008 05:34:24 +0000 (UTC) Subject: [imps] HTML5 and libxml2 In-Reply-To: <47F6FAD2.8010105@thewritingpot.com> References: <47F6FAD2.8010105@thewritingpot.com> Message-ID: On Sat, 5 Apr 2008, Edward Z. Yang wrote: > > HTML5 does not specify any validation mechanism in which to ensure the > element has the form stipulated by tag name, i.e. [A-Za-z-]+ That (erroneous, as it happens) paragraph is just describing a trend in the spec's tag names, it's not a conformance criteria of any kind. The conformance criteria is really just that the elements in the document have to be the elements defined by the spec. You may find this post helpful in determining how to read the HTML5 spec: http://ln.hixie.ch/?start=1140242962&count=1 > Unfortunately, certain tag names causes libxml2 to choke, and HTML5 > doesn't specify any way to: > > 1. Munge the name into something libxml2 finds acceptable > 2. Ignore the tag as invalid Indeed, both of these behaviours would be non-conforming. Can you change libxml2 to support more characters? 
Is there a real technical reason for the limitation, or is it just enforcing XML requirements?

The characters allowed in tag names are by far not the only area where XML and HTML differ, so if it is just a matter of libxml2 enforcing XML's requirements, it will not work well.

> So, in short, due to underlying library limitations I can't put
> arbitrary characters in a tag (which is what Firefox actually seems to
> do), and I don't know exactly what characters I need to get rid of. Advice?

If you can't implement what the spec requires, then make sure to document the limitations clearly in your documentation. Meanwhile, you can probably get away with replacing unusable characters with U+FFFD, or at a pinch, "_", so long as you still use the full tag names in the parser to determine which tags are open. However, make sure to document this as being a conformance problem in your documentation.

-- 
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'

From hsivonen at iki.fi Sat Apr 5 01:10:40 2008
From: hsivonen at iki.fi (Henri Sivonen)
Date: Sat, 5 Apr 2008 11:10:40 +0300
Subject: [imps] HTML5 and libxml2
In-Reply-To: <47F6FAD2.8010105@thewritingpot.com>
References: <47F6FAD2.8010105@thewritingpot.com>
Message-ID:

On Apr 5, 2008, at 07:06, Edward Z. Yang wrote:

> Unfortunately, certain tag names causes libxml2 to choke, and HTML5
> doesn't specify any way to:
>
> 1. Munge the name into something libxml2 finds acceptable
> 2. Ignore the tag as invalid
>
> Without modifying the algorithms, (2) is not tenable, so I've been
> looking at (1).
[...]
> So, in short, due to underlying library limitations I can't put
> arbitrary characters in a tag (which is what Firefox actually seems to
> do), and I don't know exactly what characters I need to get rid of.
> Advice?

In the Validator.nu HTML parser, I've solved this by having three available policies:

    public enum XmlViolationPolicy {
        /**
         * Conform to HTML 5, allow XML 1.0 to be violated.
         */
        ALLOW,

        /**
         * Halt when something cannot be mapped to XML 1.0.
         */
        FATAL,

        /**
         * Be non-conforming and alter the infoset to fit
         * XML 1.0 when something would otherwise not be
         * mappable to XML 1.0.
         */
        ALTER_INFOSET
    }

It seems like ALLOW isn't a possibility for libxml2.

With ALTER_INFOSET, tag tokens that do not match Namespaces in XML 1.0 NCName are ignored in the tokenizer. This is non-conforming but works most of the time. (There are many more similar situations you can find by searching for ALTER_INFOSET in the source.)

-- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/

From annevk at opera.com Sat Apr 5 05:17:21 2008
From: annevk at opera.com (Anne van Kesteren)
Date: Sat, 05 Apr 2008 14:17:21 +0200
Subject: [imps] HTML5 parser test location
Message-ID:

Just an FYI since there seems to be some misinformation about this (for which I might have been responsible, not sure), but hereby a note that we'd like to keep all HTML5 parser tests in the html5lib project tree and not move them all to the html5 project tree. For licensing reasons, because the html5 project owner doesn't like it, and because it just isn't worth the trouble.

(If I was unclear about this in the past or have given the impression of supporting the opposite view, my apologies.)

-- Anne van Kesteren

From t.broyer at gmail.com Sat Apr 5 06:22:48 2008
From: t.broyer at gmail.com (Thomas Broyer)
Date: Sat, 5 Apr 2008 15:22:48 +0200
Subject: [imps] HTML5 parser test location
In-Reply-To:
References:
Message-ID:

On Sat, Apr 5, 2008 at 2:17 PM, Anne van Kesteren wrote:

> Just an FYI since there seems to be some misinformation about this (for
> which I might have been responsible, not sure), but hereby a note that we'd
> like to keep all HTML5 parser tests in the html5lib project tree and not
> move them all to the html5 project tree. For licensing reasons, because the
> html5 project owner doesn't like it, and because it just isn't worth the
> trouble.
>
> (If I was unclear about this in the past or have given the impression of
> supporting the opposite view, my apologies.)

FYI, I started the new tests (not really a "move" 'cause I've been rewriting them all from scratch) based on the following thread:
http://lists.whatwg.org/pipermail/implementors-whatwg.org/2007-July/000126.html

Particularly the following two messages:
http://lists.whatwg.org/pipermail/implementors-whatwg.org/2007-July/000127.html
http://lists.whatwg.org/pipermail/implementors-whatwg.org/2007-August/000129.html

I personally have no preference for any license (MIT or Apache 2), so if people (including those who don't work on html5lib, like Henri Sivonen) prefer the tests to be MIT-licensed, I don't mind. But please note that the tests in the html5 repository are complete rewrites, not copies from html5lib's tests.

Now, if Ian changed his mind, and to keep them "implementation-agnostic", how about an html5-tests project? (MIT-licensed if that's what implementors want)

My main goal with the new tests is to keep them:
- independent of any implementation, so that we can keep them in sync with the spec, not the software (see the above thread's first message)
- organized wrt the spec, so that when the spec changes it's easier to locate the tests that need to be updated

Projects using those tests could specify a particular revision they aim to "pass", in their svn:externals import for example, and change the svn:externals revision when they update the implementation to follow the spec changes.

-- Thomas Broyer

From annevk at opera.com Sat Apr 5 08:01:17 2008
From: annevk at opera.com (Anne van Kesteren)
Date: Sat, 05 Apr 2008 17:01:17 +0200
Subject: [imps] HTML5 parser test location
In-Reply-To:
References:
Message-ID:

On Sat, 05 Apr 2008 15:22:48 +0200, Thomas Broyer wrote:

> My main goal with the new tests is to keep them:
> - independant of any implementation, so that we can keep them in sync
> with the spec, not the software (see the above thread's first message)

I thought about this and agree that this is problematic. Given that the main code is more stable, maybe we could move to a model where it is not problematic for the trunk code to fail tests that have been reviewed by several sources. To make a new release we basically need to pass all tests we currently fail on trunk instead of trying to develop tests and code side by side. This would remove the need for the tests to be independent of the implementation.

If people still prefer stability for their implementation in the html5lib trunk tree, they could make a copy of the test suite and merge that every time the test suite changes while updating their implementation.

-- Anne van Kesteren

From edwardzyang at thewritingpot.com Sat Apr 5 08:04:13 2008
From: edwardzyang at thewritingpot.com (Edward Z. Yang)
Date: Sat, 05 Apr 2008 11:04:13 -0400
Subject: [imps] HTML5 and libxml2
In-Reply-To:
References: <47F6FAD2.8010105@thewritingpot.com>
Message-ID: <47F794ED.3090801@thewritingpot.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Hickson wrote:

> That (erroneous, as it happens) paragraph is just describing a trend
> in the spec's tag names, it's not a conformance criteria of any kind.

Should I submit a patch fixing the error?

> The conformance criteria is really just that the elements in the
> document have to be the elements defined by the spec.

But the spec also defines behavior when elements are outside of the spec, i.e. an error-condition. I'd appreciate it if the allowed tag names is made a normative requirement for such elements.

> You may find this post helpful in determining how to read the HTML5
> spec:

Thanks. I've heard of RFC2119 before, but I didn't realize that statements of fact that don't use those keywords should not be considered normative. Many of W3C's specs explicitly state which statements are normative and which are informative.

> Can you change libxml2 to support more characters? Is there a real
> technical reason for the limitation, or is it just enforcing XML
> requirements?

I've pinged the libxml2 list, should have an answer back soon.

> The characters allowed in tag names are by far not the only area
> where XML and HTML differ, so if it is just a matter of libxml2
> enforcing XML's requirements, it will not work well.

What are these differences explicitly?

> If you can't implement what the spec requires, then make sure to
> document the limitations clearly in your documentation. Meanwhile,
> you can probably get away with replacing unusable characters with
> U+FFFD,

Unfortunately, U+FFFD is an invalid character too. :-)

> or at a pinch, "_", so long as you still use the full tag names in
> the parser to determine which tags are open. However, make sure to
> document this as being a conformance problem in your documentation.

This might be tricky, and it occurs to me that as long as the substitution process works the same for the tags, t becomes t which is equivalent. I will, of course, document it. Henri Sivonen: > With ALTER_INFOSET, tag tokens that do not match Namespaces in XML > 1.0 NCName are ignored in the tokenizer. This is non-conforming but > works most of the time. (There are many more similar situations you > can find by searching for ALTER_INFOSET in the source.) This is what I had been considering with (2), but it looked like I'd have to make multiple modifications in the algorithm to get that to work. I would look at the source, but I can't seem to find it! All I can find is the build script, and I don't have Python. - -- Edward Z. Yang GnuPG: 0x869C48DA HTML Purifier Anti-XSS Filter [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH95TtqTO+fYacSNoRAk+IAJ4gPLXGHbSuAsUQBaO2Fgu4XMm5WQCfbSd/ JAcnZflMEh0uxRbJ2gwww9E= =t+U6 -----END PGP SIGNATURE----- From hsivonen at iki.fi Sat Apr 5 13:06:07 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Sat, 5 Apr 2008 23:06:07 +0300 Subject: [imps] HTML5 and libxml2 In-Reply-To: <47F794ED.3090801@thewritingpot.com> References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> Message-ID: On Apr 5, 2008, at 18:04, Edward Z. Yang wrote: > Henri Sivonen: >> With ALTER_INFOSET, tag tokens that do not match Namespaces in XML >> 1.0 NCName are ignored in the tokenizer. This is non-conforming but >> works most of the time. (There are many more similar situations you >> can find by searching for ALTER_INFOSET in the source.) > > This is what I had been considering with (2), but it looked like I'd > have to make multiple modifications in the algorithm to get that to > work. I would look at the source, but I can't seem to find it! 
> All I can find is the build script, and I don't have Python. The parser source is also in the parser distribution package available from: http://about.validator.nu/htmlparser/ (Currently: http://about.validator.nu/htmlparser/htmlparser-1.0.7.zip ) -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From ian at hixie.ch Sat Apr 5 14:55:57 2008 From: ian at hixie.ch (Ian Hickson) Date: Sat, 5 Apr 2008 21:55:57 +0000 (UTC) Subject: [imps] HTML5 parser test location In-Reply-To: References: Message-ID: On Sat, 5 Apr 2008, Thomas Broyer wrote: > > Now, if Ian changed his mind I'm happy for the html5 project to be used if that's what people want, my only concern is that they already are known to be elsewhere now and I really don't want a mixture of tests in different places, as that will just fragment our progress. Sorry for flipflopping. -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' From ian at hixie.ch Sat Apr 5 15:05:00 2008 From: ian at hixie.ch (Ian Hickson) Date: Sat, 5 Apr 2008 22:05:00 +0000 (UTC) Subject: [imps] HTML5 and libxml2 In-Reply-To: <47F794ED.3090801@thewritingpot.com> References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> Message-ID: On Sat, 5 Apr 2008, Edward Z. Yang wrote: > > Ian Hickson wrote: > > That (erroneous, as it happens) paragraph is just describing a trend > > in the spec's tag names, it's not a conformance criteria of any kind. > > Should I submit a patch fixing the error? I have noted it and will fix it in due course. :-) > > The conformance criteria is really just that the elements in the > > document have to be the elements defined by the spec. > > But the spec also defines behavior when elements are outside of the > spec, i.e. an error-condition. I'd appreciate it if the allowed tag > names is made a normative requirement for such elements. 
The normative requirement for such elements is that they are _all_ invalid, even if they just use a-z characters. The range of characters that can be used by elements that aren't allowed is the empty range. > > The characters allowed in tag names are by far not the only area where > > XML and HTML differ, so if it is just a matter of libxml2 enforcing > > XML's requirements, it will not work well. > > What are these differences explicitly? Well, for example, an XML comment cannot contain the string "--". > > If you can't implement what the spec requires, then make sure to > > document the limitations clearly in your documentation. Meanwhile, you > > can probably get away with replacing unusable characters with U+FFFD, > > or at a pinch, "_", so long as you still use the full tag names in the > > parser to determine which tags are open. However, make sure to > > document this as being a conformance problem in your documentation. > > This might be tricky, and it occurs to me that as long as the > substitution process works the same for the tags, t becomes > t which is equivalent. I will, of course, document it. What I meant is make sure that your code handles: X ...as creating a DOM tree where the third tag above closes the first one, not the second one. i.e. in your parser and the stack of elements you should keep the original tag names, and only give the munged tag names to the DOM tree. HTH, -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' From edwardzyang at thewritingpot.com Sat Apr 5 15:48:30 2008 From: edwardzyang at thewritingpot.com (Edward Z. 
Yang) Date: Sat, 05 Apr 2008 18:48:30 -0400 Subject: [imps] HTML5 and libxml2 In-Reply-To: References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> Message-ID: <47F801BE.2050404@thewritingpot.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 As an informative note, the tag limitation is with PHP's DOM extension and not libxml2. I'm probably going to do an implementation similar to Validator.nu's, although that's given that I'm interested enough in making major architectural changes to the unmaintained PH5P. Ian Hickson wrote: > The normative requirement for such elements is that they are _all_ > invalid, even if they just use a-z characters. The range of characters > that can be used by elements that aren't allowed is the empty range. I understand this; however, since HTML has graceful error handling, even though such elements are invalid we should still have well-defined handling for them. Which, I suppose, it does. :-) > Well for example an XML comment cannot contain the string "--". I took a look at the source code for Validator.nu and all the differences are there. > What I meant is make sure that you code handles: > > X > > ...as creating a DOM tree where the third tag above closes the first one, > not the second one. i.e. in your parser and the stack of elements you > should keep the original tag names, and only give the munged tag names to > the the DOM tree. Duly noted. - -- Edward Z. 
Yang GnuPG: 0x869C48DA HTML Purifier Anti-XSS Filter [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH+AG+qTO+fYacSNoRAjXNAJ4jDKNJ/tqPCGe6px+mWAY8yK/+nQCcD85N JsMyioGbTvC3OYdVnAPrus4= =e5ES -----END PGP SIGNATURE----- From foolistbar at googlemail.com Sun Apr 6 04:32:42 2008 From: foolistbar at googlemail.com (Geoffrey Sneddon) Date: Sun, 6 Apr 2008 12:32:42 +0100 Subject: [imps] HTML5 and libxml2 In-Reply-To: <47F801BE.2050404@thewritingpot.com> References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> <47F801BE.2050404@thewritingpot.com> Message-ID: On 5 Apr 2008, at 23:48, Edward Z. Yang wrote: > As an informative note, the tag limitation is with PHP's DOM extension > and not libxml2. I'm probably going to do an implementation similar to > Validator.nu's, although that's given that I'm interested enough in > making major architectural changes to the unmaintained PH5P. Is there a bug report in the PHP bug database? This most certainly is a violation of the DOM specification. -- Geoffrey Sneddon From edwardzyang at thewritingpot.com Sun Apr 6 06:08:55 2008 From: edwardzyang at thewritingpot.com (Edward Z. Yang) Date: Sun, 06 Apr 2008 09:08:55 -0400 Subject: [imps] HTML5 and libxml2 In-Reply-To: References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> <47F801BE.2050404@thewritingpot.com> Message-ID: <47F8CB67.6010905@thewritingpot.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Geoffrey Sneddon wrote: > Is there a bug report in the PHP bug database? This most certainly is a > violation of the DOM specification. http://bugs.php.net/bug.php?id=44648 Although, the DOM specification clearly states that an INVALID_CHARACTER_ERROR should be thrown when the tag name is "invalid". - -- Edward Z. 
Yang GnuPG: 0x869C48DA HTML Purifier Anti-XSS Filter [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH+MtmqTO+fYacSNoRAg6+AJ4vnReEXBH9eOGVMXhszYOgSFGBmQCfWYOs NFG5AmZu5qgzeM6aThWBaDA= =JrVQ -----END PGP SIGNATURE----- From foolistbar at googlemail.com Sun Apr 6 06:57:16 2008 From: foolistbar at googlemail.com (Geoffrey Sneddon) Date: Sun, 6 Apr 2008 14:57:16 +0100 Subject: [imps] HTML5 and libxml2 In-Reply-To: <47F8CB67.6010905@thewritingpot.com> References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> <47F801BE.2050404@thewritingpot.com> <47F8CB67.6010905@thewritingpot.com> Message-ID: On 6 Apr 2008, at 14:08, Edward Z. Yang wrote: > Geoffrey Sneddon wrote: >> Is there a bug report in the PHP bug database? This most certainly >> is a >> violation of the DOM specification. > > http://bugs.php.net/bug.php?id=44648 > > Although, the DOM specification clearly states that an > INVALID_CHARACTER_ERROR should be thrown when the tag name is > "invalid". What happens when you set DOMDocument::$strictErrorChecking to false? In the DOM spec, behaviour then is undefined. -- Geoffrey Sneddon From edwardzyang at thewritingpot.com Sun Apr 6 10:16:02 2008 From: edwardzyang at thewritingpot.com (Edward Z. Yang) Date: Sun, 06 Apr 2008 13:16:02 -0400 Subject: [imps] HTML5 and libxml2 In-Reply-To: References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> <47F801BE.2050404@thewritingpot.com> <47F8CB67.6010905@thewritingpot.com> Message-ID: <47F90552.2080905@thewritingpot.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Geoffrey Sneddon wrote: > What happens when you set DOMDocument::$strictErrorChecking to false? In > the DOM spec, behaviour then is undefined. This is a proprietary extension, and defines whether or not PHP should throw actual Exceptions with DOM errors, or emit warnings. 
The behavior with DOM, provided the exception is properly caught, remains the same, as after the C code invokes the exception or emits the error, control is passed back to the PHP interpreter. (sorry about the dupe; accidentally made an off-list post) - -- Edward Z. Yang GnuPG: 0x869C48DA HTML Purifier Anti-XSS Filter [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH+QVSqTO+fYacSNoRAqZnAJ4hmwcMZNLNW/sPEOs21uKiR73wAQCfdBcJ bmv2RYCgt2wjWwd4vDEZ3jY= =ris/ -----END PGP SIGNATURE----- From foolistbar at googlemail.com Sun Apr 6 10:36:56 2008 From: foolistbar at googlemail.com (Geoffrey Sneddon) Date: Sun, 6 Apr 2008 18:36:56 +0100 Subject: [imps] HTML5 and libxml2 In-Reply-To: <47F90552.2080905@thewritingpot.com> References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> <47F801BE.2050404@thewritingpot.com> <47F8CB67.6010905@thewritingpot.com> <47F90552.2080905@thewritingpot.com> Message-ID: <0CAB6556-942F-42FC-80B4-50B1AD2819B0@googlemail.com> On 6 Apr 2008, at 18:16, Edward Z. Yang wrote: > Geoffrey Sneddon wrote: >> What happens when you set DOMDocument::$strictErrorChecking to >> false? In >> the DOM spec, behaviour then is undefined. > > This is a proprietary extension, and defines whether or not PHP should > throw actual Exceptions with DOM errors, or emit warnings. It's not proprietary: it's part of DOM Level 3 Core, as PHP claims to implement (see ). > The behavior > with DOM, provided the exception is properly caught, remains the same, > as after the C code invokes the exception or emits the error, > control is > passed back to the PHP interpreter. Ah. So therefore it doesn't actually allow the DOM to hold characters it otherwise wouldn't (like a a@ localName). -- Geoffrey Sneddon From edwardzyang at thewritingpot.com Sun Apr 6 10:47:46 2008 From: edwardzyang at thewritingpot.com (Edward Z. 
Yang) Date: Sun, 06 Apr 2008 13:47:46 -0400 Subject: [imps] HTML5 and libxml2 In-Reply-To: <0CAB6556-942F-42FC-80B4-50B1AD2819B0@googlemail.com> References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> <47F801BE.2050404@thewritingpot.com> <47F8CB67.6010905@thewritingpot.com> <47F90552.2080905@thewritingpot.com> <0CAB6556-942F-42FC-80B4-50B1AD2819B0@googlemail.com> Message-ID: <47F90CC2.4050105@thewritingpot.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Geoffrey Sneddon wrote: > It's not proprietary: it's part of DOM Level 3 Core, as PHP claims to > implement (see > ). You're right---I was looking at DOM Level 2 Core. > Ah. So therefore it doesn't actually allow the DOM to hold characters it > otherwise wouldn't (like a a@ localName). Precisely. And since the behavior is undefined, the PHP developers are free to implement this however they want. To make these errors not stop execution without strictErrorChecking, one would probably have to macro-fy php_dom_throw_error to include the appropriate return values, and then remove any trailing RETURN_* macro-calls... which doesn't really seem worth it for them, although that makes strictErrorChecking slightly useless. :-) It also makes me slightly worried about cases where libxml2 *requires* that the validation is done (for example, libxml2 does not appear to be binary safe, whereas PHP strings are). - -- Edward Z. 
Yang GnuPG: 0x869C48DA HTML Purifier Anti-XSS Filter [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH+QzCqTO+fYacSNoRAoELAJwLG3NzFvvkXZK0fCgbVyjsvp+V6wCfYeDv a/6q8ySgwJ2TfjpKQSxSFr0= =/16o -----END PGP SIGNATURE----- From jg307 at cam.ac.uk Mon Apr 7 07:36:53 2008 From: jg307 at cam.ac.uk (James Graham) Date: Mon, 07 Apr 2008 15:36:53 +0100 Subject: [imps] HTML5 parser test location In-Reply-To: References: Message-ID: <47FA3185.9030302@cam.ac.uk> Thomas Broyer wrote: > On Sat, Apr 5, 2008 at 2:17 PM, Anne van Kesteren wrote: >> Just an FYI since there seems to be some misinformation about this (for >> which I might have been responsible, not sure), but hereby a note that we'd >> like to keep all HTML5 parser tests in the html5lib project tree and not >> move them all to the html5 project tree. For licensing reasons, because the >> html5 project owner doesn't like it, and because it just isn't worth the >> trouble. >> >> (If I was unclear about this in the past or have given the impression of >> supporting the opposite view, my apologies.) > > FYI, I started the new tests (not really a "move" 'cause I've been > rewriting them all from scratch) based on the following thread: > http://lists.whatwg.org/pipermail/implementors-whatwg.org/2007-July/000126.html > Particularly the following two messages: > http://lists.whatwg.org/pipermail/implementors-whatwg.org/2007-July/000127.html > http://lists.whatwg.org/pipermail/implementors-whatwg.org/2007-August/000129.html My view on this is: * The tests must remain MIT licensed. Therefore they can't go in the html5 repository * I'm -0 on moving the tests at all. That is, I'm not strongly against it but I'm not sure it represents a good investment of time. * If we do decide to move tests, we should consider the advantages of a distributed version control system like Mercurial. 
It seems like a situation where slightly disjoint groups of people are all editing a common set of files might play to the strengths of those systems. * I am totally against rewriting tests. The current tests have often been written in response to actual regressions in software. Throwing away all that knowledge of fragile points in the various implementations is unacceptable. Adding extra tests is of course fine. * One of the identified problems with the current test suite is that it is hard to determine which tests need to change when the spec changes. There are various ways to improve this without starting over. Specifically it is not hard to instrument html5lib to monitor which phase it is in at a given time. One can imagine using this to automatically identify testcases that cause the parser to go through altered parts of the spec. -- "Eternity's a terrible thought. I mean, where's it all going to end?" -- Tom Stoppard, Rosencrantz and Guildenstern are Dead From ryan at theryanking.com Mon Apr 7 12:49:18 2008 From: ryan at theryanking.com (Ryan King) Date: Mon, 7 Apr 2008 12:49:18 -0700 Subject: [imps] HTML5 parser test location In-Reply-To: <47FA3185.9030302@cam.ac.uk> References: <47FA3185.9030302@cam.ac.uk> Message-ID: <2640F92A-964E-494A-8574-EC557FCC4702@theryanking.com> On Apr 7, 2008, at 7:36 AM, James Graham wrote: > * I am totally against rewriting tests. The current tests have often > been > written in response to actual regressions in software. Throwing away > all that > knowledge of fragile points in the various implementations is > unacceptable. > Adding extra tests is of course fine. I want to second this point. Almost all the tests I've written have been to deal with issues discovered in the ruby implementation. 
-ryan From rubys at intertwingly.net Mon Apr 7 18:14:29 2008 From: rubys at intertwingly.net (Sam Ruby) Date: Mon, 07 Apr 2008 21:14:29 -0400 Subject: [imps] HTML5 parser test location In-Reply-To: <2640F92A-964E-494A-8574-EC557FCC4702@theryanking.com> References: <47FA3185.9030302@cam.ac.uk> <2640F92A-964E-494A-8574-EC557FCC4702@theryanking.com> Message-ID: <47FAC6F5.6010809@intertwingly.net> Ryan King wrote: > On Apr 7, 2008, at 7:36 AM, James Graham wrote: >> * I am totally against rewriting tests. The current tests have often >> been >> written in response to actual regressions in software. Throwing away >> all that >> knowledge of fragile points in the various implementations is >> unacceptable. >> Adding extra tests is of course fine. > > I want to second this point. Almost all the tests I've written have > been to deal with issues discovered in the ruby implementation. Is there some way we can segregate the tests into ones that we expect to pass and ones that we (currently) don't expect to pass? - Sam Ruby From edwardzyang at thewritingpot.com Mon Apr 7 15:55:29 2008 From: edwardzyang at thewritingpot.com (Edward Z. Yang) Date: Mon, 07 Apr 2008 18:55:29 -0400 Subject: [imps] HTML5 parser test location In-Reply-To: <47FAC6F5.6010809@intertwingly.net> References: <47FA3185.9030302@cam.ac.uk> <2640F92A-964E-494A-8574-EC557FCC4702@theryanking.com> <47FAC6F5.6010809@intertwingly.net> Message-ID: <47FAA661.5000207@thewritingpot.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Sam Ruby wrote: > Is there some way we can segregate the tests into ones that we expect to > pass and ones that we (currently) don't expect to pass? How about an extra boolean flag in the test structs? This has the added benefit of not needing to shuffle tests around once they do start passing, and also being able to test the inverse: whether or not the tests we don't expect to pass are failing. - -- Edward Z. 
Yang GnuPG: 0x869C48DA HTML Purifier Anti-XSS Filter [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD4DBQFH+qZhqTO+fYacSNoRArtkAJ9F5kNuoLaqjNhWed5Stb12+Eq0tACXXidx xAavEyPMOYGmhqLBy16ORA== =skuR -----END PGP SIGNATURE----- From edwardzyang at thewritingpot.com Mon Apr 7 16:17:39 2008 From: edwardzyang at thewritingpot.com (Edward Z. Yang) Date: Mon, 07 Apr 2008 19:17:39 -0400 Subject: [imps] HTML5 parser test location In-Reply-To: <47FAA7DC.9060201@cam.ac.uk> References: <47FA3185.9030302@cam.ac.uk> <2640F92A-964E-494A-8574-EC557FCC4702@theryanking.com> <47FAC6F5.6010809@intertwingly.net> <47FAA661.5000207@thewritingpot.com> <47FAA7DC.9060201@cam.ac.uk> Message-ID: <47FAAB93.8000400@thewritingpot.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 James Graham wrote: > That doesn't work because multiple people are using the tests and their > implementations may be at different levels of conformance. > > We'd have to have out-of-band metadata like a list of tests to skip. I don't think that was Sam Ruby's intent; I took it to mean tests for which the specification itself is faulty. If it has to do with differing levels of conformance, we're SOL. - -- Edward Z. 
Yang GnuPG: 0x869C48DA HTML Purifier Anti-XSS Filter [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH+quTqTO+fYacSNoRAh+sAJ9LdyrcdCu1ho9Ec1o3pVdM3volJQCdHIx4 TXAURwb97EcQ/Zpt6LL4N9Q= =+7hX -----END PGP SIGNATURE----- From hsivonen at iki.fi Wed Apr 2 07:58:06 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Wed, 2 Apr 2008 17:58:06 +0300 Subject: [imps] td fragments Message-ID: <011D8708-C1F1-46D8-BED4-D4F0A037297D@iki.fi> Can someone, please, explain to me, what part of the spec makes the insertion mode "in body" when the context is "td" or "th" in the fragment case? Specifically, this test case doesn't pass when I implement "reset the insertion mode" rigorously per spec as far as I can see. 1048 ryansking #data 1048 ryansking
1048 ryansking #errors 1062 ryansking Line: 1 Col: 8 Unexpected end tag (table). Ignored. 1062 ryansking Line: 1 Col: 16 Unexpected end tag (tbody). Ignored. 1062 ryansking Line: 1 Col: 24 Unexpected end tag (tfoot). Ignored. 1062 ryansking Line: 1 Col: 32 Unexpected end tag (thead). Ignored. 1062 ryansking Line: 1 Col: 37 Unexpected end tag (tr). Ignored. 1062 ryansking Line: 1 Col: 42 Expected closing tag. Unexpected end of file. 1114 hsivonen #document-fragment 1114 hsivonen td 1052 ryansking #document 1052 ryansking |
-- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From hsivonen at iki.fi Wed Apr 2 09:18:30 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Wed, 2 Apr 2008 19:18:30 +0300 Subject: [imps]
Message-ID: <46EDA2B4-781F-4FA2-AF1F-6E20644B4E0B@iki.fi> Data:
Expected: | | | | |
|
|
Got: | | | | | |
|
Expected errors: Line: 1 Col: 33 End tag (form) seen too early. Ignored. Line: 1 Col: 38 Expected closing tag. Unexpected end of file. Actual errors: 33: End tag "form" seen but there were unclosed elements. 39: End of file seen and there were open elements. Can someone, please, explain to me why the test case ignores the </form> tag?
is not scoping per spec, so there is a in scope to close. -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From t.broyer at gmail.com Wed Apr 2 10:49:29 2008 From: t.broyer at gmail.com (Thomas Broyer) Date: Wed, 2 Apr 2008 19:49:29 +0200 Subject: [imps]
In-Reply-To: <46EDA2B4-781F-4FA2-AF1F-6E20644B4E0B@iki.fi> References: <46EDA2B4-781F-4FA2-AF1F-6E20644B4E0B@iki.fi> Message-ID: On Wed, Apr 2, 2008 at 6:18 PM, Henri Sivonen wrote: > Data: >
> Expected: > | > | > | > | > |
> |
> |
> Got: > | > | > | > | > | > |
> |
> Expected errors: > Line: 1 Col: 33 End tag (form) seen too early. Ignored. > Line: 1 Col: 38 Expected closing tag. Unexpected end of file. > Actual errors: > 33: End tag "form" seen but there were unclosed elements. > 39: End of file seen and there were open elements. > > Can someone, please, explain to me why the test case ignores the </form> tag? Because it hasn't been updated and the spec has changed since it was written? It might be this one (judging from the commit log) http://html5.org/tools/web-apps-tracker?from=1319&to=1320 Instead of fixing tests from html5lib's repository, I suggest removing them and contributing new tests in html5's repository http://html5.googlecode.com/svn/trunk/tests/ -- Thomas Broyer From hsivonen at iki.fi Wed Apr 2 11:58:21 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Wed, 2 Apr 2008 21:58:21 +0300 Subject: [imps]
In-Reply-To: References: <46EDA2B4-781F-4FA2-AF1F-6E20644B4E0B@iki.fi> Message-ID: <137E8D25-9227-4620-ABC4-C70884F9DC23@iki.fi> On Apr 2, 2008, at 20:49, Thomas Broyer wrote: > Because it hasn't been updated and the spec changed since it has > been written ? > It might be this one (judging from the commit log) > http://html5.org/tools/web-apps-tracker?from=1319&to=1320 OK. I'll fix the test. > Instead of fixing tests from html5lib's repository, I suggest removing > them and contributing new tests in html5's repository > http://html5.googlecode.com/svn/trunk/tests/ That project has a different license and all. Can't we just keep the tests in the html5lib repo under the current license? (Especially since we even got Dan Connolly to escalate and get the MIT license OKed from the W3C point of view. Also, I'd hate to have to think about relicensing of the corporate contribution written by me.) -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From ian at hixie.ch Wed Apr 2 17:50:23 2008 From: ian at hixie.ch (Ian Hickson) Date: Thu, 3 Apr 2008 00:50:23 +0000 (UTC) Subject: [imps] td fragments In-Reply-To: <011D8708-C1F1-46D8-BED4-D4F0A037297D@iki.fi> References: <011D8708-C1F1-46D8-BED4-D4F0A037297D@iki.fi> Message-ID: On Wed, 2 Apr 2008, Henri Sivonen wrote: > > Can someone, please, explain to me, what part of the spec makes the > insertion mode "in body" when the context is "td" or "th" in the > fragment case? None. It's "in cell". However, the way that the "in cell" case is defined for the fragment case, it turns out (I just noticed) that it is indistinguishable from being "in body". > Specifically, this test case doesn't pass when I implement "reset the > insertion mode" rigorously per spec as far as I can see. > > 1048 ryansking #data > 1048 ryansking
> 1048 ryansking #errors > 1062 ryansking Line: 1 Col: 8 Unexpected end tag (table). Ignored. > 1062 ryansking Line: 1 Col: 16 Unexpected end tag (tbody). Ignored. > 1062 ryansking Line: 1 Col: 24 Unexpected end tag (tfoot). Ignored. > 1062 ryansking Line: 1 Col: 32 Unexpected end tag (thead). Ignored. > 1062 ryansking Line: 1 Col: 37 Unexpected end tag (tr). Ignored. > 1062 ryansking Line: 1 Col: 42 Expected closing tag. Unexpected > end of file. > 1114 hsivonen #document-fragment > 1114 hsivonen td > 1052 ryansking #document > 1052 ryansking |
What part doesn't pass? (i.e. what do you get?) -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' From ian at hixie.ch Wed Apr 2 17:52:31 2008 From: ian at hixie.ch (Ian Hickson) Date: Thu, 3 Apr 2008 00:52:31 +0000 (UTC) Subject: [imps]
In-Reply-To: <46EDA2B4-781F-4FA2-AF1F-6E20644B4E0B@iki.fi> References: <46EDA2B4-781F-4FA2-AF1F-6E20644B4E0B@iki.fi> Message-ID: On Wed, 2 Apr 2008, Henri Sivonen wrote: > > Data: >
> Expected: > | > | > | > | > |
> |
> |
> Got: > | > | > | > | > | > |
> |
> Expected errors: > Line: 1 Col: 33 End tag (form) seen too early. Ignored. > Line: 1 Col: 38 Expected closing tag. Unexpected end of file. > Actual errors: > 33: End tag "form" seen but there were unclosed elements. > 39: End of file seen and there were open elements. > > Can someone, please, explain to me why the test case ignores the </form> tag?
is not scoping per spec, so there is a in scope > to close. This part of the spec changed recently, IIRC. Maybe the test wasn't updated? -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' From hsivonen at iki.fi Thu Apr 3 07:32:06 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Thu, 3 Apr 2008 17:32:06 +0300 Subject: [imps] td fragments In-Reply-To: References: <011D8708-C1F1-46D8-BED4-D4F0A037297D@iki.fi> Message-ID: <7366C75D-473C-4646-AFCF-DB134999A0FE@iki.fi> On Apr 3, 2008, at 03:50, Ian Hickson wrote: > On Wed, 2 Apr 2008, Henri Sivonen wrote: >> >> Can someone, please, explain to me, what part of the spec makes the >> insertion mode "in body" when the context is "td" or "th" in the >> fragment case? > > None. It's "in cell". Why? That is, based on what normative statements in the spec? "3 If node is the first node in the stack of open elements, then set last to true; if, in addition, the context element of the HTML fragment parsing algorithm is neither a td element nor a th element, then set node to the context element. (fragment case)" Now /node/ is "html"--not "td". "5 If node is a td or th element, then switch the insertion mode to "in cell" and abort these steps. " This doesn't match, since it is "html"--not "td". >> Specifically, this test case doesn't pass when I implement "reset the >> insertion mode" rigorously per spec as far as I can see. >> >> 1048 ryansking #data >> 1048 ryansking
>> 1048 ryansking #errors >> 1062 ryansking Line: 1 Col: 8 Unexpected end tag (table). Ignored. >> 1062 ryansking Line: 1 Col: 16 Unexpected end tag (tbody). >> Ignored. >> 1062 ryansking Line: 1 Col: 24 Unexpected end tag (tfoot). >> Ignored. >> 1062 ryansking Line: 1 Col: 32 Unexpected end tag (thead). >> Ignored. >> 1062 ryansking Line: 1 Col: 37 Unexpected end tag (tr). Ignored. >> 1062 ryansking Line: 1 Col: 42 Expected closing tag. Unexpected >> end of file. >> 1114 hsivonen #document-fragment >> 1114 hsivonen td >> 1052 ryansking #document >> 1052 ryansking |
> > What part doesn't pass? (i.e. what do you get?) | | |
-- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From ian at hixie.ch Thu Apr 3 11:34:41 2008 From: ian at hixie.ch (Ian Hickson) Date: Thu, 3 Apr 2008 18:34:41 +0000 (UTC) Subject: [imps] td fragments In-Reply-To: <7366C75D-473C-4646-AFCF-DB134999A0FE@iki.fi> References: <011D8708-C1F1-46D8-BED4-D4F0A037297D@iki.fi> <7366C75D-473C-4646-AFCF-DB134999A0FE@iki.fi> Message-ID: On Thu, 3 Apr 2008, Henri Sivonen wrote: > > > > > > Can someone, please, explain to me, what part of the spec makes the > > > insertion mode "in body" when the context is "td" or "th" in the > > > fragment case? > > > > None. It's "in cell". > > Why? That is, based on what normative statements in the spec? > > "3 If node is the first node in the stack of open elements, then set > last to true; if, in addition, the context element of the HTML fragment > parsing algorithm is neither a td element nor a th element, then set > node to the context element. (fragment case)" > > Now /node/ is "html"--not "td". > > "5 If node is a td or th element, then switch the insertion mode to "in > cell" and abort these steps. " > > This doesn't match, since it is "html"--not "td". Hm, I wonder what that line is doing there. It seems we should remove the "if, in addition, the context element of the HTML fragment parsing algorithm is neither a td element nor a th element" condition, or make that particular situation trigger "in body" rather than "before head". -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' 
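The disagreement above comes down to two quoted steps of "reset the insertion mode", so a small model may help. What follows is a minimal sketch only, assuming tag names are plain strings and that the fragment-case stack of open elements holds just the root "html" element. The function name quoted_steps and the flag drop_td_th_condition are hypothetical and are not taken from the spec or from any implementation. Only steps 3 and 5 as quoted in the thread are modelled; every other step is omitted.

```python
def quoted_steps(context, drop_td_th_condition=False):
    """Model steps 3 and 5 of "reset the insertion mode" as quoted in the thread.

    Returns "in cell" if step 5 matches, or None if the algorithm would fall
    through to steps that are not quoted in the thread (and not modelled here).
    """
    # Hypothetical model: in the fragment case the stack of open elements
    # holds only the root "html" element.
    stack = ["html"]
    node = stack[-1]  # start from the current (bottommost) node

    # Step 3 as quoted: on the first node in the stack, set last to true and,
    # unless the context element is td or th, set node to the context element.
    if node == stack[0]:
        last = True  # recorded for the later steps, which are omitted here
        if drop_td_th_condition or context not in ("td", "th"):
            node = context

    # Step 5 as quoted: a td or th node selects "in cell".
    if node in ("td", "th"):
        return "in cell"

    # Steps not quoted in the thread would run from here.
    return None


print(quoted_steps("td"))
print(quoted_steps("td", drop_td_th_condition=True))
```

With the quoted wording, a td context leaves node as "html", so step 5 never matches and the first call prints None. With the td/th condition dropped, which is the first of the two fixes suggested in the reply above, the second call prints "in cell".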
From hsivonen at iki.fi Fri Apr 4 03:53:29 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Fri, 4 Apr 2008 13:53:29 +0300 Subject: [imps] Emulating the HTML DOM when actually parsed from XML Message-ID: <90924D0E-F13D-42C5-ABD1-82A1ABD62D74@iki.fi> I was thinking that especially with the MathML and SVG additions, it would be great to be able to test the effect of the HTML5 parsing algorithm in current browsers. Since hooking up a parser written in Java or Python into a browser written in C++ is in itself non-trivial, I started considering an HTTP proxy that intercepted text/html and converted it into application/xhtml+xml. (Jetty, Validator.nu parser, Commons HttpClient.) This approach might even work for static pages, as many people already write their selectors in lower case. However, a big part of Web compat is script compat, and the proxy would make browsers put the DOM in the XML mode. Would it be possible to monkeypatch the features listed at http://wiki.whatwg.org/wiki/HtmlVsXhtml#Scripts using JS prototypes if the proxy injected a script into each document? Except for document.write(), of course. Might someone already have done this? Then there's the form pointer issue. With Opera, setting the WF2 form attribute would work, but what about Gecko and WebKit? And then there's the issue that some behavior depends on the character encoding and might break if the document is promoted to UTF-8. Would setting accept-charset on work around this sufficiently? Any ideas if quirks mode CSS and document.write() would make the whole exercise futile? 
-- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From hsivonen at iki.fi Fri Apr 4 06:53:59 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Fri, 4 Apr 2008 16:53:59 +0300 Subject: [imps] Tree construction test in undocumented format Message-ID: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> > 1117 jgraham.html #data > 1117 jgraham.html EN"> > 1117 jgraham.html #errors > 1117 jgraham.html doctype-error > 1117 jgraham.html #document > 1117 jgraham.html | EN" ""> > 1117 jgraham.html | > 1117 jgraham.html | > 1117 jgraham.html | The expected output doesn't follow the documented format: http://wiki.whatwg.org/wiki/Parser_tests I grepped around a bit in the Python code unsuccessfully. It would be nice to have documentation on the wiki. (Particularly around the null, "" and SYSTEM cases.) -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From jg307 at cam.ac.uk Fri Apr 4 08:58:50 2008 From: jg307 at cam.ac.uk (James Graham) Date: Fri, 04 Apr 2008 16:58:50 +0100 Subject: [imps] Tree construction test in undocumented format In-Reply-To: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> Message-ID: <47F6503A.4040309@cam.ac.uk> Henri Sivonen wrote: >> 1117 jgraham.html #data >> 1117 jgraham.html > EN"> >> 1117 jgraham.html #errors >> 1117 jgraham.html doctype-error >> 1117 jgraham.html #document >> 1117 jgraham.html | > EN" ""> >> 1117 jgraham.html | >> 1117 jgraham.html | >> 1117 jgraham.html | Ah, this is indeed my fault > > The expected output doesn't follow the documented format: > http://wiki.whatwg.org/wiki/Parser_tests I don't have time to update the wiki right now (maybe later), but IIRC the format is (using %foo to represent the variable foo) if there is neither a system id or a public ID otherwise This may not be the most sane format ever. -- "Eternity's a terrible thought. I mean, where's it all going to end?" 
-- Tom Stoppard, Rosencrantz and Guildenstern are Dead From annevk at opera.com Fri Apr 4 09:07:46 2008 From: annevk at opera.com (Anne van Kesteren) Date: Fri, 04 Apr 2008 18:07:46 +0200 Subject: [imps] Tree construction test in undocumented format In-Reply-To: <47F6503A.4040309@cam.ac.uk> References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> Message-ID: On Fri, 04 Apr 2008 17:58:50 +0200, James Graham wrote: > if there is neither a system id or a public ID > > otherwise > > This may not be the most sane format ever. Here is my proposal: no public or system ID: public ID, no system ID: no public ID, system ID: public and system ID: (We need to cover all these cases as either the public ID or system ID can be null ("missing").) -- Anne van Kesteren From annevk at opera.com Fri Apr 4 09:11:15 2008 From: annevk at opera.com (Anne van Kesteren) Date: Fri, 04 Apr 2008 18:11:15 +0200 Subject: [imps] Tree construction test in undocumented format In-Reply-To: References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> Message-ID: On Fri, 04 Apr 2008 18:07:46 +0200, Anne van Kesteren wrote: > no public or system ID: > public ID, no system ID: > no public ID, system ID: > public and system ID: Instead of fooling around with spaces alternatively we could put a P or S before the "..." in the case of either a public ID or system ID. Whether or not the document is in quirks mode should probably be something else. Maybe: #document-mode nq|q|lq or something like that. 
-- Anne van Kesteren From hsivonen at iki.fi Fri Apr 4 09:14:13 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Fri, 4 Apr 2008 19:14:13 +0300 Subject: [imps] Tree construction test in undocumented format In-Reply-To: References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> Message-ID: <74FF9010-9885-4905-8FB8-0477E097BDC5@iki.fi> On Apr 4, 2008, at 19:11, Anne van Kesteren wrote: > On Fri, 04 Apr 2008 18:07:46 +0200, Anne van Kesteren > wrote: >> no public or system ID: >> public ID, no system ID: >> no public ID, system ID: >> public and system ID: > > Instead of fooling around with spaces alternatively we could put a P > or S before the "..." in the case of either a public ID or system > ID. Whether or not the document is in quirks mode should probably be > something else. Maybe: > > #document-mode > nq|q|lq > > or something like that. How about where %foo_id is either a double-quoted string or the string 'null' without quotes. But if both are null, use -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From annevk at opera.com Fri Apr 4 09:16:08 2008 From: annevk at opera.com (Anne van Kesteren) Date: Fri, 04 Apr 2008 18:16:08 +0200 Subject: [imps] Tree construction test in undocumented format In-Reply-To: <74FF9010-9885-4905-8FB8-0477E097BDC5@iki.fi> References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> <74FF9010-9885-4905-8FB8-0477E097BDC5@iki.fi> Message-ID: On Fri, 04 Apr 2008 18:14:13 +0200, Henri Sivonen wrote: > How about where %foo_id is > either a double-quoted string or the string 'null' without quotes. > > But if both are null, use > Sure, lets do that. 
-- Anne van Kesteren From jg307 at cam.ac.uk Fri Apr 4 09:44:40 2008 From: jg307 at cam.ac.uk (James Graham) Date: Fri, 04 Apr 2008 17:44:40 +0100 Subject: [imps] Tree construction test in undocumented format In-Reply-To: References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> <74FF9010-9885-4905-8FB8-0477E097BDC5@iki.fi> Message-ID: <47F65AF8.80306@cam.ac.uk> Anne van Kesteren wrote: > On Fri, 04 Apr 2008 18:14:13 +0200, Henri Sivonen wrote: >> How about where %foo_id is >> either a double-quoted string or the string 'null' without quotes. >> >> But if both are null, use >> > > Sure, lets do that. I would prefer not to use the unmatched quote, since some editors insert double quotes automatically. Is there a problem with just doing: If we need to distinguish the null case from the empty string case we could just, as suggested, replace the whole quoted thing with unquoted null in those cases. (the no-public/system case should still be ) -- "Eternity's a terrible thought. I mean, where's it all going to end?" -- Tom Stoppard, Rosencrantz and Guildenstern are Dead From hsivonen at iki.fi Fri Apr 4 11:32:40 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Fri, 4 Apr 2008 21:32:40 +0300 Subject: [imps] Tree construction test in undocumented format In-Reply-To: <47F65AF8.80306@cam.ac.uk> References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> <74FF9010-9885-4905-8FB8-0477E097BDC5@iki.fi> <47F65AF8.80306@cam.ac.uk> Message-ID: <119BD728-9651-49EB-98B1-20456CA4CAF4@iki.fi> On Apr 4, 2008, at 19:44, James Graham wrote: > Is there a problem with just doing: > > > If we need to distinguish the null case from the empty string case > we could just, as suggested, replace the whole quoted thing with > unquoted null in those cases. Oops. That's what I meant. Too many typos lately. 
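For readers following the format discussion, here is a minimal Java sketch of the doctype line being proposed. It assumes the expected-output line has the shape `<!DOCTYPE name "public" "system">` and collapses to `<!DOCTYPE name>` when neither identifier is present. The class and method names (`DoctypeLine`, `format`) are hypothetical and do not belong to html5lib or any existing test harness. Treating an absent identifier as the empty string is an assumption of this sketch, not something settled at this point in the thread.

```java
// Hypothetical helper for illustration only; not part of any real test harness.
final class DoctypeLine {
    private DoctypeLine() {}

    /**
     * Formats a doctype for an expected-tree dump.
     * Assumption: a null identifier is treated the same as an empty string.
     */
    static String format(String name, String publicId, String systemId) {
        String pub = publicId == null ? "" : publicId;
        String sys = systemId == null ? "" : systemId;
        if (pub.isEmpty() && sys.isEmpty()) {
            // No identifiers at all: use the short form.
            return "<!DOCTYPE " + name + ">";
        }
        // Otherwise always emit both quoted strings, so the position alone
        // tells the reader which one is the public ID and which the system ID.
        return "<!DOCTYPE " + name + " \"" + pub + "\" \"" + sys + "\">";
    }

    public static void main(String[] args) {
        System.out.println(format("html", "", ""));
        System.out.println(format("html", "-//W3C//DTD HTML 4.01//EN", ""));
    }
}
```

Always emitting both quoted strings in the long form avoids relying on significant whitespace or unmatched quotes to tell the two identifiers apart.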
-- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From hsivonen at iki.fi Fri Apr 4 11:45:26 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Fri, 4 Apr 2008 21:45:26 +0300 Subject: [imps] Tree construction test in undocumented format In-Reply-To: <119BD728-9651-49EB-98B1-20456CA4CAF4@iki.fi> References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> <74FF9010-9885-4905-8FB8-0477E097BDC5@iki.fi> <47F65AF8.80306@cam.ac.uk> <119BD728-9651-49EB-98B1-20456CA4CAF4@iki.fi> Message-ID: <52FE5396-925C-4581-9860-043CB148E003@iki.fi> On Apr 4, 2008, at 21:32, Henri Sivonen wrote: > On Apr 4, 2008, at 19:44, James Graham wrote: >> Is there a problem with just doing: >> >> >> If we need to distinguish the null case from the empty string case >> we could just, as suggested, replace the whole quoted thing with >> unquoted null in those cases. > > Oops. That's what I meant. Too many typos lately. Even more oops: Neither string can be null per the current spec, so the rule should be unless both are the empty string in which case -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From t.broyer at gmail.com Fri Apr 4 17:24:48 2008 From: t.broyer at gmail.com (Thomas Broyer) Date: Sat, 5 Apr 2008 02:24:48 +0200 Subject: [imps] Tree construction test in undocumented format In-Reply-To: References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> Message-ID: On Fri, Apr 4, 2008 at 6:11 PM, Anne van Kesteren wrote: > On Fri, 04 Apr 2008 18:07:46 +0200, Anne van Kesteren > wrote: > > no public or system ID: > > public ID, no system ID: > > no public ID, system ID: > > public and system ID: > > Instead of fooling around with spaces alternatively we could put a P or S > before the "..." in the case of either a public ID or system ID. Whether > or not the document is in quirks mode should probably be something else. > Maybe: > > #document-mode > nq|q|lq > > or something like that. 
In http://html5.googlecode.com/svn/trunk/tests/tree-construction an XML-like/SGML-like DOCTYPE serialization: - - - - And in http://html5.googlecode.com/svn/trunk/tests/tree-construction/compatibility-mode.dat: #compatibility-mode no quirks ?or? #compatibility-mode quirks ?or? #compatibility-mode limited quirks Note that I've also use for comments instead of the current (note the spaces). -- Thomas Broyer From edwardzyang at thewritingpot.com Fri Apr 4 21:06:42 2008 From: edwardzyang at thewritingpot.com (Edward Z. Yang) Date: Sat, 05 Apr 2008 00:06:42 -0400 Subject: [imps] HTML5 and libxml2 Message-ID: <47F6FAD2.8010105@thewritingpot.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 As per the W3C 5 April 2008 working draft, elements not recognized by HTML5 in body are still added to the DOM using the "A start tag token not covered by the previous entries". HTML5 does not specify any validation mechanism in which to ensure the element has the form stipulated by tag name, i.e. [A-Za-z-]+ Unfortunately, certain tag names causes libxml2 to choke, and HTML5 doesn't specify any way to: 1. Munge the name into something libxml2 finds acceptable 2. Ignore the tag as invalid Without modifying the algorithms, (2) is not tenable, so I've been looking at (1). However, HTML5's tag name stipulations appear to be too restrictive: they do not allow digits as seen in

and friends, and aren't even a subset of the allowed XML tag names (XML specifies that a hyphen cannot lead in a tag name, and allows a greater variety of punctuation and international characters). So, in short, due to underlying library limitations I can't put arbitrary characters in a tag (which is what Firefox actually seems to do), and I don't know exactly what characters I need to get rid of. Advice? [1] http://www.w3.org/html/wg/html5/#tag-name [2] http://www.w3.org/TR/REC-xml/#NT-Name - -- Edward Z. Yang GnuPG: 0x869C48DA HTML Purifier Anti-XSS Filter [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH9vrSqTO+fYacSNoRAvt/AJ494M4fINnrRUAf/GbJgvvjoP6XqgCdGE4a /1CeZKB6aFjfU+CEBzhukXA= =nwJ1 -----END PGP SIGNATURE----- From ian at hixie.ch Fri Apr 4 22:34:24 2008 From: ian at hixie.ch (Ian Hickson) Date: Sat, 5 Apr 2008 05:34:24 +0000 (UTC) Subject: [imps] HTML5 and libxml2 In-Reply-To: <47F6FAD2.8010105@thewritingpot.com> References: <47F6FAD2.8010105@thewritingpot.com> Message-ID: On Sat, 5 Apr 2008, Edward Z. Yang wrote: > > HTML5 does not specify any validation mechanism in which to ensure the > element has the form stipulated by tag name, i.e. [A-Za-z-]+ That (erroneous, as it happens) paragraph is just describing a trend in the spec's tag names, it's not a conformance criteria of any kind. The conformance criteria is really just that the elements in the document have to be the elements defined by the spec. You may find this post helpful in determining how to read the HTML5 spec: http://ln.hixie.ch/?start=1140242962&count=1 > Unfortunately, certain tag names causes libxml2 to choke, and HTML5 > doesn't specify any way to: > > 1. Munge the name into something libxml2 finds acceptable > 2. Ignore the tag as invalid Indeed, both of these behaviours would be non-conforming. Can you change libxml2 to support more characters? 
Is there a real technical reason for the limitation, or is it just enforcing XML requirements? The characters allowed in tag names are by far not the only area where XML and HTML differ, so if it is just a matter of libxml2 enforcing XML's requirements, it will not work well. > So, in short, due to underlying library limitations I can't put > arbitrary characters in a tag (which is what Firefox actually seems to > do), and I don't know exactly what characters I need to get rid of. Advice? If you can't implement what the spec requires, then make sure to document the limitations clearly in your documentation. Meanwhile, you can probably get away with replacing unusable characters with U+FFFD, or at a pinch, "_", so long as you still use the full tag names in the parser to determine which tags are open. However, make sure to document this as being a conformance problem in your documentation. -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' From hsivonen at iki.fi Sat Apr 5 01:10:40 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Sat, 5 Apr 2008 11:10:40 +0300 Subject: [imps] HTML5 and libxml2 In-Reply-To: <47F6FAD2.8010105@thewritingpot.com> References: <47F6FAD2.8010105@thewritingpot.com> Message-ID: On Apr 5, 2008, at 07:06, Edward Z. Yang wrote: > Unfortunately, certain tag names causes libxml2 to choke, and HTML5 > doesn't specify any way to: > > 1. Munge the name into something libxml2 finds acceptable > 2. Ignore the tag as invalid > > Without modifying the algorithms, (2) is not tenable, so I've been > looking at (1). [...] > So, in short, due to underlying library limitations I can't put > arbitrary characters in a tag (which is what Firefox actually seems to > do), and I don't know exactly what characters I need to get rid of. > Advice? 
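Ian Hickson's suggestion above is to replace unusable characters with "_" in the name handed to the DOM, while still using the full tag names in the parser to decide which tags are open. It could be sketched roughly as follows in Java. Everything here is hypothetical: `TagNameMunger`, `munge`, `OpenElement` and `findMatch` are illustrative names, and the character test is a deliberately conservative ASCII-only approximation, not the real XML 1.0 Name production.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch; not taken from any existing parser.
final class TagNameMunger {
    private TagNameMunger() {}

    /** ASCII-only approximation of an XML name start character (assumption). */
    private static boolean isNameStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }

    /** ASCII-only approximation of an XML name character (assumption). */
    private static boolean isNameChar(char c) {
        return isNameStart(c) || (c >= '0' && c <= '9') || c == '-' || c == '.';
    }

    /** Replaces every character the DOM library might reject with '_'. */
    static String munge(String original) {
        StringBuilder sb = new StringBuilder(original.length());
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            boolean ok = (i == 0) ? isNameStart(c) : isNameChar(c);
            sb.append(ok ? c : '_');
        }
        return sb.toString();
    }

    /** Stack entry that remembers both the name as parsed and the munged name. */
    static final class OpenElement {
        final String originalName; // used for all parser decisions
        final String domName;      // the only name given to the DOM library

        OpenElement(String originalName) {
            this.originalName = originalName;
            this.domName = munge(originalName);
        }
    }

    /**
     * Finds the open element an end tag refers to, comparing ORIGINAL names.
     * The stack is expected to be filled with push(), so iteration starts at
     * the most recently opened element.
     */
    static OpenElement findMatch(Deque<OpenElement> stack, String endTagName) {
        for (OpenElement e : stack) {
            if (e.originalName.equals(endTagName)) {
                return e;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Deque<OpenElement> stack = new ArrayDeque<>();
        stack.push(new OpenElement("a@"));
        stack.push(new OpenElement("a#"));
        // Both elements share the munged name, yet the end tag "a@"
        // still resolves to the outer element.
        System.out.println(findMatch(stack, "a@").originalName);
    }
}
```

Matching on the munged name instead would let two distinct tag names that collapse to the same replacement close each other, which is the situation the advice warns about.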
In the Validator.nu HTML parser, I've solved this by having three available policies:

public enum XmlViolationPolicy {
    /**
     * Conform to HTML 5, allow XML 1.0 to be violated.
     */
    ALLOW,
    /**
     * Halt when something cannot be mapped to XML 1.0.
     */
    FATAL,
    /**
     * Be non-conforming and alter the infoset to fit
     * XML 1.0 when something would otherwise not be
     * mappable to XML 1.0.
     */
    ALTER_INFOSET
}

It seems like ALLOW isn't a possibility for libxml2. With ALTER_INFOSET, tag tokens that do not match Namespaces in XML 1.0 NCName are ignored in the tokenizer. This is non-conforming but works most of the time. (There are many more similar situations you can find by searching for ALTER_INFOSET in the source.) -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From annevk at opera.com Sat Apr 5 05:17:21 2008 From: annevk at opera.com (Anne van Kesteren) Date: Sat, 05 Apr 2008 14:17:21 +0200 Subject: [imps] HTML5 parser test location Message-ID: Just an FYI since there seems to be some misinformation about this (for which I might have been responsible, not sure), but hereby a note that we'd like to keep all HTML5 parser tests in the html5lib project tree and not move them all to the html5 project tree. For licensing reasons, because the html5 project owner doesn't like it, and because it just isn't worth the trouble. (If I was unclear about this in the past or have given the impression of supporting the opposite view, my apologies.) 
-- Anne van Kesteren From t.broyer at gmail.com Sat Apr 5 06:22:48 2008 From: t.broyer at gmail.com (Thomas Broyer) Date: Sat, 5 Apr 2008 15:22:48 +0200 Subject: [imps] HTML5 parser test location In-Reply-To: References: Message-ID: On Sat, Apr 5, 2008 at 2:17 PM, Anne van Kesteren wrote: > Just an FYI since there seems to be some misinformation about this (for > which I might have been responsible, not sure), but hereby a note that we'd > like to keep all HTML5 parser tests in the html5lib project tree and not > move them all to the html5 project tree. For licensing reasons, because the > html5 project owner doesn't like it, and because it just isn't worth the > trouble. > > (If I was unclear about this in the past or have given the impression of > supporting the opposite view, my apologies.) FYI, I started the new tests (not really a "move" 'cause I've been rewriting them all from scratch) based on the following thread: http://lists.whatwg.org/pipermail/implementors-whatwg.org/2007-July/000126.html Particularly the following two messages: http://lists.whatwg.org/pipermail/implementors-whatwg.org/2007-July/000127.html http://lists.whatwg.org/pipermail/implementors-whatwg.org/2007-August/000129.html I personally have no preference for any license (MIT or Apache 2) so if people (including those who don't work on html5lib, like Henri Sivonen) prefer the tests to be MIT-licensed, I don't mind. But please note that the tests in the html5 repository are complete rewrites, not copies from html5lib's tests. Now, if Ian changed his mind, and to keep them "implementation-agnostic", how about an html5-tests project? 
(MIT-licensed if that's what implementors want) My main goal with the new tests is to keep them: - independent of any implementation, so that we can keep them in sync with the spec, not the software (see the above thread's first message) - organized wrt the spec, so that when the spec changes it's easier to locate the tests that need to be updated Projects using those tests could specify a particular revision they aim to "pass", in their svn:external import for example; and change the svn:external revision when they update the implementation to follow the spec changes. -- Thomas Broyer From annevk at opera.com Sat Apr 5 08:01:17 2008 From: annevk at opera.com (Anne van Kesteren) Date: Sat, 05 Apr 2008 17:01:17 +0200 Subject: [imps] HTML5 parser test location In-Reply-To: References: Message-ID: On Sat, 05 Apr 2008 15:22:48 +0200, Thomas Broyer wrote: > My main goal with the new tests is to keep them: > - independent of any implementation, so that we can keep them in sync > with the spec, not the software (see the above thread's first message) I thought about this and agree that this is problematic. Given that the main code is more stable maybe we could move to a model where it is not problematic for the trunk code to fail tests that have been reviewed by several sources. To make a new release we basically need to pass all tests we currently fail on trunk instead of trying to develop tests and code side by side. This would remove the need for the tests to be independent of the implementation. If people still prefer stability for their implementation in the html5lib trunk tree they could make a copy of the test suite and merge that every time the test suite changes while updating their implementation. -- Anne van Kesteren From edwardzyang at thewritingpot.com Sat Apr 5 08:04:13 2008 From: edwardzyang at thewritingpot.com (Edward Z. 
Yang) Date: Sat, 05 Apr 2008 11:04:13 -0400 Subject: [imps] HTML5 and libxml2 In-Reply-To: References: <47F6FAD2.8010105@thewritingpot.com> Message-ID: <47F794ED.3090801@thewritingpot.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Ian Hickson wrote: > That (erroneous, as it happens) paragraph is just describing a trend > in the spec's tag names, it's not a conformance criteria of any kind. > Should I submit a patch fixing the error? > The conformance criteria is really just that the elements in the > document have to be the elements defined by the spec. But the spec also defines behavior when elements are outside of the spec, i.e. an error-condition. I'd appreciate it if the allowed tag names is made a normative requirement for such elements. > You may find this post helpful in determining how to read the HTML5 > spec: Thanks. I've heard of RFC2119 before, but I didn't realize that statements of fact that don't use those keywords should not be considered normative. Many of W3C's specs explicitly state which statements are normative and which of informative. > Can you change libxml2 to support more characters? Is there a real > technical reason for the limitation, or is it just enforcing XML > requirements? I've pinged the libxml2 list, should have an answer back soon. > The characters allowed in tag names are by far not the only area > where XML and HTML differ, so if it is just a matter of libxml2 > enforcing XML's requirements, it will not work well. What are these differences explicitly? > If you can't implement what the spec requires, then make sure to > document the limitations clearly in your documentation. Meanwhile, > you can probably get away with replacing unusable characters with > U+FFFD, Unfortunately, U+FFFD is an invalid character too. :-) > or at a pinch, "_", so long as you still use the full tag anems in > the parser to determine which tags are open. However, make sure to > document this as being a conformance problem in your documentation. 
This might be tricky, and it occurs to me that as long as the substitution process works the same for the tags, t becomes t which is equivalent. I will, of course, document it. Henri Sivonen: > With ALTER_INFOSET, tag tokens that do not match Namespaces in XML > 1.0 NCName are ignored in the tokenizer. This is non-conforming but > works most of the time. (There are many more similar situations you > can find by searching for ALTER_INFOSET in the source.) This is what I had been considering with (2), but it looked like I'd have to make multiple modifications in the algorithm to get that to work. I would look at the source, but I can't seem to find it! All I can find is the build script, and I don't have Python. - -- Edward Z. Yang GnuPG: 0x869C48DA HTML Purifier Anti-XSS Filter [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH95TtqTO+fYacSNoRAk+IAJ4gPLXGHbSuAsUQBaO2Fgu4XMm5WQCfbSd/ JAcnZflMEh0uxRbJ2gwww9E= =t+U6 -----END PGP SIGNATURE----- From hsivonen at iki.fi Sat Apr 5 13:06:07 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Sat, 5 Apr 2008 23:06:07 +0300 Subject: [imps] HTML5 and libxml2 In-Reply-To: <47F794ED.3090801@thewritingpot.com> References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> Message-ID: On Apr 5, 2008, at 18:04, Edward Z. Yang wrote: > Henri Sivonen: >> With ALTER_INFOSET, tag tokens that do not match Namespaces in XML >> 1.0 NCName are ignored in the tokenizer. This is non-conforming but >> works most of the time. (There are many more similar situations you >> can find by searching for ALTER_INFOSET in the source.) > > This is what I had been considering with (2), but it looked like I'd > have to make multiple modifications in the algorithm to get that to > work. I would look at the source, but I can't seem to find it! 
> All I can find is the build script, and I don't have Python. The parser source is also in the parser distribution package available from: http://about.validator.nu/htmlparser/ (Currently: http://about.validator.nu/htmlparser/htmlparser-1.0.7.zip ) -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From ian at hixie.ch Sat Apr 5 14:55:57 2008 From: ian at hixie.ch (Ian Hickson) Date: Sat, 5 Apr 2008 21:55:57 +0000 (UTC) Subject: [imps] HTML5 parser test location In-Reply-To: References: Message-ID: On Sat, 5 Apr 2008, Thomas Broyer wrote: > > Now, if Ian changed his mind I'm happy for the html5 project to be used if that's what people want, my only concern is that they already are known to be elsewhere now and I really don't want a mixture of tests in different places, as that will just fragment our progress. Sorry for flipflopping. -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' From ian at hixie.ch Sat Apr 5 15:05:00 2008 From: ian at hixie.ch (Ian Hickson) Date: Sat, 5 Apr 2008 22:05:00 +0000 (UTC) Subject: [imps] HTML5 and libxml2 In-Reply-To: <47F794ED.3090801@thewritingpot.com> References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> Message-ID: On Sat, 5 Apr 2008, Edward Z. Yang wrote: > > Ian Hickson wrote: > > That (erroneous, as it happens) paragraph is just describing a trend > > in the spec's tag names, it's not a conformance criteria of any kind. > > Should I submit a patch fixing the error? I have noted it and will fix it in due course. :-) > > The conformance criteria is really just that the elements in the > > document have to be the elements defined by the spec. > > But the spec also defines behavior when elements are outside of the > spec, i.e. an error-condition. I'd appreciate it if the allowed tag > names is made a normative requirement for such elements. 
The normative requirement for such elements is that they are _all_ invalid, even if they just use a-z characters. The range of characters that can be used by elements that aren't allowed is the empty range. > > The characters allowed in tag names are by far not the only area where > > XML and HTML differ, so if it is just a matter of libxml2 enforcing > > XML's requirements, it will not work well. > > What are these differences explicitly? Well, for example, an XML comment cannot contain the string "--". > > If you can't implement what the spec requires, then make sure to > > document the limitations clearly in your documentation. Meanwhile, you > > can probably get away with replacing unusable characters with U+FFFD, > > or at a pinch, "_", so long as you still use the full tag names in the > > parser to determine which tags are open. However, make sure to > > document this as being a conformance problem in your documentation. > > This might be tricky, and it occurs to me that as long as the > substitution process works the same for the tags, t becomes > t which is equivalent. I will, of course, document it. What I meant is make sure that your code handles: X ...as creating a DOM tree where the third tag above closes the first one, not the second one. i.e. in your parser and the stack of elements you should keep the original tag names, and only give the munged tag names to the DOM tree. HTH, -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' From edwardzyang at thewritingpot.com Sat Apr 5 15:48:30 2008 From: edwardzyang at thewritingpot.com (Edward Z. 
Yang) Date: Sat, 05 Apr 2008 18:48:30 -0400 Subject: [imps] HTML5 and libxml2 In-Reply-To: References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> Message-ID: <47F801BE.2050404@thewritingpot.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 As an informative note, the tag limitation is with PHP's DOM extension and not libxml2. I'm probably going to do an implementation similar to Validator.nu's, although that's given that I'm interested enough in making major architectural changes to the unmaintained PH5P. Ian Hickson wrote: > The normative requirement for such elements is that they are _all_ > invalid, even if they just use a-z characters. The range of characters > that can be used by elements that aren't allowed is the empty range. I understand this; however, since HTML has graceful error handling, even though such elements are invalid we should still have well-defined handling for them. Which, I suppose, it does. :-) > Well for example an XML comment cannot contain the string "--". I took a look at the source code for Validator.nu and all the differences are there. > What I meant is make sure that you code handles: > > X > > ...as creating a DOM tree where the third tag above closes the first one, > not the second one. i.e. in your parser and the stack of elements you > should keep the original tag names, and only give the munged tag names to > the the DOM tree. Duly noted. - -- Edward Z. 
Yang GnuPG: 0x869C48DA
HTML Purifier Anti-XSS Filter
[[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH+AG+qTO+fYacSNoRAjXNAJ4jDKNJ/tqPCGe6px+mWAY8yK/+nQCcD85N
JsMyioGbTvC3OYdVnAPrus4=
=e5ES
-----END PGP SIGNATURE-----

From foolistbar at googlemail.com Sun Apr 6 04:32:42 2008
From: foolistbar at googlemail.com (Geoffrey Sneddon)
Date: Sun, 6 Apr 2008 12:32:42 +0100
Subject: [imps] HTML5 and libxml2
In-Reply-To: <47F801BE.2050404@thewritingpot.com>
References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> <47F801BE.2050404@thewritingpot.com>
Message-ID:

On 5 Apr 2008, at 23:48, Edward Z. Yang wrote:

> As an informative note, the tag limitation is with PHP's DOM extension
> and not libxml2. I'm probably going to do an implementation similar to
> Validator.nu's, although that's given that I'm interested enough in
> making major architectural changes to the unmaintained PH5P.

Is there a bug report in the PHP bug database? This most certainly is a
violation of the DOM specification.

--
Geoffrey Sneddon

From edwardzyang at thewritingpot.com Sun Apr 6 06:08:55 2008
From: edwardzyang at thewritingpot.com (Edward Z. Yang)
Date: Sun, 06 Apr 2008 09:08:55 -0400
Subject: [imps] HTML5 and libxml2
In-Reply-To:
References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> <47F801BE.2050404@thewritingpot.com>
Message-ID: <47F8CB67.6010905@thewritingpot.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Geoffrey Sneddon wrote:
> Is there a bug report in the PHP bug database? This most certainly is a
> violation of the DOM specification.

http://bugs.php.net/bug.php?id=44648

Although, the DOM specification clearly states that an
INVALID_CHARACTER_ERROR should be thrown when the tag name is "invalid".

- --
Edward Z. Yang GnuPG: 0x869C48DA
HTML Purifier Anti-XSS Filter
[[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH+MtmqTO+fYacSNoRAg6+AJ4vnReEXBH9eOGVMXhszYOgSFGBmQCfWYOs
NFG5AmZu5qgzeM6aThWBaDA=
=JrVQ
-----END PGP SIGNATURE-----

From foolistbar at googlemail.com Sun Apr 6 06:57:16 2008
From: foolistbar at googlemail.com (Geoffrey Sneddon)
Date: Sun, 6 Apr 2008 14:57:16 +0100
Subject: [imps] HTML5 and libxml2
In-Reply-To: <47F8CB67.6010905@thewritingpot.com>
References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> <47F801BE.2050404@thewritingpot.com> <47F8CB67.6010905@thewritingpot.com>
Message-ID:

On 6 Apr 2008, at 14:08, Edward Z. Yang wrote:

> Geoffrey Sneddon wrote:
>> Is there a bug report in the PHP bug database? This most certainly
>> is a
>> violation of the DOM specification.
>
> http://bugs.php.net/bug.php?id=44648
>
> Although, the DOM specification clearly states that an
> INVALID_CHARACTER_ERROR should be thrown when the tag name is
> "invalid".

What happens when you set DOMDocument::$strictErrorChecking to false? In
the DOM spec, behaviour then is undefined.

--
Geoffrey Sneddon

From edwardzyang at thewritingpot.com Sun Apr 6 10:16:02 2008
From: edwardzyang at thewritingpot.com (Edward Z. Yang)
Date: Sun, 06 Apr 2008 13:16:02 -0400
Subject: [imps] HTML5 and libxml2
In-Reply-To:
References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> <47F801BE.2050404@thewritingpot.com> <47F8CB67.6010905@thewritingpot.com>
Message-ID: <47F90552.2080905@thewritingpot.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Geoffrey Sneddon wrote:
> What happens when you set DOMDocument::$strictErrorChecking to false? In
> the DOM spec, behaviour then is undefined.

This is a proprietary extension, and defines whether or not PHP should
throw actual Exceptions with DOM errors, or emit warnings. The behavior
with DOM, provided the exception is properly caught, remains the same,
as after the C code invokes the exception or emits the error, control is
passed back to the PHP interpreter.

(sorry about the dupe; accidentally made an off-list post)

- --
Edward Z. Yang GnuPG: 0x869C48DA
HTML Purifier Anti-XSS Filter
[[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH+QVSqTO+fYacSNoRAqZnAJ4hmwcMZNLNW/sPEOs21uKiR73wAQCfdBcJ
bmv2RYCgt2wjWwd4vDEZ3jY=
=ris/
-----END PGP SIGNATURE-----

From foolistbar at googlemail.com Sun Apr 6 10:36:56 2008
From: foolistbar at googlemail.com (Geoffrey Sneddon)
Date: Sun, 6 Apr 2008 18:36:56 +0100
Subject: [imps] HTML5 and libxml2
In-Reply-To: <47F90552.2080905@thewritingpot.com>
References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> <47F801BE.2050404@thewritingpot.com> <47F8CB67.6010905@thewritingpot.com> <47F90552.2080905@thewritingpot.com>
Message-ID: <0CAB6556-942F-42FC-80B4-50B1AD2819B0@googlemail.com>

On 6 Apr 2008, at 18:16, Edward Z. Yang wrote:

> Geoffrey Sneddon wrote:
>> What happens when you set DOMDocument::$strictErrorChecking to
>> false? In
>> the DOM spec, behaviour then is undefined.
>
> This is a proprietary extension, and defines whether or not PHP should
> throw actual Exceptions with DOM errors, or emit warnings.

It's not proprietary: it's part of DOM Level 3 Core, as PHP claims to
implement (see ).

> The behavior
> with DOM, provided the exception is properly caught, remains the same,
> as after the C code invokes the exception or emits the error,
> control is
> passed back to the PHP interpreter.

Ah. So therefore it doesn't actually allow the DOM to hold characters it
otherwise wouldn't (like a a@ localName).

--
Geoffrey Sneddon

From edwardzyang at thewritingpot.com Sun Apr 6 10:47:46 2008
From: edwardzyang at thewritingpot.com (Edward Z. Yang)
Date: Sun, 06 Apr 2008 13:47:46 -0400
Subject: [imps] HTML5 and libxml2
In-Reply-To: <0CAB6556-942F-42FC-80B4-50B1AD2819B0@googlemail.com>
References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> <47F801BE.2050404@thewritingpot.com> <47F8CB67.6010905@thewritingpot.com> <47F90552.2080905@thewritingpot.com> <0CAB6556-942F-42FC-80B4-50B1AD2819B0@googlemail.com>
Message-ID: <47F90CC2.4050105@thewritingpot.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Geoffrey Sneddon wrote:
> It's not proprietary: it's part of DOM Level 3 Core, as PHP claims to
> implement (see
> ).

You're right---I was looking at DOM Level 2 Core.

> Ah. So therefore it doesn't actually allow the DOM to hold characters it
> otherwise wouldn't (like a a@ localName).

Precisely. And since the behavior is undefined, the PHP developers are
free to implement this however they want. To make these errors not stop
execution without strictErrorChecking, one would probably have to
macro-fy php_dom_throw_error to include the appropriate return values,
and then remove any trailing RETURN_* macro-calls... which doesn't
really seem worth it for them, although that makes strictErrorChecking
slightly useless. :-)

It also makes me slightly worried about cases where libxml2 *requires*
that the validation is done (for example, libxml2 does not appear to be
binary safe, whereas PHP strings are).

- --
Edward Z. Yang GnuPG: 0x869C48DA
HTML Purifier Anti-XSS Filter
[[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH+QzCqTO+fYacSNoRAoELAJwLG3NzFvvkXZK0fCgbVyjsvp+V6wCfYeDv
a/6q8ySgwJ2TfjpKQSxSFr0=
=/16o
-----END PGP SIGNATURE-----

From jg307 at cam.ac.uk Mon Apr 7 07:36:53 2008
From: jg307 at cam.ac.uk (James Graham)
Date: Mon, 07 Apr 2008 15:36:53 +0100
Subject: [imps] HTML5 parser test location
In-Reply-To:
References:
Message-ID: <47FA3185.9030302@cam.ac.uk>

Thomas Broyer wrote:
> On Sat, Apr 5, 2008 at 2:17 PM, Anne van Kesteren wrote:
>> Just an FYI since there seems to be some misinformation about this (for
>> which I might have been responsible, not sure), but hereby a note that we'd
>> like to keep all HTML5 parser tests in the html5lib project tree and not
>> move them all to the html5 project tree. For licensing reasons, because the
>> html5 project owner doesn't like it, and because it just isn't worth the
>> trouble.
>>
>> (If I was unclear about this in the past or have given the impression of
>> supporting the opposite view, my apologies.)
>
> FYI, I started the new tests (not really a "move" 'cause I've been
> rewriting them all from scratch) based on the following thread:
> http://lists.whatwg.org/pipermail/implementors-whatwg.org/2007-July/000126.html
> Particularly the following two messages:
> http://lists.whatwg.org/pipermail/implementors-whatwg.org/2007-July/000127.html
> http://lists.whatwg.org/pipermail/implementors-whatwg.org/2007-August/000129.html

My view on this is:

* The tests must remain MIT licensed. Therefore they can't go in the
html5 repository

* I'm -0 on moving the tests at all. That is, I'm not strongly against
it but I'm not sure it represents a good investment of time.

* If we do decide to move tests, we should consider the advantages of a
distributed version control system like Mercurial. It seems like a
situation where slightly disjoint groups of people are all editing a
common set of files might play to the strengths of those systems.

* I am totally against rewriting tests. The current tests have often
been written in response to actual regressions in software. Throwing
away all that knowledge of fragile points in the various implementations
is unacceptable. Adding extra tests is of course fine.

* One of the identified problems with the current test suite is that it
is hard to determine which tests need to change when the spec changes.
There are various ways to improve this without starting over.
Specifically it is not hard to instrument html5lib to monitor which
phase it is in at a given time. One can imagine using this to
automatically identify testcases that cause the parser to go through
altered parts of the spec.

--
"Eternity's a terrible thought. I mean, where's it all going to end?"
 -- Tom Stoppard, Rosencrantz and Guildenstern are Dead

From ryan at theryanking.com Mon Apr 7 12:49:18 2008
From: ryan at theryanking.com (Ryan King)
Date: Mon, 7 Apr 2008 12:49:18 -0700
Subject: [imps] HTML5 parser test location
In-Reply-To: <47FA3185.9030302@cam.ac.uk>
References: <47FA3185.9030302@cam.ac.uk>
Message-ID: <2640F92A-964E-494A-8574-EC557FCC4702@theryanking.com>

On Apr 7, 2008, at 7:36 AM, James Graham wrote:

> * I am totally against rewriting tests. The current tests have often
> been
> written in response to actual regressions in software. Throwing away
> all that
> knowledge of fragile points in the various implementations is
> unacceptable.
> Adding extra tests is of course fine.

I want to second this point. Almost all the tests I've written have
been to deal with issues discovered in the ruby implementation.
-ryan

From rubys at intertwingly.net Mon Apr 7 18:14:29 2008
From: rubys at intertwingly.net (Sam Ruby)
Date: Mon, 07 Apr 2008 21:14:29 -0400
Subject: [imps] HTML5 parser test location
In-Reply-To: <2640F92A-964E-494A-8574-EC557FCC4702@theryanking.com>
References: <47FA3185.9030302@cam.ac.uk> <2640F92A-964E-494A-8574-EC557FCC4702@theryanking.com>
Message-ID: <47FAC6F5.6010809@intertwingly.net>

Ryan King wrote:
> On Apr 7, 2008, at 7:36 AM, James Graham wrote:
>> * I am totally against rewriting tests. The current tests have often
>> been
>> written in response to actual regressions in software. Throwing away
>> all that
>> knowledge of fragile points in the various implementations is
>> unacceptable.
>> Adding extra tests is of course fine.
>
> I want to second this point. Almost all the tests I've written have
> been to deal with issues discovered in the ruby implementation.

Is there some way we can segregate the tests into ones that we expect to
pass and ones that we (currently) don't expect to pass?

- Sam Ruby

From edwardzyang at thewritingpot.com Mon Apr 7 15:55:29 2008
From: edwardzyang at thewritingpot.com (Edward Z. Yang)
Date: Mon, 07 Apr 2008 18:55:29 -0400
Subject: [imps] HTML5 parser test location
In-Reply-To: <47FAC6F5.6010809@intertwingly.net>
References: <47FA3185.9030302@cam.ac.uk> <2640F92A-964E-494A-8574-EC557FCC4702@theryanking.com> <47FAC6F5.6010809@intertwingly.net>
Message-ID: <47FAA661.5000207@thewritingpot.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Sam Ruby wrote:
> Is there some way we can segregate the tests into ones that we expect to
> pass and ones that we (currently) don't expect to pass?

How about an extra boolean flag in the test structs? This has the added
benefit of not needing to shuffle tests around once they do start
passing, and also being able to test the inverse: whether or not the
tests we don't expect to pass are failing.

- --
Edward Z. Yang GnuPG: 0x869C48DA
HTML Purifier Anti-XSS Filter
[[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD4DBQFH+qZhqTO+fYacSNoRArtkAJ9F5kNuoLaqjNhWed5Stb12+Eq0tACXXidx
xAavEyPMOYGmhqLBy16ORA==
=skuR
-----END PGP SIGNATURE-----

From edwardzyang at thewritingpot.com Mon Apr 7 16:17:39 2008
From: edwardzyang at thewritingpot.com (Edward Z. Yang)
Date: Mon, 07 Apr 2008 19:17:39 -0400
Subject: [imps] HTML5 parser test location
In-Reply-To: <47FAA7DC.9060201@cam.ac.uk>
References: <47FA3185.9030302@cam.ac.uk> <2640F92A-964E-494A-8574-EC557FCC4702@theryanking.com> <47FAC6F5.6010809@intertwingly.net> <47FAA661.5000207@thewritingpot.com> <47FAA7DC.9060201@cam.ac.uk>
Message-ID: <47FAAB93.8000400@thewritingpot.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

James Graham wrote:
> That doesn't work because multiple people are using the tests and their
> implementations may be at different levels of conformance.
>
> We'd have to have out-of-band metadata like a list of tests to skip.

I don't think that was Sam Ruby's intent; I took it to mean tests for
which the specification itself is faulty. If it has to do with differing
levels of conformance, we're SOL.

- --
Edward Z.
Yang GnuPG: 0x869C48DA HTML Purifier Anti-XSS Filter [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH+quTqTO+fYacSNoRAh+sAJ9LdyrcdCu1ho9Ec1o3pVdM3volJQCdHIx4 TXAURwb97EcQ/Zpt6LL4N9Q= =+7hX -----END PGP SIGNATURE----- From hsivonen at iki.fi Wed Apr 2 07:58:06 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Wed, 2 Apr 2008 17:58:06 +0300 Subject: [imps] td fragments Message-ID: <011D8708-C1F1-46D8-BED4-D4F0A037297D@iki.fi> Can someone, please, explain to me, what part of the spec makes the insertion mode "in body" when the context is "td" or "th" in the fragment case? Specifically, this test case doesn't pass when I implement "reset the insertion mode" rigorously per spec as far as I can see. 1048 ryansking #data 1048 ryansking
1048 ryansking #errors 1062 ryansking Line: 1 Col: 8 Unexpected end tag (table). Ignored. 1062 ryansking Line: 1 Col: 16 Unexpected end tag (tbody). Ignored. 1062 ryansking Line: 1 Col: 24 Unexpected end tag (tfoot). Ignored. 1062 ryansking Line: 1 Col: 32 Unexpected end tag (thead). Ignored. 1062 ryansking Line: 1 Col: 37 Unexpected end tag (tr). Ignored. 1062 ryansking Line: 1 Col: 42 Expected closing tag. Unexpected end of file. 1114 hsivonen #document-fragment 1114 hsivonen td 1052 ryansking #document 1052 ryansking |
-- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From hsivonen at iki.fi Wed Apr 2 09:18:30 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Wed, 2 Apr 2008 19:18:30 +0300 Subject: [imps]
Message-ID: <46EDA2B4-781F-4FA2-AF1F-6E20644B4E0B@iki.fi> Data:
Expected: | | | | |
|
|
Got: | | | | | |
|
Expected errors: Line: 1 Col: 33 End tag (form) seen too early. Ignored. Line: 1 Col: 38 Expected closing tag. Unexpected end of file. Actual errors: 33: End tag ?form? seen but there were unclosed elements. 39: End of file seen and there were open elements. Can someone, please, explain to me, we the test case ignores the tag?
is not scoping per spec, so there is a in scope to close. -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From t.broyer at gmail.com Wed Apr 2 10:49:29 2008 From: t.broyer at gmail.com (Thomas Broyer) Date: Wed, 2 Apr 2008 19:49:29 +0200 Subject: [imps]
In-Reply-To: <46EDA2B4-781F-4FA2-AF1F-6E20644B4E0B@iki.fi> References: <46EDA2B4-781F-4FA2-AF1F-6E20644B4E0B@iki.fi> Message-ID: On Wed, Apr 2, 2008 at 6:18 PM, Henri Sivonen wrote: > Data: >
> Expected: > | > | > | > | > |
> |
> |
> Got: > | > | > | > | > | > |
> |
> Expected errors: > Line: 1 Col: 33 End tag (form) seen too early. Ignored. > Line: 1 Col: 38 Expected closing tag. Unexpected end of file. > Actual errors: > 33: End tag "form" seen but there were unclosed elements. > 39: End of file seen and there were open elements. > > Can someone, please, explain to me, we the test case ignores the form> tag? Because it hasn't been updated and the spec changed since it has been written ? It might be this one (judging from the commit log) http://html5.org/tools/web-apps-tracker?from=1319&to=1320 Instead of fixing tests from html5lib's repository, I suggest removing them and contributing new tests in html5's repository http://html5.googlecode.com/svn/trunk/tests/ -- Thomas Broyer From hsivonen at iki.fi Wed Apr 2 11:58:21 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Wed, 2 Apr 2008 21:58:21 +0300 Subject: [imps]
In-Reply-To: References: <46EDA2B4-781F-4FA2-AF1F-6E20644B4E0B@iki.fi> Message-ID: <137E8D25-9227-4620-ABC4-C70884F9DC23@iki.fi> On Apr 2, 2008, at 20:49, Thomas Broyer wrote: > Because it hasn't been updated and the spec changed since it has > been written ? > It might be this one (judging from the commit log) > http://html5.org/tools/web-apps-tracker?from=1319&to=1320 OK. I'll fix the test. > Instead of fixing tests from html5lib's repository, I suggest removing > them and contributing new tests in html5's repository > http://html5.googlecode.com/svn/trunk/tests/ That project has a different license and all. Can't we just keep the tests in the html5lib repo under the current license? (Especially since we even got Dan Connolly to escalate and get the MIT license OKed from the W3C point of view. Also, I'd hate to have to think about relicensing of the corporate contribution written by me.) -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From ian at hixie.ch Wed Apr 2 17:50:23 2008 From: ian at hixie.ch (Ian Hickson) Date: Thu, 3 Apr 2008 00:50:23 +0000 (UTC) Subject: [imps] td fragments In-Reply-To: <011D8708-C1F1-46D8-BED4-D4F0A037297D@iki.fi> References: <011D8708-C1F1-46D8-BED4-D4F0A037297D@iki.fi> Message-ID: On Wed, 2 Apr 2008, Henri Sivonen wrote: > > Can someone, please, explain to me, what part of the spec makes the > insertion mode "in body" when the context is "td" or "th" in the > fragment case? None. It's "in cell". However, the way that the "in cell" case is defined for the fragment case, it turns out (I just noticed) that it is indistinguishable from being "in body". > Specifically, this test case doesn't pass when I implement "reset the > insertion mode" rigorously per spec as far as I can see. > > 1048 ryansking #data > 1048 ryansking
> 1048 ryansking #errors > 1062 ryansking Line: 1 Col: 8 Unexpected end tag (table). Ignored. > 1062 ryansking Line: 1 Col: 16 Unexpected end tag (tbody). Ignored. > 1062 ryansking Line: 1 Col: 24 Unexpected end tag (tfoot). Ignored. > 1062 ryansking Line: 1 Col: 32 Unexpected end tag (thead). Ignored. > 1062 ryansking Line: 1 Col: 37 Unexpected end tag (tr). Ignored. > 1062 ryansking Line: 1 Col: 42 Expected closing tag. Unexpected > end of file. > 1114 hsivonen #document-fragment > 1114 hsivonen td > 1052 ryansking #document > 1052 ryansking |
What part doesn't pass? (i.e. what do you get?) -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' From ian at hixie.ch Wed Apr 2 17:52:31 2008 From: ian at hixie.ch (Ian Hickson) Date: Thu, 3 Apr 2008 00:52:31 +0000 (UTC) Subject: [imps]
In-Reply-To: <46EDA2B4-781F-4FA2-AF1F-6E20644B4E0B@iki.fi> References: <46EDA2B4-781F-4FA2-AF1F-6E20644B4E0B@iki.fi> Message-ID: On Wed, 2 Apr 2008, Henri Sivonen wrote: > > Data: >
> Expected: > | > | > | > | > |
> |
> |
> Got: > | > | > | > | > | > |
> |
> Expected errors: > Line: 1 Col: 33 End tag (form) seen too early. Ignored. > Line: 1 Col: 38 Expected closing tag. Unexpected end of file. > Actual errors: > 33: End tag ?form? seen but there were unclosed elements. > 39: End of file seen and there were open elements. > > Can someone, please, explain to me, we the test case ignores the form> tag?
is not scoping per spec, so there is a in scope > to close. This part of the spec changed recently, IIRC. Maybe the test wasn't updated? -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' From hsivonen at iki.fi Thu Apr 3 07:32:06 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Thu, 3 Apr 2008 17:32:06 +0300 Subject: [imps] td fragments In-Reply-To: References: <011D8708-C1F1-46D8-BED4-D4F0A037297D@iki.fi> Message-ID: <7366C75D-473C-4646-AFCF-DB134999A0FE@iki.fi> On Apr 3, 2008, at 03:50, Ian Hickson wrote: > On Wed, 2 Apr 2008, Henri Sivonen wrote: >> >> Can someone, please, explain to me, what part of the spec makes the >> insertion mode "in body" when the context is "td" or "th" in the >> fragment case? > > None. It's "in cell". Why? That is, based on what normative statements in the spec? "3 If node is the first node in the stack of open elements, then set last to true; if, in addition, the context element of the HTML fragment parsing algorithm is neither a td element nor a th element, then set node to the context element. (fragment case)" Now /node/ is "html"--not "td". "5 If node is a td or th element, then switch the insertion mode to "in cell" and abort these steps. " This doesn't match, since it is "html"--not "td". >> Specifically, this test case doesn't pass when I implement "reset the >> insertion mode" rigorously per spec as far as I can see. >> >> 1048 ryansking #data >> 1048 ryansking
>> 1048 ryansking #errors >> 1062 ryansking Line: 1 Col: 8 Unexpected end tag (table). Ignored. >> 1062 ryansking Line: 1 Col: 16 Unexpected end tag (tbody). >> Ignored. >> 1062 ryansking Line: 1 Col: 24 Unexpected end tag (tfoot). >> Ignored. >> 1062 ryansking Line: 1 Col: 32 Unexpected end tag (thead). >> Ignored. >> 1062 ryansking Line: 1 Col: 37 Unexpected end tag (tr). Ignored. >> 1062 ryansking Line: 1 Col: 42 Expected closing tag. Unexpected >> end of file. >> 1114 hsivonen #document-fragment >> 1114 hsivonen td >> 1052 ryansking #document >> 1052 ryansking |
> > What part doesn't pass? (i.e. what do you get?) | | |
-- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From ian at hixie.ch Thu Apr 3 11:34:41 2008 From: ian at hixie.ch (Ian Hickson) Date: Thu, 3 Apr 2008 18:34:41 +0000 (UTC) Subject: [imps] td fragments In-Reply-To: <7366C75D-473C-4646-AFCF-DB134999A0FE@iki.fi> References: <011D8708-C1F1-46D8-BED4-D4F0A037297D@iki.fi> <7366C75D-473C-4646-AFCF-DB134999A0FE@iki.fi> Message-ID: On Thu, 3 Apr 2008, Henri Sivonen wrote: > > > > > > Can someone, please, explain to me, what part of the spec makes the > > > insertion mode "in body" when the context is "td" or "th" in the > > > fragment case? > > > > None. It's "in cell". > > Why? That is, based on what normative statements in the spec? > > "3 If node is the first node in the stack of open elements, then set > last to true; if, in addition, the context element of the HTML fragment > parsing algorithm is neither a td element nor a th element, then set > node to the context element. (fragment case)" > > Now /node/ is "html"--not "td". > > "5 If node is a td or th element, then switch the insertion mode to "in > cell" and abort these steps. " > > This doesn't match, since it is "html"--not "td". Hm, I wonder what that line is doing there. It seems we should remove the "if, in addition, the context element of the HTML fragment parsing algorithm is neither a td element nor a th element" condition, or make that particular situation trigger "in body" rather than "before head". -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' 
From hsivonen at iki.fi Fri Apr 4 03:53:29 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Fri, 4 Apr 2008 13:53:29 +0300 Subject: [imps] Emulating the HTML DOM when actually parsed from XML Message-ID: <90924D0E-F13D-42C5-ABD1-82A1ABD62D74@iki.fi> I was thinking that especially with the MathML and SVG additions, it would be great to be able to test effect of the HTML5 parsing algorithm in current browsers. Since hooking up a parser written in Java or Python into a browser written in C++ is in itself non-trivial, I started considering an HTTP proxy that intercepted text/html and converted into application/xhtml+xml. (Jetty, Validator.nu parser, Commons HttpClient.) This approach might even work for static pages, as many people already write their selectors in lower case. However, a bit part of Web compat is script compat, and the proxy would make browsers put the DOM in the XML mode. Would it be possible to monkeypatch the features listed at http://wiki.whatwg.org/wiki/HtmlVsXhtml#Scripts using JS prototypes if the proxy injected a script into each document? Except for document.write(), of course. Might someone already have done this? Then there's the form pointer issue. With Opera, setting the WF2 form attribute would work, but what about Gecko and WebKit? And then there's the issue that some behavior depends on the character encoding and might break if the document is promoted to UTF-8. Would setting accept-charset on work around this sufficiently? Any ideas if quirks mode CSS and document.write() would make the whole exercise futile? 
-- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From hsivonen at iki.fi Fri Apr 4 06:53:59 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Fri, 4 Apr 2008 16:53:59 +0300 Subject: [imps] Tree construction test in undocumented format Message-ID: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> > 1117 jgraham.html #data > 1117 jgraham.html EN"> > 1117 jgraham.html #errors > 1117 jgraham.html doctype-error > 1117 jgraham.html #document > 1117 jgraham.html | EN" ""> > 1117 jgraham.html | > 1117 jgraham.html | > 1117 jgraham.html | The expected output doesn't follow the documented format: http://wiki.whatwg.org/wiki/Parser_tests I grepped around a bit in the Python code unsuccessfully. It would be nice to have documentation on the wiki. (Particularly around the null, "" and SYSTEM cases.) -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From jg307 at cam.ac.uk Fri Apr 4 08:58:50 2008 From: jg307 at cam.ac.uk (James Graham) Date: Fri, 04 Apr 2008 16:58:50 +0100 Subject: [imps] Tree construction test in undocumented format In-Reply-To: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> Message-ID: <47F6503A.4040309@cam.ac.uk> Henri Sivonen wrote: >> 1117 jgraham.html #data >> 1117 jgraham.html > EN"> >> 1117 jgraham.html #errors >> 1117 jgraham.html doctype-error >> 1117 jgraham.html #document >> 1117 jgraham.html | > EN" ""> >> 1117 jgraham.html | >> 1117 jgraham.html | >> 1117 jgraham.html | Ah, this is indeed my fault > > The expected output doesn't follow the documented format: > http://wiki.whatwg.org/wiki/Parser_tests I don't have time to update the wiki right now (maybe later), but IIRC the format is (using %foo to represent the variable foo) if there is neither a system id or a public ID otherwise This may not be the most sane format ever. -- "Eternity's a terrible thought. I mean, where's it all going to end?" 
-- Tom Stoppard, Rosencrantz and Guildenstern are Dead From annevk at opera.com Fri Apr 4 09:07:46 2008 From: annevk at opera.com (Anne van Kesteren) Date: Fri, 04 Apr 2008 18:07:46 +0200 Subject: [imps] Tree construction test in undocumented format In-Reply-To: <47F6503A.4040309@cam.ac.uk> References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> Message-ID: On Fri, 04 Apr 2008 17:58:50 +0200, James Graham wrote: > if there is neither a system id or a public ID > > otherwise > > This may not be the most sane format ever. Here is my proposal: no public or system ID: public ID, no system ID: no public ID, system ID: public and system ID: (We need to cover all these cases as either the public ID or system ID can be null ("missing").) -- Anne van Kesteren From annevk at opera.com Fri Apr 4 09:11:15 2008 From: annevk at opera.com (Anne van Kesteren) Date: Fri, 04 Apr 2008 18:11:15 +0200 Subject: [imps] Tree construction test in undocumented format In-Reply-To: References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> Message-ID: On Fri, 04 Apr 2008 18:07:46 +0200, Anne van Kesteren wrote: > no public or system ID: > public ID, no system ID: > no public ID, system ID: > public and system ID: Instead of fooling around with spaces alternatively we could put a P or S before the "..." in the case of either a public ID or system ID. Whether or not the document is in quirks mode should probably be something else. Maybe: #document-mode nq|q|lq or something like that. 
-- Anne van Kesteren From hsivonen at iki.fi Fri Apr 4 09:14:13 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Fri, 4 Apr 2008 19:14:13 +0300 Subject: [imps] Tree construction test in undocumented format In-Reply-To: References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> Message-ID: <74FF9010-9885-4905-8FB8-0477E097BDC5@iki.fi> On Apr 4, 2008, at 19:11, Anne van Kesteren wrote: > On Fri, 04 Apr 2008 18:07:46 +0200, Anne van Kesteren > wrote: >> no public or system ID: >> public ID, no system ID: >> no public ID, system ID: >> public and system ID: > > Instead of fooling around with spaces alternatively we could put a P > or S before the "..." in the case of either a public ID or system > ID. Whether or not the document is in quirks mode should probably be > something else. Maybe: > > #document-mode > nq|q|lq > > or something like that. How about where %foo_id is either a double-quoted string or the string 'null' without quotes. But if both are null, use -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From annevk at opera.com Fri Apr 4 09:16:08 2008 From: annevk at opera.com (Anne van Kesteren) Date: Fri, 04 Apr 2008 18:16:08 +0200 Subject: [imps] Tree construction test in undocumented format In-Reply-To: <74FF9010-9885-4905-8FB8-0477E097BDC5@iki.fi> References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> <74FF9010-9885-4905-8FB8-0477E097BDC5@iki.fi> Message-ID: On Fri, 04 Apr 2008 18:14:13 +0200, Henri Sivonen wrote: > How about where %foo_id is > either a double-quoted string or the string 'null' without quotes. > > But if both are null, use > Sure, lets do that. 
-- Anne van Kesteren From jg307 at cam.ac.uk Fri Apr 4 09:44:40 2008 From: jg307 at cam.ac.uk (James Graham) Date: Fri, 04 Apr 2008 17:44:40 +0100 Subject: [imps] Tree construction test in undocumented format In-Reply-To: References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> <74FF9010-9885-4905-8FB8-0477E097BDC5@iki.fi> Message-ID: <47F65AF8.80306@cam.ac.uk> Anne van Kesteren wrote: > On Fri, 04 Apr 2008 18:14:13 +0200, Henri Sivonen wrote: >> How about where %foo_id is >> either a double-quoted string or the string 'null' without quotes. >> >> But if both are null, use >> > > Sure, lets do that. I would prefer not to use the unmatched quote, since some editors insert double quotes automatically. Is there a problem with just doing: If we need to distinguish the null case from the empty string case we could just, as suggested, replace the whole quoted thing with unquoted null in those cases. (the no-public/system case should still be ) -- "Eternity's a terrible thought. I mean, where's it all going to end?" -- Tom Stoppard, Rosencrantz and Guildenstern are Dead From hsivonen at iki.fi Fri Apr 4 11:32:40 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Fri, 4 Apr 2008 21:32:40 +0300 Subject: [imps] Tree construction test in undocumented format In-Reply-To: <47F65AF8.80306@cam.ac.uk> References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> <74FF9010-9885-4905-8FB8-0477E097BDC5@iki.fi> <47F65AF8.80306@cam.ac.uk> Message-ID: <119BD728-9651-49EB-98B1-20456CA4CAF4@iki.fi> On Apr 4, 2008, at 19:44, James Graham wrote: > Is there a problem with just doing: > > > If we need to distinguish the null case from the empty string case > we could just, as suggested, replace the whole quoted thing with > unquoted null in those cases. Oops. That's what I meant. Too many typos lately. 
-- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From hsivonen at iki.fi Fri Apr 4 11:45:26 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Fri, 4 Apr 2008 21:45:26 +0300 Subject: [imps] Tree construction test in undocumented format In-Reply-To: <119BD728-9651-49EB-98B1-20456CA4CAF4@iki.fi> References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> <74FF9010-9885-4905-8FB8-0477E097BDC5@iki.fi> <47F65AF8.80306@cam.ac.uk> <119BD728-9651-49EB-98B1-20456CA4CAF4@iki.fi> Message-ID: <52FE5396-925C-4581-9860-043CB148E003@iki.fi> On Apr 4, 2008, at 21:32, Henri Sivonen wrote: > On Apr 4, 2008, at 19:44, James Graham wrote: >> Is there a problem with just doing: >> >> >> If we need to distinguish the null case from the empty string case >> we could just, as suggested, replace the whole quoted thing with >> unquoted null in those cases. > > Oops. That's what I meant. Too many typos lately. Even more oops: Neither string can be null per the current spec, so the rule should be unless both are the empty string in which case -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From t.broyer at gmail.com Fri Apr 4 17:24:48 2008 From: t.broyer at gmail.com (Thomas Broyer) Date: Sat, 5 Apr 2008 02:24:48 +0200 Subject: [imps] Tree construction test in undocumented format In-Reply-To: References: <73D9AE56-B237-45AC-9E75-EB08659F7DAD@iki.fi> <47F6503A.4040309@cam.ac.uk> Message-ID: On Fri, Apr 4, 2008 at 6:11 PM, Anne van Kesteren wrote: > On Fri, 04 Apr 2008 18:07:46 +0200, Anne van Kesteren > wrote: > > no public or system ID: > > public ID, no system ID: > > no public ID, system ID: > > public and system ID: > > Instead of fooling around with spaces alternatively we could put a P or S > before the "..." in the case of either a public ID or system ID. Whether > or not the document is in quirks mode should probably be something else. > Maybe: > > #document-mode > nq|q|lq > > or something like that. 
In http://html5.googlecode.com/svn/trunk/tests/tree-construction I've used an XML-like/SGML-like DOCTYPE serialization:
- - - -

And in http://html5.googlecode.com/svn/trunk/tests/tree-construction/compatibility-mode.dat:
#compatibility-mode no quirks
-or-
#compatibility-mode quirks
-or-
#compatibility-mode limited quirks

Note that I've also used for comments instead of the current (note the spaces).

--
Thomas Broyer

From edwardzyang at thewritingpot.com Fri Apr 4 21:06:42 2008
From: edwardzyang at thewritingpot.com (Edward Z. Yang)
Date: Sat, 05 Apr 2008 00:06:42 -0400
Subject: [imps] HTML5 and libxml2
Message-ID: <47F6FAD2.8010105@thewritingpot.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

As per the W3C 5 April 2008 working draft, elements not recognized by HTML5 in body are still added to the DOM using the "A start tag token not covered by the previous entries" rule. HTML5 does not specify any validation mechanism in which to ensure the element has the form stipulated by tag name, i.e. [A-Za-z-]+

Unfortunately, certain tag names cause libxml2 to choke, and HTML5 doesn't specify any way to:

1. Munge the name into something libxml2 finds acceptable
2. Ignore the tag as invalid

Without modifying the algorithms, (2) is not tenable, so I've been looking at (1). However, HTML5's tag name stipulations appear to be too restrictive: they do not allow digits as seen in h1 and friends, and aren't even a subset of the allowed XML tag names (XML specifies that a hyphen cannot lead in a tag name, and allows a greater variety of punctuation and international characters).

So, in short, due to underlying library limitations I can't put arbitrary characters in a tag (which is what Firefox actually seems to do), and I don't know exactly what characters I need to get rid of. Advice?

[1] http://www.w3.org/html/wg/html5/#tag-name
[2] http://www.w3.org/TR/REC-xml/#NT-Name

- --
Edward Z. Yang GnuPG: 0x869C48DA
HTML Purifier Anti-XSS Filter
[[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH9vrSqTO+fYacSNoRAvt/AJ494M4fINnrRUAf/GbJgvvjoP6XqgCdGE4a
/1CeZKB6aFjfU+CEBzhukXA=
=nwJ1
-----END PGP SIGNATURE-----

From ian at hixie.ch Fri Apr 4 22:34:24 2008
From: ian at hixie.ch (Ian Hickson)
Date: Sat, 5 Apr 2008 05:34:24 +0000 (UTC)
Subject: [imps] HTML5 and libxml2
In-Reply-To: <47F6FAD2.8010105@thewritingpot.com>
References: <47F6FAD2.8010105@thewritingpot.com>
Message-ID:

On Sat, 5 Apr 2008, Edward Z. Yang wrote:
>
> HTML5 does not specify any validation mechanism in which to ensure the
> element has the form stipulated by tag name, i.e. [A-Za-z-]+

That (erroneous, as it happens) paragraph is just describing a trend in the spec's tag names, it's not a conformance criteria of any kind. The conformance criteria is really just that the elements in the document have to be the elements defined by the spec.

You may find this post helpful in determining how to read the HTML5 spec:

http://ln.hixie.ch/?start=1140242962&count=1

> Unfortunately, certain tag names cause libxml2 to choke, and HTML5
> doesn't specify any way to:
>
> 1. Munge the name into something libxml2 finds acceptable
> 2. Ignore the tag as invalid

Indeed, both of these behaviours would be non-conforming.

Can you change libxml2 to support more characters?
Is there a real technical reason for the limitation, or is it just enforcing XML requirements? The characters allowed in tag names are by far not the only area where XML and HTML differ, so if it is just a matter of libxml2 enforcing XML's requirements, it will not work well.

> So, in short, due to underlying library limitations I can't put
> arbitrary characters in a tag (which is what Firefox actually seems to
> do), and I don't know exactly what characters I need to get rid of. Advice?

If you can't implement what the spec requires, then make sure to document the limitations clearly in your documentation. Meanwhile, you can probably get away with replacing unusable characters with U+FFFD, or at a pinch, "_", so long as you still use the full tag names in the parser to determine which tags are open. However, make sure to document this as being a conformance problem in your documentation.

--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

From hsivonen at iki.fi Sat Apr 5 01:10:40 2008
From: hsivonen at iki.fi (Henri Sivonen)
Date: Sat, 5 Apr 2008 11:10:40 +0300
Subject: [imps] HTML5 and libxml2
In-Reply-To: <47F6FAD2.8010105@thewritingpot.com>
References: <47F6FAD2.8010105@thewritingpot.com>
Message-ID:

On Apr 5, 2008, at 07:06, Edward Z. Yang wrote:
> Unfortunately, certain tag names cause libxml2 to choke, and HTML5
> doesn't specify any way to:
>
> 1. Munge the name into something libxml2 finds acceptable
> 2. Ignore the tag as invalid
>
> Without modifying the algorithms, (2) is not tenable, so I've been
> looking at (1).
[...]
> So, in short, due to underlying library limitations I can't put
> arbitrary characters in a tag (which is what Firefox actually seems to
> do), and I don't know exactly what characters I need to get rid of.
> Advice?
In the Validator.nu HTML parser, I've solved this by having three available policies:

public enum XmlViolationPolicy {
    /**
     * Conform to HTML 5, allow XML 1.0 to be violated.
     */
    ALLOW,

    /**
     * Halt when something cannot be mapped to XML 1.0.
     */
    FATAL,

    /**
     * Be non-conforming and alter the infoset to fit
     * XML 1.0 when something would otherwise not be
     * mappable to XML 1.0.
     */
    ALTER_INFOSET
}

It seems like ALLOW isn't a possibility for libxml2. With ALTER_INFOSET, tag tokens that do not match Namespaces in XML 1.0 NCName are ignored in the tokenizer. This is non-conforming but works most of the time. (There are many more similar situations you can find by searching for ALTER_INFOSET in the source.)

--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

From annevk at opera.com Sat Apr 5 05:17:21 2008
From: annevk at opera.com (Anne van Kesteren)
Date: Sat, 05 Apr 2008 14:17:21 +0200
Subject: [imps] HTML5 parser test location
Message-ID:

Just an FYI since there seems to be some misinformation about this (for which I might have been responsible, not sure), but hereby a note that we'd like to keep all HTML5 parser tests in the html5lib project tree and not move them all to the html5 project tree. For licensing reasons, because the html5 project owner doesn't like it, and because it just isn't worth the trouble.

(If I was unclear about this in the past or have given the impression of supporting the opposite view, my apologies.)
--
Anne van Kesteren

From t.broyer at gmail.com Sat Apr 5 06:22:48 2008
From: t.broyer at gmail.com (Thomas Broyer)
Date: Sat, 5 Apr 2008 15:22:48 +0200
Subject: [imps] HTML5 parser test location
In-Reply-To:
References:
Message-ID:

On Sat, Apr 5, 2008 at 2:17 PM, Anne van Kesteren wrote:
> Just an FYI since there seems to be some misinformation about this (for
> which I might have been responsible, not sure), but hereby a note that we'd
> like to keep all HTML5 parser tests in the html5lib project tree and not
> move them all to the html5 project tree. For licensing reasons, because the
> html5 project owner doesn't like it, and because it just isn't worth the
> trouble.
>
> (If I was unclear about this in the past or have given the impression of
> supporting the opposite view, my apologies.)

FYI, I started the new tests (not really a "move" 'cause I've been rewriting them all from scratch) based on the following thread:
http://lists.whatwg.org/pipermail/implementors-whatwg.org/2007-July/000126.html
Particularly the following two messages:
http://lists.whatwg.org/pipermail/implementors-whatwg.org/2007-July/000127.html
http://lists.whatwg.org/pipermail/implementors-whatwg.org/2007-August/000129.html

I personally have no preference for any license (MIT or Apache 2) so if people (including those who don't work on html5lib, like Henri Sivonen) prefer the tests to be MIT-licensed, I don't mind. But please note that the tests in the html5 repository are complete rewrites, not copies from html5lib's tests.

Now, if Ian changed his mind, and to keep them "implementation-agnostic", how about an html5-tests project?
(MIT-licensed if that's what implementors want)

My main goal with the new tests is to keep them:
- independent of any implementation, so that we can keep them in sync with the spec, not the software (see the above thread's first message)
- organized wrt the spec, so that when the spec changes it's easier to locate the tests that need to be updated

Projects using those tests could specify a particular revision they aim to "pass", in their svn:external import for example; and change the svn:external revision when they update the implementation to follow the spec changes.

--
Thomas Broyer

From annevk at opera.com Sat Apr 5 08:01:17 2008
From: annevk at opera.com (Anne van Kesteren)
Date: Sat, 05 Apr 2008 17:01:17 +0200
Subject: [imps] HTML5 parser test location
In-Reply-To:
References:
Message-ID:

On Sat, 05 Apr 2008 15:22:48 +0200, Thomas Broyer wrote:
> My main goal with the new tests is to keep them:
> - independent of any implementation, so that we can keep them in sync
> with the spec, not the software (see the above thread's first message)

I thought about this and agree that this is problematic. Given that the main code is more stable maybe we could move to a model where it is not problematic for the trunk code to fail tests that have been reviewed by several sources. To make a new release we basically need to pass all tests we currently fail on trunk instead of trying to develop tests and code side by side.

This would remove the need for the tests to be independent of the implementation. If people still prefer stability for their implementation in the html5lib trunk tree they could make a copy of the test suite and merge that every time the test suite changes while updating their implementation.

--
Anne van Kesteren

From edwardzyang at thewritingpot.com Sat Apr 5 08:04:13 2008
From: edwardzyang at thewritingpot.com (Edward Z.
Yang) Date: Sat, 05 Apr 2008 11:04:13 -0400 Subject: [imps] HTML5 and libxml2 In-Reply-To: References: <47F6FAD2.8010105@thewritingpot.com> Message-ID: <47F794ED.3090801@thewritingpot.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Ian Hickson wrote: > That (erroneous, as it happens) paragraph is just describing a trend > in the spec's tag names, it's not a conformance criteria of any kind. > Should I submit a patch fixing the error? > The conformance criteria is really just that the elements in the > document have to be the elements defined by the spec. But the spec also defines behavior when elements are outside of the spec, i.e. an error-condition. I'd appreciate it if the allowed tag names is made a normative requirement for such elements. > You may find this post helpful in determining how to read the HTML5 > spec: Thanks. I've heard of RFC2119 before, but I didn't realize that statements of fact that don't use those keywords should not be considered normative. Many of W3C's specs explicitly state which statements are normative and which of informative. > Can you change libxml2 to support more characters? Is there a real > technical reason for the limitation, or is it just enforcing XML > requirements? I've pinged the libxml2 list, should have an answer back soon. > The characters allowed in tag names are by far not the only area > where XML and HTML differ, so if it is just a matter of libxml2 > enforcing XML's requirements, it will not work well. What are these differences explicitly? > If you can't implement what the spec requires, then make sure to > document the limitations clearly in your documentation. Meanwhile, > you can probably get away with replacing unusable characters with > U+FFFD, Unfortunately, U+FFFD is an invalid character too. :-) > or at a pinch, "_", so long as you still use the full tag anems in > the parser to determine which tags are open. However, make sure to > document this as being a conformance problem in your documentation. 
This might be tricky, and it occurs to me that as long as the substitution process works the same for the tags, t becomes t which is equivalent. I will, of course, document it. Henri Sivonen: > With ALTER_INFOSET, tag tokens that do not match Namespaces in XML > 1.0 NCName are ignored in the tokenizer. This is non-conforming but > works most of the time. (There are many more similar situations you > can find by searching for ALTER_INFOSET in the source.) This is what I had been considering with (2), but it looked like I'd have to make multiple modifications in the algorithm to get that to work. I would look at the source, but I can't seem to find it! All I can find is the build script, and I don't have Python. - -- Edward Z. Yang GnuPG: 0x869C48DA HTML Purifier Anti-XSS Filter [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH95TtqTO+fYacSNoRAk+IAJ4gPLXGHbSuAsUQBaO2Fgu4XMm5WQCfbSd/ JAcnZflMEh0uxRbJ2gwww9E= =t+U6 -----END PGP SIGNATURE----- From hsivonen at iki.fi Sat Apr 5 13:06:07 2008 From: hsivonen at iki.fi (Henri Sivonen) Date: Sat, 5 Apr 2008 23:06:07 +0300 Subject: [imps] HTML5 and libxml2 In-Reply-To: <47F794ED.3090801@thewritingpot.com> References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> Message-ID: On Apr 5, 2008, at 18:04, Edward Z. Yang wrote: > Henri Sivonen: >> With ALTER_INFOSET, tag tokens that do not match Namespaces in XML >> 1.0 NCName are ignored in the tokenizer. This is non-conforming but >> works most of the time. (There are many more similar situations you >> can find by searching for ALTER_INFOSET in the source.) > > This is what I had been considering with (2), but it looked like I'd > have to make multiple modifications in the algorithm to get that to > work. I would look at the source, but I can't seem to find it! 
> All I can find is the build script, and I don't have Python. The parser source is also in the parser distribution package available from: http://about.validator.nu/htmlparser/ (Currently: http://about.validator.nu/htmlparser/htmlparser-1.0.7.zip ) -- Henri Sivonen hsivonen at iki.fi http://hsivonen.iki.fi/ From ian at hixie.ch Sat Apr 5 14:55:57 2008 From: ian at hixie.ch (Ian Hickson) Date: Sat, 5 Apr 2008 21:55:57 +0000 (UTC) Subject: [imps] HTML5 parser test location In-Reply-To: References: Message-ID: On Sat, 5 Apr 2008, Thomas Broyer wrote: > > Now, if Ian changed his mind I'm happy for the html5 project to be used if that's what people want, my only concern is that they already are known to be elsewhere now and I really don't want a mixture of tests in different places, as that will just fragment our progress. Sorry for flipflopping. -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.' From ian at hixie.ch Sat Apr 5 15:05:00 2008 From: ian at hixie.ch (Ian Hickson) Date: Sat, 5 Apr 2008 22:05:00 +0000 (UTC) Subject: [imps] HTML5 and libxml2 In-Reply-To: <47F794ED.3090801@thewritingpot.com> References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> Message-ID: On Sat, 5 Apr 2008, Edward Z. Yang wrote: > > Ian Hickson wrote: > > That (erroneous, as it happens) paragraph is just describing a trend > > in the spec's tag names, it's not a conformance criteria of any kind. > > Should I submit a patch fixing the error? I have noted it and will fix it in due course. :-) > > The conformance criteria is really just that the elements in the > > document have to be the elements defined by the spec. > > But the spec also defines behavior when elements are outside of the > spec, i.e. an error-condition. I'd appreciate it if the allowed tag > names is made a normative requirement for such elements. 
The normative requirement for such elements is that they are _all_ invalid, even if they just use a-z characters. The range of characters that can be used by elements that aren't allowed is the empty range.

> > The characters allowed in tag names are by far not the only area where
> > XML and HTML differ, so if it is just a matter of libxml2 enforcing
> > XML's requirements, it will not work well.
>
> What are these differences explicitly?

Well for example an XML comment cannot contain the string "--".

> > If you can't implement what the spec requires, then make sure to
> > document the limitations clearly in your documentation. Meanwhile, you
> > can probably get away with replacing unusable characters with U+FFFD,
> > or at a pinch, "_", so long as you still use the full tag names in the
> > parser to determine which tags are open. However, make sure to
> > document this as being a conformance problem in your documentation.
>
> This might be tricky, and it occurs to me that as long as the
> substitution process works the same for the tags, t becomes
> t which is equivalent. I will, of course, document it.

What I meant is make sure that your code handles:

X

...as creating a DOM tree where the third tag above closes the first one, not the second one. i.e. in your parser and the stack of elements you should keep the original tag names, and only give the munged tag names to the DOM tree.

HTH,
--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

From edwardzyang at thewritingpot.com Sat Apr 5 15:48:30 2008
From: edwardzyang at thewritingpot.com (Edward Z.
Yang) Date: Sat, 05 Apr 2008 18:48:30 -0400 Subject: [imps] HTML5 and libxml2 In-Reply-To: References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> Message-ID: <47F801BE.2050404@thewritingpot.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 As an informative note, the tag limitation is with PHP's DOM extension and not libxml2. I'm probably going to do an implementation similar to Validator.nu's, although that's given that I'm interested enough in making major architectural changes to the unmaintained PH5P. Ian Hickson wrote: > The normative requirement for such elements is that they are _all_ > invalid, even if they just use a-z characters. The range of characters > that can be used by elements that aren't allowed is the empty range. I understand this; however, since HTML has graceful error handling, even though such elements are invalid we should still have well-defined handling for them. Which, I suppose, it does. :-) > Well for example an XML comment cannot contain the string "--". I took a look at the source code for Validator.nu and all the differences are there. > What I meant is make sure that you code handles: > > X > > ...as creating a DOM tree where the third tag above closes the first one, > not the second one. i.e. in your parser and the stack of elements you > should keep the original tag names, and only give the munged tag names to > the the DOM tree. Duly noted. - -- Edward Z. 
Yang GnuPG: 0x869C48DA HTML Purifier Anti-XSS Filter [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH+AG+qTO+fYacSNoRAjXNAJ4jDKNJ/tqPCGe6px+mWAY8yK/+nQCcD85N JsMyioGbTvC3OYdVnAPrus4= =e5ES -----END PGP SIGNATURE----- From foolistbar at googlemail.com Sun Apr 6 04:32:42 2008 From: foolistbar at googlemail.com (Geoffrey Sneddon) Date: Sun, 6 Apr 2008 12:32:42 +0100 Subject: [imps] HTML5 and libxml2 In-Reply-To: <47F801BE.2050404@thewritingpot.com> References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> <47F801BE.2050404@thewritingpot.com> Message-ID: On 5 Apr 2008, at 23:48, Edward Z. Yang wrote: > As an informative note, the tag limitation is with PHP's DOM extension > and not libxml2. I'm probably going to do an implementation similar to > Validator.nu's, although that's given that I'm interested enough in > making major architectural changes to the unmaintained PH5P. Is there a bug report in the PHP bug database? This most certainly is a violation of the DOM specification. -- Geoffrey Sneddon From edwardzyang at thewritingpot.com Sun Apr 6 06:08:55 2008 From: edwardzyang at thewritingpot.com (Edward Z. Yang) Date: Sun, 06 Apr 2008 09:08:55 -0400 Subject: [imps] HTML5 and libxml2 In-Reply-To: References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> <47F801BE.2050404@thewritingpot.com> Message-ID: <47F8CB67.6010905@thewritingpot.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Geoffrey Sneddon wrote: > Is there a bug report in the PHP bug database? This most certainly is a > violation of the DOM specification. http://bugs.php.net/bug.php?id=44648 Although, the DOM specification clearly states that an INVALID_CHARACTER_ERROR should be thrown when the tag name is "invalid". - -- Edward Z. 
Yang GnuPG: 0x869C48DA HTML Purifier Anti-XSS Filter [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH+MtmqTO+fYacSNoRAg6+AJ4vnReEXBH9eOGVMXhszYOgSFGBmQCfWYOs NFG5AmZu5qgzeM6aThWBaDA= =JrVQ -----END PGP SIGNATURE----- From foolistbar at googlemail.com Sun Apr 6 06:57:16 2008 From: foolistbar at googlemail.com (Geoffrey Sneddon) Date: Sun, 6 Apr 2008 14:57:16 +0100 Subject: [imps] HTML5 and libxml2 In-Reply-To: <47F8CB67.6010905@thewritingpot.com> References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> <47F801BE.2050404@thewritingpot.com> <47F8CB67.6010905@thewritingpot.com> Message-ID: On 6 Apr 2008, at 14:08, Edward Z. Yang wrote: > Geoffrey Sneddon wrote: >> Is there a bug report in the PHP bug database? This most certainly >> is a >> violation of the DOM specification. > > http://bugs.php.net/bug.php?id=44648 > > Although, the DOM specification clearly states that an > INVALID_CHARACTER_ERROR should be thrown when the tag name is > "invalid". What happens when you set DOMDocument::$strictErrorChecking to false? In the DOM spec, behaviour then is undefined. -- Geoffrey Sneddon From edwardzyang at thewritingpot.com Sun Apr 6 10:16:02 2008 From: edwardzyang at thewritingpot.com (Edward Z. Yang) Date: Sun, 06 Apr 2008 13:16:02 -0400 Subject: [imps] HTML5 and libxml2 In-Reply-To: References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> <47F801BE.2050404@thewritingpot.com> <47F8CB67.6010905@thewritingpot.com> Message-ID: <47F90552.2080905@thewritingpot.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Geoffrey Sneddon wrote: > What happens when you set DOMDocument::$strictErrorChecking to false? In > the DOM spec, behaviour then is undefined. This is a proprietary extension, and defines whether or not PHP should throw actual Exceptions with DOM errors, or emit warnings. 
The behavior with DOM, provided the exception is properly caught, remains the same, as after the C code invokes the exception or emits the error, control is passed back to the PHP interpreter. (sorry about the dupe; accidentally made an off-list post) - -- Edward Z. Yang GnuPG: 0x869C48DA HTML Purifier Anti-XSS Filter [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH+QVSqTO+fYacSNoRAqZnAJ4hmwcMZNLNW/sPEOs21uKiR73wAQCfdBcJ bmv2RYCgt2wjWwd4vDEZ3jY= =ris/ -----END PGP SIGNATURE----- From foolistbar at googlemail.com Sun Apr 6 10:36:56 2008 From: foolistbar at googlemail.com (Geoffrey Sneddon) Date: Sun, 6 Apr 2008 18:36:56 +0100 Subject: [imps] HTML5 and libxml2 In-Reply-To: <47F90552.2080905@thewritingpot.com> References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> <47F801BE.2050404@thewritingpot.com> <47F8CB67.6010905@thewritingpot.com> <47F90552.2080905@thewritingpot.com> Message-ID: <0CAB6556-942F-42FC-80B4-50B1AD2819B0@googlemail.com> On 6 Apr 2008, at 18:16, Edward Z. Yang wrote: > Geoffrey Sneddon wrote: >> What happens when you set DOMDocument::$strictErrorChecking to >> false? In >> the DOM spec, behaviour then is undefined. > > This is a proprietary extension, and defines whether or not PHP should > throw actual Exceptions with DOM errors, or emit warnings. It's not proprietary: it's part of DOM Level 3 Core, as PHP claims to implement (see ). > The behavior > with DOM, provided the exception is properly caught, remains the same, > as after the C code invokes the exception or emits the error, > control is > passed back to the PHP interpreter. Ah. So therefore it doesn't actually allow the DOM to hold characters it otherwise wouldn't (like a a@ localName). -- Geoffrey Sneddon From edwardzyang at thewritingpot.com Sun Apr 6 10:47:46 2008 From: edwardzyang at thewritingpot.com (Edward Z. 
Yang) Date: Sun, 06 Apr 2008 13:47:46 -0400 Subject: [imps] HTML5 and libxml2 In-Reply-To: <0CAB6556-942F-42FC-80B4-50B1AD2819B0@googlemail.com> References: <47F6FAD2.8010105@thewritingpot.com> <47F794ED.3090801@thewritingpot.com> <47F801BE.2050404@thewritingpot.com> <47F8CB67.6010905@thewritingpot.com> <47F90552.2080905@thewritingpot.com> <0CAB6556-942F-42FC-80B4-50B1AD2819B0@googlemail.com> Message-ID: <47F90CC2.4050105@thewritingpot.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Geoffrey Sneddon wrote: > It's not proprietary: it's part of DOM Level 3 Core, as PHP claims to > implement (see > ). You're right---I was looking at DOM Level 2 Core. > Ah. So therefore it doesn't actually allow the DOM to hold characters it > otherwise wouldn't (like a a@ localName). Precisely. And since the behavior is undefined, the PHP developers are free to implement this however they want. To make these errors not stop execution without strictErrorChecking, one would probably have to macro-fy php_dom_throw_error to include the appropriate return values, and then remove any trailing RETURN_* macro-calls... which doesn't really seem worth it for them, although that makes strictErrorChecking slightly useless. :-) It also makes me slightly worried about cases where libxml2 *requires* that the validation is done (for example, libxml2 does not appear to be binary safe, whereas PHP strings are). - -- Edward Z. 
Yang GnuPG: 0x869C48DA HTML Purifier Anti-XSS Filter [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH+QzCqTO+fYacSNoRAoELAJwLG3NzFvvkXZK0fCgbVyjsvp+V6wCfYeDv a/6q8ySgwJ2TfjpKQSxSFr0= =/16o -----END PGP SIGNATURE----- From jg307 at cam.ac.uk Mon Apr 7 07:36:53 2008 From: jg307 at cam.ac.uk (James Graham) Date: Mon, 07 Apr 2008 15:36:53 +0100 Subject: [imps] HTML5 parser test location In-Reply-To: References: Message-ID: <47FA3185.9030302@cam.ac.uk> Thomas Broyer wrote: > On Sat, Apr 5, 2008 at 2:17 PM, Anne van Kesteren wrote: >> Just an FYI since there seems to be some misinformation about this (for >> which I might have been responsible, not sure), but hereby a note that we'd >> like to keep all HTML5 parser tests in the html5lib project tree and not >> move them all to the html5 project tree. For licensing reasons, because the >> html5 project owner doesn't like it, and because it just isn't worth the >> trouble. >> >> (If I was unclear about this in the past or have given the impression of >> supporting the opposite view, my apologies.) > > FYI, I started the new tests (not really a "move" 'cause I've been > rewriting them all from scratch) based on the following thread: > http://lists.whatwg.org/pipermail/implementors-whatwg.org/2007-July/000126.html > Particularly the following two messages: > http://lists.whatwg.org/pipermail/implementors-whatwg.org/2007-July/000127.html > http://lists.whatwg.org/pipermail/implementors-whatwg.org/2007-August/000129.html My view on this is: * The tests must remain MIT licensed. Therefore they can't go in the html5 repository * I'm -0 on moving the tests at all. That is, I'm not strongly against it but I'm not sure it represents a good investment of time. * If we do decide to move tests, we should consider the advantages of a distributed version control system like Mercurial. 
It seems like a situation where slightly disjoint groups of people are all editing a common set of files might play to the strengths of those systems. * I am totally against rewriting tests. The current tests have often been written in response to actual regressions in software. Throwing away all that knowledge of fragile points in the various implementations is unacceptable. Adding extra tests is of course fine. * One of the identified problems with the current test suite is that it is hard to determine which tests need to change when the spec changes. There are various ways to improve this without starting over. Specifically it is not hard to instrument html5lib to monitor which phase it is in at a given time. One can imagine using this to automatically identify testcases that cause the parser to go through altered parts of the spec. -- "Eternity's a terrible thought. I mean, where's it all going to end?" -- Tom Stoppard, Rosencrantz and Guildenstern are Dead From ryan at theryanking.com Mon Apr 7 12:49:18 2008 From: ryan at theryanking.com (Ryan King) Date: Mon, 7 Apr 2008 12:49:18 -0700 Subject: [imps] HTML5 parser test location In-Reply-To: <47FA3185.9030302@cam.ac.uk> References: <47FA3185.9030302@cam.ac.uk> Message-ID: <2640F92A-964E-494A-8574-EC557FCC4702@theryanking.com> On Apr 7, 2008, at 7:36 AM, James Graham wrote: > * I am totally against rewriting tests. The current tests have often > been > written in response to actual regressions in software. Throwing away > all that > knowledge of fragile points in the various implementations is > unacceptable. > Adding extra tests is of course fine. I want to second this point. Almost all the tests I've written have been to deal with issues discovered in the ruby implementation. 
-ryan From rubys at intertwingly.net Mon Apr 7 18:14:29 2008 From: rubys at intertwingly.net (Sam Ruby) Date: Mon, 07 Apr 2008 21:14:29 -0400 Subject: [imps] HTML5 parser test location In-Reply-To: <2640F92A-964E-494A-8574-EC557FCC4702@theryanking.com> References: <47FA3185.9030302@cam.ac.uk> <2640F92A-964E-494A-8574-EC557FCC4702@theryanking.com> Message-ID: <47FAC6F5.6010809@intertwingly.net> Ryan King wrote: > On Apr 7, 2008, at 7:36 AM, James Graham wrote: >> * I am totally against rewriting tests. The current tests have often >> been >> written in response to actual regressions in software. Throwing away >> all that >> knowledge of fragile points in the various implementations is >> unacceptable. >> Adding extra tests is of course fine. > > I want to second this point. Almost all the tests I've written have > been to deal with issues discovered in the ruby implementation. Is there some way we can segregate the tests into ones that we expect to pass and ones that we (currently) don't expect to pass? - Sam Ruby From edwardzyang at thewritingpot.com Mon Apr 7 15:55:29 2008 From: edwardzyang at thewritingpot.com (Edward Z. Yang) Date: Mon, 07 Apr 2008 18:55:29 -0400 Subject: [imps] HTML5 parser test location In-Reply-To: <47FAC6F5.6010809@intertwingly.net> References: <47FA3185.9030302@cam.ac.uk> <2640F92A-964E-494A-8574-EC557FCC4702@theryanking.com> <47FAC6F5.6010809@intertwingly.net> Message-ID: <47FAA661.5000207@thewritingpot.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Sam Ruby wrote: > Is there some way we can segregate the tests into ones that we expect to > pass and ones that we (currently) don't expect to pass? How about an extra boolean flag in the test structs? This has the added benefit of not needing to shuffle tests around once they do start passing, and also being able to test the inverse: whether or not the tests we don't expect to pass are failing. - -- Edward Z. 
Yang GnuPG: 0x869C48DA
HTML Purifier Anti-XSS Filter
[[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD4DBQFH+qZhqTO+fYacSNoRArtkAJ9F5kNuoLaqjNhWed5Stb12+Eq0tACXXidx
xAavEyPMOYGmhqLBy16ORA==
=skuR
-----END PGP SIGNATURE-----

From edwardzyang at thewritingpot.com Mon Apr 7 16:17:39 2008
From: edwardzyang at thewritingpot.com (Edward Z. Yang)
Date: Mon, 07 Apr 2008 19:17:39 -0400
Subject: [imps] HTML5 parser test location
In-Reply-To: <47FAA7DC.9060201@cam.ac.uk>
References: <47FA3185.9030302@cam.ac.uk> <2640F92A-964E-494A-8574-EC557FCC4702@theryanking.com> <47FAC6F5.6010809@intertwingly.net> <47FAA661.5000207@thewritingpot.com> <47FAA7DC.9060201@cam.ac.uk>
Message-ID: <47FAAB93.8000400@thewritingpot.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

James Graham wrote:
> That doesn't work because multiple people are using the tests and their
> implementations may be at different levels of conformance.
>
> We'd have to have out-of-band metadata like a list of tests to skip.

I don't think that was Sam Ruby's intent; I took it to mean tests for which the specification itself is faulty. If it has to do with differing levels of conformance, we're SOL.

- --
Edward Z. Yang GnuPG: 0x869C48DA
HTML Purifier Anti-XSS Filter
[[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH+quTqTO+fYacSNoRAh+sAJ9LdyrcdCu1ho9Ec1o3pVdM3volJQCdHIx4
TXAURwb97EcQ/Zpt6LL4N9Q=
=+7hX
-----END PGP SIGNATURE-----
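
A minimal sketch of the out-of-band approach mentioned in the thread above: each implementation keeps its own list of tests it does not yet expect to pass, separate from the shared test files, and the runner also reports listed tests that have started passing (the "inverse" check suggested earlier). This is only an illustration under stated assumptions. The "file:index" line format, the function names load_expected_failures and classify, and the run_test callback are all hypothetical and are not taken from html5lib or any other implementation.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical identifier: (test file name, index of the test within that file).
TestId = Tuple[str, int]


def load_expected_failures(lines: Iterable[str]) -> Set[TestId]:
    """Parse lines such as 'tests1.dat:42'.

    Blank lines are skipped and anything after '#' is treated as a comment.
    The line format is an assumption made for this sketch.
    """
    expected: Set[TestId] = set()
    for raw in lines:
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, _, index = line.rpartition(":")
        expected.add((name, int(index)))
    return expected


def classify(
    test_ids: Iterable[TestId],
    run_test: Callable[[TestId], bool],
    expected_failures: Set[TestId],
) -> Dict[str, List[TestId]]:
    """Run every test and sort the results into four buckets.

    run_test is a caller-supplied callback returning True when a test passes.
    """
    report: Dict[str, List[TestId]] = {
        "pass": [],
        "fail": [],
        "expected-fail": [],
        "unexpected-pass": [],
    }
    for test_id in test_ids:
        passed = run_test(test_id)
        if test_id in expected_failures:
            # A listed test that now passes should be removed from the list.
            key = "unexpected-pass" if passed else "expected-fail"
        else:
            key = "pass" if passed else "fail"
        report[key].append(test_id)
    return report
```

Because the list lives with each implementation rather than in the shared test data, implementations at different levels of conformance can use the same test files unchanged. Only the "fail" and "unexpected-pass" buckets would need attention in a given run.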