[whatwg] Parser-related feedback
Ian Hickson
ian at hixie.ch
Wed Feb 10 18:40:28 PST 2010
On Thu, 29 Oct 2009, Matt Hall wrote:
>
> Prior to r4177, the matching of tag names for exiting the RCDATA/RAWTEXT
> states was done as follows:
>
> "...and the next few characters do not match the tag name of the last
> start tag token emitted (compared in an ASCII case-insensitive manner)"
>
> However, the current revision doesn't include any comment on character
> casing in its discussion of "Appropriate End Tags." Similarly, certain
> tokenizer states require that you check the contents of the "temporary
> buffer" against the string "script" but there is no indication of
> whether or not to do this in a case-insensitive manner.
>
> In both cases, should this comparison be done in an ASCII
> case-insensitive manner or not? It might be helpful to clarify the spec
> in both places in either case.
On Thu, 29 Oct 2009, Geoffrey Sneddon wrote:
>
> It is already case-insensitive as you lowercase the characters when
> creating the token name and when adding them to the buffer.
Indeed.
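
To illustrate why no explicit case-insensitive comparison is needed, here is a minimal sketch in Python. The class and method names are hypothetical and do not come from the spec; the only behaviour assumed is the one described above, namely that characters are ASCII-lowercased as they are appended to the tag name and to the temporary buffer.

```python
def ascii_lower(ch: str) -> str:
    """Lowercase only A-Z, leaving every other character untouched."""
    return chr(ord(ch) + 0x20) if "A" <= ch <= "Z" else ch


class EndTagMatcher:
    """Hypothetical helper: tracks the last start tag name and a temporary buffer."""

    def __init__(self) -> None:
        self.last_start_tag = ""
        self.temporary_buffer = ""

    def emit_start_tag(self, raw_name: str) -> None:
        # The tag name is lowercased character by character as it is built.
        self.last_start_tag = "".join(ascii_lower(c) for c in raw_name)

    def buffer_end_tag_name(self, raw_name: str) -> None:
        # The temporary buffer receives lowercased characters as well.
        self.temporary_buffer = "".join(ascii_lower(c) for c in raw_name)

    def is_appropriate_end_tag(self) -> bool:
        # Plain equality suffices because both sides are already lowercase.
        return self.temporary_buffer == self.last_start_tag
```

With this, a start tag written as "TITLE" followed by an end tag written as "Title" compares equal without any case folding at comparison time.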
On Fri, 30 Oct 2009, Matt Hall wrote:
>
> When the "script data" state was added to the tokenizer, the tree
> construction algorithm was updated to switch the tokenizer into this
> state upon finding a start tag named "script" while in the "in head"
> insertion mode (9.2.5.7). I see that a corresponding change was not made
> to 9.5 about "Parsing HTML Fragments" as it still says to switch into
> the RAWTEXT state upon finding a "script" tag. Does anyone know if this
> difference is intentional, or did someone just forget to update the
> fragment parsing case?
There's a comment now mentioning this explicitly. Is it ok?
On Tue, 10 Nov 2009, Kartikaya Gupta wrote:
>
> If you have a page like this:
>
> <!DOCTYPE HTML>
> <html><body>
> <font size="2" face="Verdana">
> <p align="left">Some text
> <font size="2" face="Verdana">
> <p align="left">Some text
> </body></html>
>
> according to the HTML5 parser rules, I believe this should create a DOM with 3 font elements that looks something like this:
>
> <!DOCTYPE HTML><HTML><HEAD></HEAD><BODY>
> <FONT face="Verdana" size="2">
> <P align="left">Some text
> <FONT face="Verdana" size="2">
> </FONT></P><P align="left"><FONT size="2" face="Verdana">Some text
>
> </FONT></P></FONT></BODY></HTML>
>
> However, if you extend the original source with another font/p combination, like so:
>
> <!DOCTYPE HTML>
> <html><body>
> <font size="2" face="Verdana">
> <p align="left">Some text
> <font size="2" face="Verdana">
> <p align="left">Some text
> <font size="2" face="Verdana">
> <p align="left">Some text
> </body></html>
>
> You end up with a DOM which has 6 font elements:
>
> <!DOCTYPE HTML><HTML><HEAD></HEAD><BODY>
> <FONT face="Verdana" size="2">
> <P align="left">Some text
> <FONT face="Verdana" size="2">
> </FONT></P><P align="left"><FONT size="2" face="Verdana">Some text
> <FONT face="Verdana" size="2">
> </FONT></FONT></P><P align="left"><FONT face="Verdana" size="2"><FONT size="2" face="Verdana">Some text
>
> </FONT></FONT></P></FONT></BODY></HTML>
>
> .. and so on. In general the number of font elements in the DOM grows
> polynomially, with the result that pages like [1] and [2] end up with
> hundreds of thousands of font elements. I haven't even been able to
> successfully parse [3] with either our own HTML5 parser or the one at
> validator.nu, it just gobbles up all available memory and asks for more.
>
> [1] http://www.miprepzone.com/past.asp?Category=%27news%27
> [2] http://info4.juridicas.unam.mx/ijure/tcfed/8.htm?s=
> [3] http://info4.juridicas.unam.mx/ijure/tcfed/1.htm?s=
>
> Is this behavior expected, or is it a bug in the spec? Obviously
> shipping browsers don't demonstrate this behavior (nor does Firefox's
> HTML5 parser - see bugzilla 525960) so I'm wondering if the spec could
> be modified to not have this polynomial-growth behavior.
I haven't checked if the exact behaviour you describe is what the spec
currently requires, but in general, there will always be cases where input
has a disproportionate effect on output, because backwards-compatible fixup
is basically constrained to very few possibilities, all of which have this
behaviour in certain cases.
In practice, it's not a huge issue, because you have to cope with these
cases even just to handle regular valid documents -- consider for example
an infinite document whose body is just <font><font><font><font>... with
no close tags. There are a number of pages on the Web that approximate
this, for example:
http://www.frikis.org/images/ascii/tux.html
On Tue, 24 Nov 2009, Daniel Glazman wrote:
>
> I think that insertAdjacentHTML as defined in current section 3.5.7 [1]
> could be much cleaner and clearer if
>
> 1 - "Adjacent" was dropped. It's useless. The name could be insertHTML.
>
> 2. if the values were "before", "firstchild", "lastchild", "after"
> instead of the current "beforebegin", "afterbegin", "beforeend" and
> "afterend" that seem to me visually related to start and end tags
> and not the element itself. Consistency with the existing DOM
> phraseology seems to me useful.
>
> [1] http://www.whatwg.org/specs/web-apps/current-work/multipage/apis-in-html-documents.html#insertadjacenthtml%28%29
On Tue, 24 Nov 2009, Anne van Kesteren wrote:
>
> The problem is that it is a legacy feature, much like innerHTML.
On Tue, 24 Nov 2009, Daniel Glazman wrote:
>
> That's not a problem. Make insertHTML with the new values and make
> insertAdjacentHTML with the old values just an alias to the new ones. Or
> the contrary. Or whatever. But it's not because it's shipped by MS that
> way that we must stick forever to such a horrible definition...
On Tue, 24 Nov 2009, Anne van Kesteren wrote:
>
> That is actually pretty much how we do it for every feature (consider
> e.g. XMLHttpRequest) because otherwise we have to duplicate too much
> functionality which only increases complexity. I.e. more tests, more
> APIs floating around, more documentation, backwards compatibility issues
> (new duplicate APIs won't work, but the ones they reflect do), etc.
On Wed, 25 Nov 2009, Dean Edwards wrote:
>
> Adding aliases does not reduce the horribleness of an API.
On Wed, 25 Nov 2009, Daniel Glazman wrote:
>
> Correct. But at least it eases a bit the pain and allows future
> deprecation.
We can basically never drop anything, so aliases in practice don't really
help. I haven't added an alias here, as I don't see much advantage to
doing so. The proposed alternative names aren't much better.
On Wed, 2 Dec 2009, ATSUSHI TAKAYAMA wrote:
>
> This was posted by Akatsuki Kitamura on the W3C Japanese Interest Group
> Mailing List.
>
> [...quoting the syntax section...]
>
> As far as I understand, if I want to write a void element with no
> attribute, such as the br, I do steps 1 ("<" character) and 2 (tag
> name), then ignore 3 and 4. In the step 5, since I don't have any
> attributes, the "after the attribute" situation does not apply here, so
> I ignore it too. Then I close the tag by going through step 6 ("/"
> character) and step 7 (">" character).
>
> Akatsuki's question was whether tags are still valid if you write space
> characters before closing the tag, as in the following:
>
> <br >
> <br />
>
> I think step 5 should be written as:
>
> After the attributes, or after the element's tag name if there are no
> attributes, then there may be one or more space characters.
Done.
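
As a rough illustration of the revised step 5, the following Python sketch checks attribute-less start tags against the steps discussed above: "<", a tag name, optional space characters, an optional "/", then ">". The regular expression and the function name are assumptions made for this example, not text from the spec, and attributes are deliberately not handled.

```python
import re

# Space characters are taken to be space, tab, LF, FF and CR.
# The pattern is a hypothetical simplification covering only tags
# without attributes.
ATTRIBUTELESS_TAG = re.compile(r"<[A-Za-z0-9]+[ \t\n\f\r]*/?>")


def is_attributeless_tag(markup: str) -> bool:
    """Return True if markup is a start tag with no attributes under the sketch."""
    return ATTRIBUTELESS_TAG.fullmatch(markup) is not None
```

Under this sketch, both "<br >" and "<br />" are accepted, while "< br>" is not, because the tag name must follow the "<" immediately.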
On Tue, 9 Feb 2010, Biju wrote:
>
> What should a user agent display when html content is...
>
> <html><body>
> <%@ page language="java" %>
> </body></html>
>
> [...]
On Tue, 9 Feb 2010, Tab Atkins Jr. wrote:
>
> All of these cases appear to be an ASP or PHP page that is accidentally
> being sent as ordinary html. You shouldn't be seeing these tags at all
> in the source of the page unless a server is misconfigured.
>
> That said, given that you *are* seeing them, I'm not certain what the
> correct behavior is, but it's definitely strictly defined in HTML5. Can
> someone else with more familiarity with the parser algorithm help out
> here?
On Wed, 10 Feb 2010, Boris Zbarsky wrote:
>
> For the "<%@" case, it looks like the state machine will go through the
> following states:
>
> Data state -> Tag open state
>
> When encountering a '%' in the "Tag open" state, the specification says:
>
> Parse error. Emit a U+003C LESS-THAN SIGN character token
> and reconsume the current input character in the data state.
>
> So the state will then remain "Data state" until the next '&' or '<' or EOF is
> seen, so the entire string up to the </body> will be treated as literal text.
>
> For the "<?" case, the state transitions will be:
>
> Data state -> Tag open state -> Bogus comment state
>
> Then the specification says to:
>
> Consume every character up to and including the first U+003E
> GREATER-THAN SIGN character (>) or the end of the file (EOF),
> whichever comes first. Emit a comment token whose data is the
> concatenation of all the characters starting from and including
> the character that caused the state machine to switch into the bogus
> comment state, up to and including the character immediately before
> the last consumed character (i.e. up to the character just before the
> U+003E or EOF character). (If the comment was started by the end of
> the file (EOF), the token is empty.)
>
> Switch to the data state.
>
> Or in other words, stop the bogus comment at the first '>' you see and
> then start parsing normally again. In this case, that means treating
> everything up to the next '<' or '&' or EOF as literal text.
>
> So the currently-specified behavior in fact matches the observed Firefox
> behavior (with either parser) on these simple testcases.
Sounds right.
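
The two cases Boris walks through can be sketched as a toy tokenizer in Python. This is a simplification under stated assumptions, not the spec's state machine: ordinary markup (anything starting with "<" followed by an ASCII letter, "/" or "!") is swallowed whole as an opaque "tag" token, and character references ("&") are not handled at all. The function name and the token shapes are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, data); the kinds used here are illustrative


def tokenize(source: str) -> List[Token]:
    tokens: List[Token] = []
    text: List[str] = []
    i, n = 0, len(source)

    def flush_text() -> None:
        # Emit any pending literal characters as one text token.
        if text:
            tokens.append(("text", "".join(text)))
            text.clear()

    while i < n:
        ch = source[i]
        if ch != "<":
            # Data state: ordinary characters are literal text.
            text.append(ch)
            i += 1
            continue
        # Tag open state: decide based on the character after "<".
        nxt = source[i + 1] if i + 1 < n else ""
        if nxt == "?":
            # Bogus comment state: runs to the first ">" or to EOF.
            end = source.find(">", i + 1)
            if end == -1:
                end = n
            flush_text()
            # Comment data starts at the "?" and stops before the ">".
            tokens.append(("comment", source[i + 1:end]))
            i = end + 1
        elif (nxt.isascii() and nxt.isalpha()) or nxt in ("/", "!"):
            # Simplification: real markup is not tokenized further here.
            end = source.find(">", i + 1)
            if end == -1:
                end = n
            flush_text()
            tokens.append(("tag", source[i:end + 1]))
            i = end + 1
        else:
            # Parse error: emit "<" as a character and reconsume the next
            # character in the data state.
            text.append("<")
            i += 1
    flush_text()
    return tokens
```

Feeding it the "<%@ page ... %>" line yields a single text token containing the whole string, while a "<?" construct yields a comment token that ends at the first ">", with whatever follows treated as text again.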
On Wed, 10 Feb 2010, Biju wrote:
>
> At least one page I saw was Case 1; the page was originally from a JSP
> or ASP template, later modified and saved as *.html
I recommend fixing the page. :-)
> So will IE and Safari (maybe Chrome also, I have not tested it) follow
> the Firefox way?
Hard to say. You'd have to ask Microsoft.
> Personally I prefer the IE way as I think one may be able to make a simple
> PHP or JSP editor just using the contentEditable feature.
Unfortunately the <%...%> stuff wouldn't round-trip correctly, since
there's no way to represent it in the DOM. So you couldn't really make a
PHP or JSP editor using contentEditable that way.
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'