[whatwg] Allowing ">" in attribute values
ash at ashleysheridan.co.uk
Fri Jun 25 05:39:13 PDT 2010
On Fri, 2010-06-25 at 13:28 +0100, Kornel Lesinski wrote:
> > I agree that disallowing ">" chars in attributes greatly simplifies parsing. Not
> > only with regular expressions, but any parsing.
> > If ">" is allowed, it means that in order to find the end of the element
> > you have to read all of its attributes first. This is very costly.
> You just need two extra states in the parser (toggled on " or '). I wouldn't call that "very costly".
> > Just one
> > example, but there are many others: let's imagine you'd like to convert an HTML
> > document into flat text. To simplify your algorithm you've chosen to
> > retrieve the content of the <body> element and then to delete all elements
> > in it. This is very fast if ">" is not allowed in attributes because you're
> > able to find element bounds just by searching for "<" and then ">". But if ">"
> > is allowed, the operation gets much more complicated, and you spend much
> > more time scanning all the elements.
> Conversion of HTML to text is more complicated than that - e.g. you shouldn't turn foo<br>bar into foobar, but you have to keep foo<b>bar as foobar. Implied <body> is allowed, you should extract <img alt>, you have to decode entities, etc. I think a check for a single character is just a drop in the ocean in such code.
> And if you're not concerned about accuracy of conversion, you can ignore the fact that ">" is allowed too. It's just going to be yet another tradeoff among many other, much bigger ones.
> >> Also take into consideration that even if ">" was forbidden in the spec,
> >> it wouldn't mean it doesn't happen in the wild. Since it works in browsers,
> >> you'd still have to support it if you wanted to parse markup from the web.
> > Allowing it in the spec and defining how the browser should behave if it
> > appears anyway are two different things.
> If you're parsing markup from the web, you have to support invalid markup that browsers accept, not merely pure markup that spec allows.
> There are reasons to disallow ">", but I'm not convinced that parsing performance is one of them.
I think maybe the best reason for disallowing it I've seen is where
attributes aren't correctly quoted:
Which could potentially break everything. At the moment, most browsers
deal with this as a missing quote, but if > were allowed in the value, they
would have to include the content after the > as part of it.
Parsing-wise, I don't see it being any more difficult except for very
basic parsing methods, and any time difference should be negligible.
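
To illustrate the "two extra states" point quoted above, here is a minimal
sketch in Python of a scanner that finds the end of a tag while skipping any
">" inside a quoted attribute value. The function name find_tag_end is
hypothetical, and the logic is a deliberate simplification: it treats any
quote character inside a tag as opening a value, which is not what the HTML5
tokenizer does.

```python
def find_tag_end(html: str, start: int) -> int:
    """Return the index of the '>' closing the tag that begins at html[start].

    Returns -1 if no closing '>' is found. A '>' inside a quoted attribute
    value is skipped. Simplified sketch, not the HTML5 tokenizer algorithm.
    """
    quote = None  # None: outside a quoted value; otherwise the open quote char
    for i in range(start + 1, len(html)):
        ch = html[i]
        if quote is not None:
            # Inside a quoted value: only the matching quote ends it.
            if ch == quote:
                quote = None
        elif ch in ('"', "'"):
            quote = ch  # enter the double-quoted or single-quoted state
        elif ch == ">":
            return i
    return -1


tag = '<a title="a > b" href=x>text</a>'
naive_end = tag.find(">")          # stops at the '>' inside the title value
aware_end = find_tag_end(tag, 0)   # stops at the '>' that really ends <a ...>
print(tag[:naive_end + 1])
print(tag[:aware_end + 1])

# With a missing closing quote, the quote-aware scan never leaves the quoted
# state, so everything after the stray quote is swallowed into the value.
broken = '<a href="foo>text</a>'
print(find_tag_end(broken, 0))
```

The extra work per character is one comparison against the current quote
state, which is consistent with the view that the time difference should be
negligible. The last case shows the misquoting problem: a quote-aware scan
keeps reading past the ">" until it meets another quote or runs out of input.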