[whatwg] Allowing ">" in attribute values

Fri Jun 25 05:39:13 PDT 2010

On Fri, 2010-06-25 at 13:28 +0100, Kornel Lesinski wrote:

> > I agree that disallowing ">" chars in attributes greatly simplifies parsing. Not
> > only with regular expressions, but any parsing.
> > If ">" is allowed, it means that in order to find the end of the element
> > you have to read all of its attributes first. This is very costly.
> 
> You just need two extra states in the parser (toggled on " or '). I wouldn't call that "very costly".
> 
> > Just an
> > example, but there are many others: let's imagine you'd like to convert an HTML
> > document into flat text. To simplify your algorithm you've chosen to
> > retrieve the content of the <body> element and then to delete all elements
> > in it. This is very fast if ">" is not allowed in attributes because you're
> > able to find element bounds just by searching for "<" and then ">". But if ">"
> > is allowed, the operation gets much more complicated, and you spend much
> > more time scanning all elements.
> 
> Conversion of HTML to text is more complicated than that - e.g. you shouldn't turn foo<br>bar into foobar, but you have to keep foo<b>bar as foobar. Implied <body> is allowed, you should extract <img alt>, you have to decode entities, etc. I think a check for a single character is just a drop in the ocean in such code.
> 
> And if you're not concerned about accuracy of conversion, you can ignore the fact that ">" is allowed too. It's just going to be yet another tradeoff among many other, much bigger ones.
> 
> > > Also take into consideration that even if ">" was forbidden in the spec,
> > > it wouldn't mean it doesn't happen in the wild. Since it works in
> > > browsers, you'd still have to support it if you wanted to parse markup
> > > from the web.
> > 
> > Allowing it in the spec and defining how the browser should behave if it
> > appears anyway are two different things.
> 
> If you're parsing markup from the web, you have to support invalid markup that browsers accept, not merely pure markup that the spec allows.
> 
> There are reasons to disallow ">", but I'm not convinced that parsing performance is one of them.
> 

I think maybe the best reason for disallowing it I've seen is where
attributes aren't correctly quoted:

<foo bar="foobar>

Which could potentially break everything. At the moment, most browsers
deal with this as a missing quote, but if > were allowed in the value, they
would have to include the content after the > in it.

Parsing-wise, I don't see it being any more difficult except for very
basic parsing methods, and any time difference should be negligible.
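
To make the comparison concrete, here is a rough Python sketch of the two
approaches discussed in this thread. It is only an illustration, not how any
browser does it: the function names are made up, and the quote-aware scanner
is deliberately simplified (it treats any " or ' inside the tag as opening a
value, and ignores comments, scripts and so on).

```python
def naive_tag_end(html, start):
    """Basic method: the tag ends at the first '>' after `start`."""
    return html.find(">", start)


def find_tag_end(html, start):
    """Quote-aware method: skip any '>' that sits inside a quoted value.

    Returns the index of the closing '>' or -1 if the tag never closes.
    Simplifying assumption: any quote character inside the tag opens a value.
    """
    quote = None  # None while outside a value, else the quote char that opened it
    for i in range(start + 1, len(html)):
        ch = html[i]
        if quote is not None:
            # Inside a quoted value: only the matching quote ends it.
            if ch == quote:
                quote = None
        elif ch in ('"', "'"):
            quote = ch
        elif ch == ">":
            return i
    return -1


valid = '<a title="x > y">link</a>'
broken = '<foo bar="foobar>text</foo>'

print(naive_tag_end(valid, 0), find_tag_end(valid, 0))
print(naive_tag_end(broken, 0), find_tag_end(broken, 0))
```

On the valid input the naive search stops at the > inside the title value,
while the quote-aware scan reaches the real end of the tag. On the
missing-quote input the quote-aware scan never finds an end at all, which is
the "break everything" case above. The extra work per character is one
comparison and one variable.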

Thanks,
Ash
http://www.ashleysheridan.co.uk
