<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//EN">

<HTML>

<HEAD>

  <META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=UTF-8">

  <META NAME="GENERATOR" CONTENT="GtkHTML/3.26.3">

</HEAD>

<BODY>

On Fri, 2010-06-25 at 13:28 +0100, Kornel Lesinski wrote:

<BLOCKQUOTE TYPE=CITE>

<PRE>

> I agree that disallowing ">" chars in attributes greatly simplifies parsing. Not

> only with regular expressions, but any parsing.

> If ">" is allowed, it means that in order to find the end of the element

> you have to read all of its attributes first. This is very costly.

You just need two extra states in the parser (toggled on " or '). I wouldn't call that "very costly".

> Just an

> example but there are many others: let's imagine you'd like to convert an HTML

> document into flat text. To simplify your algorithm you've chosen to

> retrieve the content of the &lt;body&gt; element and then to delete all elements

> in it. This is very fast if ">" are not allowed in attributes because you're

> able to find element bounds just by searching "&lt;" and then ">".  But if ">"

> are allowed, the operation gets much more complicated, and you spend much

> more time to scan all elements.

Conversion of HTML to text is more complicated than that - e.g. you shouldn't turn foo&lt;br&gt;bar into foobar, but you have to keep foo&lt;b&gt;bar as foobar. Implied &lt;body&gt; is allowed, you should extract &lt;img alt&gt;, you have to decode entities, etc. I think a check for a single character is just a drop in the ocean in such code.

And if you're not concerned about accuracy of conversion, you can ignore the fact that ">" is allowed too. It's just going to be yet another tradeoff among many other, much bigger ones.

>> Also take into consideration that even if ">" was forbidden in the spec,

> it wouldn't mean it doesn't happen in

>> the wild. Since it works in browsers, you'd still have to support it if

> you wanted to parse markup from the web. 

> 

> Allowing it in the spec and how the browser should behave if it occurs anyway

> are two different things.

If you're parsing markup from the web, you have to support invalid markup that browsers accept, not merely pure markup that spec allows.

There are reasons to disallow ">", but I'm not convinced that parsing performance is one of them.

</PRE>

</BLOCKQUOTE>

<BR>

I think the best reason I've seen for disallowing it is the case where attributes aren't correctly quoted:<BR>

<BR>

&lt;foo bar="foobar&gt;<BR>

<BR>

This could potentially break everything. At the moment, most browsers treat this as a missing quote, but if > is allowed in the value, they would have to treat the content after the > as part of the attribute value.<BR>

<BR>

Parsing-wise, I don't see it being any more difficult except for very basic parsing methods, and any time difference should be negligible.<BR>

<BR>
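
As a rough illustration of the "two extra states" mentioned above, here is a minimal Python sketch. It is not taken from any real parser: the name find_tag_end is made up, it assumes the scan starts just after the tag's opening angle bracket, and it ignores details such as comments and quote characters that appear outside attribute values.<BR>

<BR>

<PRE>
```python
def find_tag_end(markup, start=0):
    """Return the index of the '>' that closes the tag, or -1 if there is none.

    `start` is assumed to be the index just after the opening angle bracket.
    """
    quote = None  # None: outside a quoted value; otherwise the quote char that opened it
    for i in range(start, len(markup)):
        ch = markup[i]
        if quote is None:
            if ch == '"' or ch == "'":
                quote = ch    # enter one of the two quoted states
            elif ch == ">":
                return i      # an unquoted '>' ends the tag
        elif ch == quote:
            quote = None      # the matching quote closes the value
    return -1                 # ran out of input, e.g. a missing closing quote


print(find_tag_end('foo bar="a>b">text'))  # 13
print(find_tag_end('foo bar="foobar>'))    # -1
```
</PRE>

The second call returns -1 because the unclosed quote swallows the rest of the input, which is the missing-quote problem described above.<BR>

<BR>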

<TABLE CELLSPACING="0" CELLPADDING="0" WIDTH="100%">

<TR>

<TD>

Thanks,<BR>

Ash<BR>

<A HREF="http://www.ashleysheridan.co.uk">http://www.ashleysheridan.co.uk</A><BR>

<BR>

<BR>

</TD>

</TR>

</TABLE>

</BODY>

</HTML>