[whatwg] Allowing ">" in attribute values

Fri Jun 25 06:52:05 PDT 2010

-----Original Message-----
From: Lachlan Hunt [mailto:lachlan.hunt at lachy.id.au]
Sent: Friday, June 25, 2010 14:18
To: Skrol29
Cc: 'WHAT Working Group'; bens at alum.mit.edu
Subject: Re: [whatwg] Allowing ">" in attribute values

On 2010-06-25 11:46, Skrol29 wrote:
>> I agree that disallowing ">" characters in attributes greatly simplifies
>> parsing. Not only with regular expressions, but with any parsing.
>> If ">" is allowed, then in order to find the end of the
>> element you have to read all of its attributes first. This is very costly.
>> Just one example, but there are many others: let's imagine you'd like to
>> convert an HTML document into flat text. To simplify your algorithm
>> you've chosen to retrieve the content of the <body> element and then
>> to delete all elements in it. This is very fast if ">" is not allowed
>> in attributes, because you are able to find element bounds just by
>> searching for "<" and then ">". But if ">"
>> is allowed, the operation gets much more complicated, and you spend
>> much more time scanning all elements.

> You seem to be conflating document conformance requirements with parsing requirements.
> Even if '>' was disallowed in attribute values for document conformance, parsers would still be
> required to handle it if it were present. If your parser doesn't handle it because it just assumes
> that '>' is the end of the tag name, then your parser is broken. Changing the parsing requirements
> such that '>' is treated as the end of a tag, in places where it's currently treated as part of an
> attribute value, would break backwards compatibility.

If the only purpose of HTML content were to be displayed by a browser, I would agree with you.
We have to consider that XML content, and therefore HTML content, serves many purposes in the world, including in industry. This is obvious for XML, but it is also true for HTML.
In industry, parsing the content is essential, whatever the method.
I understand that for browsers, tolerance of deviations from the specification must be high, in order to display something even if the webmaster has made mistakes in the source. You also don't mind having to parse all attributes, because nearly all of them have an impact for the browser.
On the other hand, in industry the tolerance of deviations from the spec is often very low, in order to build simple, fast and robust processes. There are also many parsing tasks that care about some elements and not about others. I can give examples if needed, but I think we can see that this is true.

Allowing ">" in attributes is a small gift of tolerance for webmasters, but it implies major complications for industry.
Disallowing ">" would serve the purpose of simplifying the grammar, as when XHTML disallowed uppercase in element and attribute names, or when XHTML disallowed attribute values without quotes.

PS: Of course my previous example is not a realistic one. That was a technical illustration of the issue.
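
To make that illustration concrete, here is a minimal Python sketch of the two approaches being discussed. It is not taken from any real tool: the function names strip_tags_naive and strip_tags_quote_aware and the sample string are hypothetical, and the quote-aware version is a deliberate simplification (it ignores comments, scripts and the other cases a full HTML parser handles). It only shows why a ">" inside a quoted attribute value breaks the "search for < then >" shortcut.

```python
def strip_tags_naive(html: str) -> str:
    """Delete everything from each '<' to the next '>'.

    Assumes '>' never appears inside an attribute value.
    """
    out = []
    pos = 0
    while True:
        start = html.find("<", pos)
        if start == -1:
            out.append(html[pos:])  # no more tags: keep the remaining text
            break
        out.append(html[pos:start])  # keep the text before the tag
        end = html.find(">", start)
        if end == -1:
            break  # unterminated tag: drop the rest
        pos = end + 1
    return "".join(out)


def strip_tags_quote_aware(html: str) -> str:
    """Same idea, but a '>' inside a quoted attribute value does not end the tag.

    Simplified: every character inside the tag has to be examined.
    """
    out = []
    in_tag = False
    quote = None  # quote character currently open inside a tag, if any
    for ch in html:
        if not in_tag:
            if ch == "<":
                in_tag = True
            else:
                out.append(ch)
        elif quote is not None:
            if ch == quote:
                quote = None  # closing quote of the attribute value
        elif ch in ('"', "'"):
            quote = ch  # opening quote of an attribute value
        elif ch == ">":
            in_tag = False  # real end of the tag
    return "".join(out)


sample = '<p title="a > b">Hello</p>'
print(strip_tags_naive(sample))        # prints: ' b">Hello' (without the outer quotes)
print(strip_tags_quote_aware(sample))  # prints: Hello
```

The first function can jump from "<" to ">" with two searches per tag. The second has to look at every character inside each tag, which is the extra cost I was referring to.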

Regards,
Skrol29