[whatwg] Allowing ">" in attribute values

Ian Hickson ian at hixie.ch
Mon Aug 9 17:34:19 PDT 2010


On Wed, 23 Jun 2010, Benjamin M. Schwartz wrote:
>
> The HTML5 spec appears to allow ">" inside an attribute value.  For 
> example, the following page (note the body tag) passes the experimental 
> HTML5 validator at w3c.org:
> 
> <!DOCTYPE HTML><html><head><title></title></head>
> <body class="3>2">
> </body></html>
> 
> I think ">" should be disallowed inside attribute values.

What problem would this solve?


> It is disallowed in XHTML

(I assume you mean "<" is disallowed in XHTML.)

Convergence with XML syntax rules is not a goal. Being a superset of those 
rules where possible is a minor secondary goal, but that is achieved 
either way in this case.


> It is disallowed in HTML 4.01

Convergence with HTML4 is not a goal.


> Disallowing it in HTML5 would avoid unnecessary divergence, and also 
> sometimes simplify parsing.

It wouldn't affect parsing.


A goal of HTML5 is to make the language have no useless restrictions. This 
argues for enabling people to put characters like ">" in attributes.



On Thu, 24 Jun 2010, Kornel Lesinski wrote:
> 
> I see another argument against allowing ">" in attributes: it helps to 
> catch unclosed attributes early:
> 
> <a href="foo>

This kind of error in practice gets caught soon enough anyway.


On Thu, 24 Jun 2010, Benjamin M. Schwartz wrote:
> 
> Worldwide, regarding HTML, I'm sure there is 100 times more regular 
> expression processing code than full-on lexing code.  Most code that 
> processes HTML is embedded in scripts, doing some small special-purpose 
> operation.  Those regular expressions aren't going away.  Helping them 
> break less is a noble cause.

On the contrary, the more they break the less likely it is they will 
continue to be used. In practice, there is no way to use regular 
expressions reliably with HTML. We shouldn't encourage it.
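
To illustrate the failure mode, here is a minimal sketch in Python. The 
pattern `<body[^>]*>` is an assumed stand-in for the kind of expression such 
scripts tend to use, not one taken from any particular tool; the document is 
the example from the start of this thread.

```python
import re

# The example document quoted at the top of the thread.
DOC = ('<!DOCTYPE HTML><html><head><title></title></head>\n'
       '<body class="3>2">\n'
       '</body></html>')

# Hypothetical pattern: "everything from <body up to the next >".
NAIVE_BODY = re.compile(r"<body[^>]*>")


def naive_body_tag(html):
    """Return what the naive pattern takes to be the <body> start tag."""
    match = NAIVE_BODY.search(html)
    return match.group(0) if match else None


print(naive_body_tag(DOC))  # prints <body class="3>
```

The match stops at the ">" inside the quoted value, so anything inserted 
"after the body tag" would land in the middle of the class attribute.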


On Fri, 25 Jun 2010, Lachlan Hunt wrote:
> 
> You seem to be conflating document conformance requirements with parsing 
> requirements.  Even if '>' was disallowed in attribute values for 
> document conformance, parsers would still be required to handle it if it 
> were present. If your parser doesn't handle it because it just assumes 
> that '>' is the end of the tag name, then your parser is broken. Changing 
> the parsing requirements such that '>' is treated as the end of a tag, 
> in places where it's currently treated as part of an attribute value, 
> would break backwards compatibility.

Indeed.


On Fri, 25 Jun 2010, Benjamin M. Schwartz wrote:
> 
> That's more or less how I feel.  The spec places requirements on how 
> "user agents, data mining tools, and conformance checkers" must handle 
> non-conforming input, but there are many other things in the world that 
> process HTML.  In other applications, it may be acceptable to have 
> undefined behavior on non-conforming input, like in ISO C.

I don't think it's ever acceptable to have undefined behaviour on issues 
as critical as how to parse a document, valid or not.


> HTML5 has a very clear specification of conformance, and a validator is 
> widely available.  If I build a tool that guarantees correct behavior 
> only on conforming inputs, then users can easily check their documents 
> for conformance before using my tool.  If my tool has additional 
> restrictions, then I need to write my own validator, and answer a lot of 
> questions.

I recommend just using a conforming HTML parser.


> I was inspired to suggest this restriction after using mod_layout for 
> Apache, which inserts a banner at the top of a page.  It works by doing 
> a wildcard search for "<body*>".  There are a number of obvious ways to 
> break this [1]; one of them is by having ">" in an attribute value.  
> I'm sure there are many thousands of such programs around the world.

They should be fixed. :-)
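
As a rough sketch of what such a fix could look like, the following uses 
Python's standard-library html.parser to find where the <body> start tag 
ends, then inserts a banner there. The names BodyTagLocator and 
insert_banner are hypothetical, and this is not how mod_layout works. Note 
also that html.parser is a tolerant tokenizer rather than a full 
implementation of the HTML5 parsing algorithm, so a conforming parser would 
still be the better choice.

```python
from html.parser import HTMLParser


class BodyTagLocator(HTMLParser):
    """Record the index just past the first <body> start tag."""

    def __init__(self, text):
        super().__init__()
        self.text = text
        self.body_end = None
        self.feed(text)
        self.close()

    def handle_starttag(self, tag, attrs):
        if tag == "body" and self.body_end is None:
            # getpos() reports the (line, column) of the tag's "<".
            lineno, offset = self.getpos()
            # Convert to an absolute index; the parser counts "\n" only.
            preceding = self.text.split("\n")[:lineno - 1]
            start = sum(len(line) + 1 for line in preceding) + offset
            # get_starttag_text() is the raw tag, including quoted ">".
            self.body_end = start + len(self.get_starttag_text())


def insert_banner(html, banner):
    """Insert banner right after the <body> start tag, if there is one."""
    end = BodyTagLocator(html).body_end
    if end is None:
        return html
    return html[:end] + banner + html[end:]


page = '<html>\n<body class="3>2">\n</body></html>'
print(insert_banner(page, "<p>banner</p>"))
```

Because the tokenizer tracks quoting, the ">" inside class="3>2" is treated 
as part of the attribute value and the banner lands after the real end of 
the tag.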


> It sounds like most experts here would prefer to allow ">" in attribute 
> values in conforming documents, and that's fine.  I don't fully 
> understand the advantage, but I won't argue against consensus.

Expert opinions and consensus aren't the law of the land here; it's use 
cases, arguments, and most importantly data that count. See Philip's 
comment at the very bottom of this e-mail.


> [1] A javascript line like "width<bodywidth && height>bodyheight" would 
> also break it, as would an appropriately constructed comment.  It might 
> be possible to construct a regexp for this that functions correctly on 
> all conformant HTML5 documents.  Such a regexp would be considerably 
> simpler if ">" were disallowed in attribute values.

Regular expressions are the wrong tool for parsing HTML. HTML isn't 
regular.


On Tue, 29 Jun 2010, Skrol29 wrote:
> 
> Replacing ">" with "&gt;" is already a good practice in XML and HTML. 

Why? ">" doesn't mean anything special where it could be confused for text 
except in unquoted attribute values, and good practice there is to quote 
attribute values whose values are free-form text.
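
A small sketch of the difference, again using Python's html.parser as an 
assumed stand-in for a real HTML parser (AttrCollector and start_tags are 
hypothetical helper names):

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect (tag, attributes) for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


def start_tags(html):
    parser = AttrCollector()
    parser.feed(html)
    parser.close()
    return parser.seen


# Quoted value: ">" is ordinary text inside the attribute.
print(start_tags('<p title="a>b">'))  # [('p', [('title', 'a>b')])]

# Unquoted value: the first ">" ends the tag, so the value is just "a".
print(start_tags('<p title=a>b>'))    # [('p', [('title', 'a')])]
```

In the unquoted case the trailing "b>" becomes text content of the element, 
which is why quoting free-form values is the safer habit.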


> Some HTML attributes already forbid it (it is allowed in CDATA 
> attributes, forbidden in %Text attributes).

This doesn't apply anymore.


> Since XML 2 has been stopped, I think it is an occasion for HTML to 
> replace the good practice with a new restriction

I disagree with the premise of this statement, as noted above (it's not a 
good practice), so it doesn't make sense to add a restriction.


> and at the same time lighten parsing processes which are not browser 
> related.

Changing the syntax requirements has no effect on the parsing 
requirements.


> Why change the HTML spec instead of adding a restriction when we want 
> ">" to be forbidden? Because I think we should all want ">" to be 
> forbidden.

I don't think we do. :-)


> It is already quite deprecated to use it directly in HTML attribute 
> values. We can always use "&gt;" instead of ">" as we already use 
> "&lt;" instead of "<".

On the contrary, we've only just added an attribute where HTML in 
attribute values is the whole point (srcdoc="").


On Fri, 25 Jun 2010, Philip Taylor wrote:
> On Thu, Jun 24, 2010 at 2:34 PM, Benjamin M. Schwartz wrote:
> > [...]
> > HTML5 is about making a spec that matches common practice, right?  In
> > practice, no one puts ">" in attribute values.
> 
> The data disagrees: http://philip.html5.org/data/gt-in-attribute.txt

That's the most convincing argument.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

