[whatwg] Parsing: < in unquoted attribute values

Tue Jun 19 16:33:35 PDT 2007

On Wed, 25 Apr 2007, Simon Pieters wrote:
>
> The parsing section says that < in an unquoted attribute value 
> terminates the tag. However, according to my testing[1], IE7, Gecko, 
> Opera and Webkit don't do this -- they append the < to the attribute 
> value. So I think the parsing section is wrong here.

This was fixed recently.
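
As a rough illustration of the behaviour described above, here is a minimal 
sketch in Python. The function name and the whitespace set are my own 
assumptions, and it ignores character references, so it is not the spec's 
state machine. It only shows that "<" is appended to the value rather than 
ending the tag:

```python
# Characters assumed to count as whitespace in this sketch.
WHITESPACE = " \t\n\f\r"


def read_unquoted_value(text, pos):
    """Read an unquoted attribute value starting at text[pos].

    Returns (value, next_pos). Only whitespace and '>' end the value;
    '<' is treated as an ordinary character and stays in the value.
    """
    start = pos
    while pos < len(text) and text[pos] not in WHITESPACE and text[pos] != ">":
        pos += 1
    return text[start:pos], pos


value, end = read_unquoted_value("a<b class=x>", 0)
print(value)  # a<b
```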

> Additionally, the syntax section says that authors are not allowed to 
> use < in unquoted attribute values, which should probably be changed if 
> the parsing section is changed.

Oops, forgot to fix that last time. Fixed now.

On Wed, 25 Apr 2007, Anne van Kesteren wrote:
> 
> IE also lets < be an attribute. It can also be part of an attribute or 
> element name. This means that:
> 
>  <p</p>test
> 
> will become a 'p<' element with a 'p' attribute which has 'test' as 
> textContent. This basically means fewer exceptions in the tokenizer for 
> the '<' character which would be fine with me.

HTML5 requires this now.
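
To make the example concrete, here is a simplified sketch in Python of 
scanning a start tag when "<" gets no special treatment inside the tag. The 
function name is hypothetical, attribute values are not handled, and 
treating "/" like whitespace is an assumption chosen to reproduce the 
result described in the quote:

```python
# Characters assumed to separate names inside a tag in this sketch.
WHITESPACE = " \t\n\f\r"


def tokenize_start_tag(text):
    """Split '<name attr ...>rest' into (tag_name, attribute_names, rest).

    Simplified: no attribute values, and '/' is treated like whitespace
    (an assumption of this sketch). '<' is an ordinary name character.
    """
    if not text.startswith("<"):
        raise ValueError("expected '<'")
    end = text.find(">")
    if end == -1:
        raise ValueError("unterminated tag")
    inside = text[1:end]
    for ch in WHITESPACE + "/":
        inside = inside.replace(ch, " ")
    parts = inside.split()
    if not parts:
        raise ValueError("empty tag")
    return parts[0], parts[1:], text[end + 1:]


print(tokenize_start_tag("<p</p>test"))  # ('p<', ['p'], 'test')
```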

On Wed, 25 Apr 2007, Anne van Kesteren wrote:
> 
> As I just mentioned on IRC, this essentially means removing the SHORTTAG 
> TAGC OMISSION feature of SGML, which appears not to be supported by Internet 
> Explorer, Opera and maybe Safari.

Indeed.

On Wed, 25 Apr 2007, Jonas Sicking wrote:
> >
> >   <p</p>test
> > 
> > will become a 'p<' element with a 'p' attribute which has 'test' as 
> > textContent. This basically means fewer exceptions in the tokenizer for 
> > the '<' character which would be fine with me.
> 
> We no longer support this in Mozilla (if we ever did). One reason we 
> now explicitly forbid it is that we don't want it ever to be possible to 
> create elements with 'illegal' names. The same goes for attribute 
> names. This is partly for security reasons, since some elements and 
> attributes carry very important security information.

On Thu, 26 Apr 2007, Anne van Kesteren wrote:
> 
> Could you elaborate on the security issues? Could you also give a definition
> of "illegal names" as it's not really clear to me what that means for HTML.

On Fri, 27 Apr 2007, Jonas Sicking wrote:
> 
> Basically, for <input< type=file value="/etc/passwd">, if part of the 
> code thinks that this is an "input<" element, whereas other parts 
> think that it is an "input" element, you might end up in a situation 
> where the browser sends the /etc/passwd file to the server without user 
> interaction.

That seems a bit specious given that for type=file you'd have to ignore 
value="" anyway. Furthermore, making the "<" be _not_ part of the tag name 
is what causes the security issue, as it's only when you _don't_ put it in 
the tag name that you end up with an <input> element.

Anyway, that's the advantage of having a single, well-defined tokeniser: 
you don't have to worry about differences in opinion. :-)
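
For illustration only, here is a sketch in Python of the kind of 
disagreement in question. Both scanners and their names are hypothetical 
and not taken from any browser. Two pieces of code that delimit the tag 
name differently give different answers for the same markup, which cannot 
happen when everything goes through one tokeniser:

```python
# Characters that end a tag name in both hypothetical scanners.
DELIMS = " \t\n\f\r/>"


def tag_name_with_lt(tag):
    """Scan the tag name, treating '<' as an ordinary name character."""
    i = 1
    while i < len(tag) and tag[i] not in DELIMS:
        i += 1
    return tag[1:i]


def tag_name_without_lt(tag):
    """Scan the tag name, but also stop at '<' (the second opinion)."""
    i = 1
    while i < len(tag) and tag[i] not in DELIMS + "<":
        i += 1
    return tag[1:i]


markup = '<input< type=file value="/etc/passwd">'
print(tag_name_with_lt(markup))     # input<
print(tag_name_without_lt(markup))  # input
```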

> It also seems like a bad idea to allow a document to be parsed in such a 
> way that it cannot be serialized without creating an invalid HTML5 
> serialization.

We are well past that point. Example:

   <p bogus="">

...can be parsed but can't be serialised legally.

> As far as element names go, I don't really see a reason to allow more, 
> or fewer, characters than the XML spec lets you use.

The main reason is that you have to define what happens to the characters 
you don't allow. We don't have the option of fatal failure.
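
By way of contrast, an XML parser does have that option. As a small 
sketch, Python's standard xml.etree.ElementTree rejects the earlier example 
outright instead of deciding what the stray "<" should turn into. The 
helper name parses_as_xml is my own:

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if markup is well-formed XML, False on a fatal error."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(parses_as_xml("<p>test</p>"))  # True
print(parses_as_xml("<p</p>test"))   # False
```

An HTML tokeniser has to produce some token stream for the second input, 
so the handling of every such character has to be written down.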

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'