[whatwg] Allow trailing slash in always-empty HTML5 elements?

Thu Nov 30 19:22:28 PST 2006

On Tue, 28 Nov 2006, Sam Ruby wrote:
> 
> In HTML5, there are a number of elements with a content model of empty: area,
> base, br, col, command, embed, hr, img, link, meta, and param.
> 
> If HTML5 were changed so that these elements -- and these elements alone -- 
> permitted an optional trailing slash character, what percentage of the web
> would be parsed differently?

0%. Allowing or disallowing something is completely orthogonal to how it 
is parsed.

> The basis for my question is the observation that the web browsers that 
> I am familiar with apparently already operate in this fashion, this 
> usage seems to have crept into quite a number of diverse places

Browsers don't do any sort of conformance reporting for HTML parsing, so 
they can't actually be said to be allowing it or disallowing it. As far as 
parsing goes, all browsers, as well as the HTML5 parsing specification, 
handle bogus trailing / characters in tags by ignoring them.

> As a side benefit of this change, I believe that I could modify my weblog to
> be simultaneously both HTML5 and XHTML5 compliant

Since the namespace declaration is required in XML and disallowed in HTML, 
this is not possible. In addition, while you may be right that a tiny 
subset of XML might be equivalent to a tiny subset of HTML, it is not, and 
will never be, generally true that you can take an arbitrary HTML5 
document and treat it as XML. HTML5 has very detailed parsing rules (at 
least as detailed as XML, and arguably more detailed, since the HTML 
parsing rules define the tree you obtain from parsing, whereas XML parsing 
rules only state what a conformant document looks like and how to detect 
conformance errors, not how to turn a conformant document into a tree).
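
(To illustrate why the namespace declaration matters on the XML side, here is a 
minimal sketch using Python's standard xml.etree.ElementTree module. The 
helper name root_tag and the sample documents are made up for illustration; 
this is not spec text.)

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

def root_tag(document):
    """Parse a string as XML and return the root element's qualified tag."""
    return ET.fromstring(document).tag

# With the declaration, an XML parser puts the element in the XHTML namespace.
with_ns = root_tag('<html xmlns="%s"><body/></html>' % XHTML_NS)

# Without it, the element is in no namespace, so to an XML processor it is
# merely an element that happens to be called "html".
without_ns = root_tag("<html><body/></html>")

print(with_ns)
print(without_ns)
```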

I'm not sure I really understand the value of having a single common 
syntax subset, either. Now that there is an unambiguous way of parsing 
HTML, converting HTML to XML and back again in a lossless manner is easy. 
(Though not trivial -- there are things that can be represented in one 
syntax and not the other, like namespaces in XML, and the <noscript> 
element in HTML.)

Regarding your original suggestion: based on the arguments presented by 
the various people taking part in this discussion, I've now updated the 
specification to allow "/" characters at the end of void elements.

There were many e-mails on this thread. I have replied to the salient 
points below. Since much of the discussion focused not on specific HTML5 
proposals but on the pros and cons of XML, WordPress, and other 
technologies, I've not replied to all the e-mails. If you feel I have 
failed to reply to an e-mail that I should have replied to, please bring 
it to my attention.

On Wed, 29 Nov 2006, Benjamin Hawkes-Lewis wrote:
>
> I think having /two/ different serializations of Web Forms 2.0/Web 
> Applications 1.0 is bad enough. To try and cater to what's effectively a 
> third serialization compatible with both parsing methods is to reinvent 
> the "XHTML 1.0 as text/html" mess. Serializing to multiple formats from 
> a single source is, I think, a better model. Especially as embedded 
> content may need different treatment too.

I strongly agree with this.

On Wed, 29 Nov 2006, Sam Ruby wrote:
> 
> That was not the intent of my suggestion.  I am suggesting that HTML5 
> standardize on *one* format.  One that comes as close as humanly 
> possible to capturing the web as it is practiced in all of its glorious 
> and often quite messy detail.  Those that wish to serialize the DOM in 
> other formats are certainly free to do so, but those formats aren't 
> HTML5.

This is already what we have -- the Web Apps 1.0 specification defines a 
single format, HTML5, with its syntax rules and parsing rules (including 
error handling). Serialisation to other formats is allowed, but not 
formally described by the Web Apps 1.0 specification. Due to its high 
profile, the serialisation that uses the XML syntax is explicitly 
addressed in the specification, and termed "XHTML5".

> But before they do, this work group certainly can anticipate that 
> question. What is the cost of accepting trailing slashes on elements 
> which are always defined with a content model of empty, except when 
> found in "Attribute value (unquoted) state"?  What sites would be parsed 
> differently based on this change?  Are those differences in line with 
> how existing browsers actually behave, or at odds with this behavior?

Again, these questions seem to betray a misunderstanding as to how the 
specification works. Trailing slashes were always ignored, and this has 
not changed. The only change is in whether such slashes are reported as 
errors in the validator or not. Whether something is an error has no 
effect on how it is parsed.
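
(A rough sketch of the distinction, assuming a toy tokenizer that only handles 
attribute-less start tags. The function name, the error label, and the 
allow_slash_on_void flag are hypothetical; the element list is the one Sam 
quoted at the top of the thread. The point is only that the token is the same 
either way and just the error list changes.)

```python
# Elements with a content model of empty, as listed earlier in the thread.
VOID_ELEMENTS = {"area", "base", "br", "col", "command", "embed",
                 "hr", "img", "link", "meta", "param"}

def tokenize_simple_start_tag(source, allow_slash_on_void):
    """Tokenize an attribute-less start tag such as '<br>' or '<br />'.

    Returns (tag_name, errors). The tag name never depends on the
    conformance rule in force; only the reported errors do.
    """
    if not (source.startswith("<") and source.endswith(">")):
        raise ValueError("not a start tag: %r" % source)
    inner = source[1:-1].strip()
    has_slash = inner.endswith("/")
    # The slash is dropped in every case: this is the parsing behaviour.
    name = inner.rstrip("/").strip().lower()
    errors = []
    # Whether the slash is *reported* is a separate, conformance-level question.
    if has_slash and not (allow_slash_on_void and name in VOID_ELEMENTS):
        errors.append("trailing-slash")
    return name, errors

before = tokenize_simple_start_tag("<br />", allow_slash_on_void=False)
after = tokenize_simple_start_tag("<br />", allow_slash_on_void=True)
print(before, after)
```

Under both settings the tag name comes out as "br"; a validator built on the 
second setting simply stops complaining about it, while <div/> would still be 
flagged.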

On Wed, 29 Nov 2006, Robert Sayre wrote:
> On 11/29/06, Lachlan Hunt <lachlan.hunt at lachy.id.au> wrote:
> > 
> > I do not think it's a good idea to make the trailing slash conforming. 
> > Although it is harmless, it provides no additional benefit at all and 
> > it creates the false impression that the syntax actually does 
> > something.
> 
> It does do something, in systems that think they are using XML
> (whether they actually are is another matter).

No, it really doesn't; unless you mean it provides them peace of mind.

> It's possible it will prevent many information-free validation errors, 
> and give the HTML5 more credibility as a result. Warning people about 
> <img /> in the validator is a waste of their time.

I agree. This is the only real argument, I think.

On Wed, 29 Nov 2006, Stewart Brodie wrote:
> >
> > [it works interoperably]
>
> [no it doesn't]
>
> For example, here's a fragment of hotmail.com's signup page, served as 
> "text/html".  It's the only example I've come across to date:
> 
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0
>   Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml" dir="ltr">
> ...
> <select id="iRegion" name="pff00000000010004" />
>   <script>...</script>
> </select>
> ...
> 
> The script just document.write's loads of option tags (it's the country 
> menu).  It's hard to know what the author thought was going on.

I couldn't find this page, but are you saying browsers handle this case 
differently from each other? If not, then it's still interoperable.

On Wed, 29 Nov 2006, Henri Sivonen wrote:
> 
> I am against blurring the distinction between the XML serialization and 
> the HTML serialization. The infamous Appendix C didn't bring about good 
> things.

Strongly agreed.

On Wed, 29 Nov 2006, Steve Runyon wrote:
>
> To me, '</' or '/>' mean the tag's done.  Therefore, '<select 
> />...</select>' (or anything similar) is just plain wrong -- that would 
> be a select list with nothing in it, then some options that are hanging 
> out somewhere on their own, then an unmatched closing select.  This 
> shouldn't validate, serializers shouldn't allow it, and deserializers 
> should simply ignore the options and '</select>' (or maybe dump the 
> options' text to the output and just ignore the '</select>').

We have no choice as to how it is handled -- <select/> has to be handled 
like <select>. It is an error. Parsers will ignore the "/".

> Now this, '<img src="..." />' -- which is what I thought this discussion 
> was about initially -- is perfectly valid; it's nothing more than a tag 
> without content.

This is now true.

On Wed, 29 Nov 2006, Sam Ruby wrote:
> > 
> > The fact is that authors already try things like <div/>, <p/> and even 
> > <a/>. I've seen all of those examples in the wild.  See, for instance, 
> > the source of the XML 1.0 spec (and many others) which claim to be 
> > XHTML as text/html, littered with plenty of <a/> tags all throughout.
> 
> If these are common, and implemented interoperably, then what is the 
> harm?

Well, <div/></div> is treated exactly the same as <div></div> (the / is 
ignored). And <img/> is treated exactly the same as <img> (again, the / is 
ignored). So the harm is that it doesn't do anything, and this would 
confuse authors.

> An example of something that is NOT implemented interoperably is 
> <script src="..."/>.

As far as I can tell, <script/> is handled by all browsers the same way as 
<script>. How is it not interoperable?

> In my book, a document that states that it always is a parse error to do 
> something despite abundant evidence to the contrary is not as useful as 
> one that says here are the places where it works, and here are the 
> places where it does not.

People do this all the time:

   <b><p> ... </b>

...but that doesn't mean we should make it legal. Using data tables to 
achieve layout effects is another example.

> What percentage of pages use <img/> constructs?

About 50% have a trailing slash somewhere. It will be interesting to see 
what the statistics become with the modified tokenising requirements.

> Is there really any excuse for allowing "<b><i>OMG!</b></i>"?  No, but 
> HTML5 is willing to pinch its nose with thumb and forefinger and look 
> the other way. It literally is not a battle worth fighting.

The above is _not_ allowed. Parsing rules for it are defined, just like 
they have always been for trailing / characters, but it is _not_ allowed.

> I'd gladly put in a <!DOCTYPE html> in my page, the question is: would 
> the WHATWG be willing to meet me half way and allow xmlns attributes in 
> a very select and carefully prescribed set of locations?

This seems like a bad idea. If you have HTML, parse it as HTML. If you 
have XML, parse it as XML. Don't try to use an XML parser to parse HTML or 
vice versa. The syntaxes, although superficially similar to the extent 
that it is possible to make a single document that is parsable using 
either processor, are not similar enough to be treated equivalently.
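
(As a small illustration of how the two syntaxes differ, assuming Python's 
standard xml.etree.ElementTree as the XML processor: markup that is ordinary 
HTML can be a fatal well-formedness error in XML. The helper name 
is_well_formed_xml is made up for this sketch.)

```python
import xml.etree.ElementTree as ET

def is_well_formed_xml(text):
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# Fine for an XML parser: the br element is explicitly closed.
print(is_well_formed_xml("<p>one<br/>two</p>"))

# Fine as HTML, but an XML parser sees an unclosed br element and gives up.
print(is_well_formed_xml("<p>one<br>two</p>"))
```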

On Wed, 29 Nov 2006, Rimantas Liubertas wrote:
> 
> I don't think that page claiming to be authored as HTML4.01 should 
> validate if it contains <br />, etc. which, at least in theory, has 
> entirely different meaning.

A page authored as HTML4 will validate if it contains <br />, since that 
means something quite different, as you say, and what it means is valid.

In HTML5, and in the real world, <br /> is the same as <br>.

On Wed, 29 Nov 2006, Leons Petrazickis wrote:
> 
> The very idea of HTML5 is to not demand that the Web be scrapped and 
> rewritten. We need the people who have rewritten all their pages so that 
> they validate on the W3C validator -- they have the fire and the zeal 
> and the will to spread our format. We need to make the migration from 
> invalid XHTML to valid HTML5 very, very easy for them. We can't require 
> them to dig through PHP spaghetti. And that means that, no matter how 
> it's achieved, <br/> needs to be valid HTML5.

Fair enough.

On Wed, 29 Nov 2006, Sam Ruby wrote:
> 
> I am of the belief that that particular statistic is meaningless.  Even 
> if it were 15%, most aren't well formed.  Of those that are well formed, 
> most don't have the cojones to serve such documents with the appropriate 
> MIME type as they know that to do so would cause compliant UA to be 
> rather unforgiving.  And of the few insane enough to do so, it is rare 
> that the page in question is actually valid.

Yeah, application/xhtml+xml is about 0.0044% according to my studies. 
The error in my sample is probably at least that much.

> ... on the other hand, I am not of the belief that version numbers mean 
> what they are supposed to.  You will see HTTP 1.1 headers in HTTP 1.0 
> requests, RSS 2.0 elements in RSS 0.91 feeds, and HTML4 elements in 
> XHTML documents.

Indeed; that's one of the reasons HTML5 drops the HTML version number.

> My theory is that we live in a cut and paste world, one based on partial 
> understanding.  Few understand DOCTYPEs and xmlns attributes, mostly 
> people crib from something that works.

Too true.

> > In general, people don't migrate to new versions of HTML. They only 
> > use new versions for new documents. Which is fine, since HTML5 UAs are 
> > going to be backwards-compatible (by design).
> 
> Now we are getting to the real question:  backwards compatible with 
> what? Only with compliant documents (i.e., at most 22% of the web) or 
> with pages like the one at mozilla.org?

With the overwhelming majority of existing content, and with legacy 
browsers.

> > I don't really understand this argument. Those who use XHTML1 because 
> > it's "the latest thing", are as likely to use HTML5 because it's "the 
> > latest thing", regardless of how complex that is. After all, they made 
> > the transition to XHTML, why wouldn't they make the transition to 
> > HTML5?
> 
> More likely, those that chose XHTML1 because it was the latest thing are 
> now jaded by the promises made - and largely unkept - and will take a 
> pass on HTML5.
>
> Unless, of course, HTML5 compliance is simultaneously both more 
> meaningful and easier to achieve than XHTML1 compliance.

Both, hopefully, will be true.

> Drawing lines in the sand and maintaining that "<br />" is invalid is 
> only going to make more busy work for a lot of people.  If you try to 
> explain why this decision was made, most won't understand, and 
> eventually most will decide that compliance isn't worth the bother.

Fair point.

> However, drawing lines in the sand that "<p /> doesn't mean what you 
> think it means" will affect few, and the reason for that particular line 
> is both sound and educational.

Makes sense, especially given how UAs act with this markup.

> I'm impressed that you are keeping an open mind.

There wouldn't be much point having an open mailing list accepting 
feedback and so forth if one did not keep an open mind. :-)

On Wed, 29 Nov 2006, James Graham wrote:
> 
> I tentatively support the idea that trailing slashes on "singleton"[1] 
> elements should not be a parse error. I don't think it has any actual 
> technical merit but I think it will be helpful in getting developer 
> mindshare; a lot of people have drunk the "Zeldman Koolaid" and have the 
> ideas of XHTML, clean markup, CSS, and conformance to standards in 
> general all mushed together in their brain[2]. For these people (who I 
> think represent the upper quartile of web developers in terms of 
> commitment to good markup) the trailing slash in empty elements is the 
> syntax of a new generation - it is a symbol that represents everything 
> that has changed in web design since 1996 - as intrinsically useless as 
> a fashionable designer label but just as seductive.

Fair point.

> [1] I find that name quite confusing as it suggests there should only be 
> one in the entire document.

This has now changed to "void element" on Henri's suggestion.

On Thu, 30 Nov 2006, Hallvord R M Steen wrote:
> 
> FWIW, it sounds sane to me to align validation as much as possible with 
> the UA parsing in a way that issues that aren't really problems for the 
> UA aren't flagged as invalid.

Well, nothing is a "problem" for the UA really...

> Closing slash on void elements sounds like a good example of "this is 
> invalid because we're sticking to our fixed ideas"[1] rather than "this 
> is invalid for technical reasons like causing ambiguities in DOM 
> parsing". So I support Sam's approach.

By that argument, almost anything should be legal.

On Thu, 30 Nov 2006, Thomas Broyer wrote:
> 
> How about: a slash is ignored in the start tag of a void element if it
> appears just before the closing > and it unambiguously is not part of
> an attribute value.
> - <br/> => no attribute, ignored
> - <base href="http://example.org/bar"/> => after the closing quote, ignored
> - <base href=http://example.org/bar /> => preceded by a space, so its
> not part of the attribute value => ignored
> - <base href=http://example.org/bar/> => could be part of the
> attribute value, so treated as *being* part of it

That's basically what the spec says now.
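
(The four cases above can be sketched with a toy start-tag parser. This is an 
illustration under simplifying assumptions, not the spec's tokenizer: the 
function name parse_start_tag and its return shape are hypothetical, and it 
ignores any "/" that falls outside an attribute value, not just one directly 
before the closing ">".)

```python
import re

def parse_start_tag(tag):
    """Toy start-tag parser. Returns (name, attrs, slash_ignored)."""
    if not (tag.startswith("<") and tag.endswith(">")):
        raise ValueError("not a start tag: %r" % tag)
    body = tag[1:-1]
    match = re.match(r"[A-Za-z][A-Za-z0-9]*", body)
    if match is None:
        raise ValueError("missing tag name: %r" % tag)
    name = match.group(0).lower()
    rest = body[match.end():]
    attrs = {}
    slash_ignored = False
    i = 0
    while i < len(rest):
        ch = rest[i]
        if ch.isspace():
            i += 1
        elif ch == "/":
            # A "/" outside any attribute value is dropped.
            slash_ignored = True
            i += 1
        else:
            # Read the attribute name up to whitespace, "=" or "/".
            j = i
            while j < len(rest) and not rest[j].isspace() and rest[j] not in "=/":
                j += 1
            attr = rest[i:j].lower()
            value = ""
            if j < len(rest) and rest[j] == "=":
                j += 1
                if j < len(rest) and rest[j] in "\"'":
                    # Quoted value: runs to the matching quote.
                    quote = rest[j]
                    end = rest.find(quote, j + 1)
                    if end == -1:
                        end = len(rest)
                    value = rest[j + 1:end]
                    j = end + 1
                else:
                    # Unquoted value: runs to the next whitespace, so a
                    # "/" touching the value is part of the value.
                    k = j
                    while k < len(rest) and not rest[k].isspace():
                        k += 1
                    value = rest[j:k]
                    j = k
            attrs[attr] = value
            i = j
    return name, attrs, slash_ignored

for example in ('<br/>',
                '<base href="http://example.org/bar"/>',
                '<base href=http://example.org/bar />',
                '<base href=http://example.org/bar/>'):
    print(example, parse_start_tag(example))
```

Only the last example keeps the slash, as the final character of the href 
value; in the other three it is dropped.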

On Thu, 30 Nov 2006, Elliotte Harold wrote:
> > 
> > It's the core of the debate, namely if <img /> isn't technically a 
> > problem why are validators required to flag it as invalid? The counter 
> > examples are comparisons with <div /> which isn't parsed into the DOM 
> > most would expect when sent as HTML, and corner cases like
> > 
> > <base href=http://example.org/bar/>
> 
> That one's easy to fix. Just require quotes around attribute values like 
> HTML should have done 15 years ago.

How about the billions of documents that don't use quotes?

On Thu, 30 Nov 2006, Mike Schinkel wrote:
> 
> 1.) I read the FAQ http://blog.whatwg.org/faq/ and it seemed to imply 
> that HTML 5 and XHTML were not at odds with each other?  Did I misread 
> that, because from comments on this thread I get the impression that 
> might not be the case.

They're just different serialisations. One is for text/html, the other 
for XML. You can use one or the other, it basically only depends on 
whether you want to send it as text/html or not.

> 2.) A similar question, but is the goal for HTML5 and XHTML to slowly
> converge, or is the goal for them to diverge?

HTML5 and XHTML5 are the same language, they're just different ways of 
writing it.

> [various reasons why trailing slash is ok]

Good arguments, thanks.

On Thu, 30 Nov 2006, Michel Fortin wrote:
> 
> For me, accepting /> in HTML could be an acceptable solution. It sure is 
> a departure from what was accepted as HTML previously, but I see no 
> point in trying to convince everyone to change (again) their markup for 
> cosmetic reasons.
> 
> What is really important is that authors understand better that HTML 
> must be served as text/html and that XHTML must be served with an xml 
> media type. If the validator enforce that, then I think it'll be 
> sufficient.

Agreed.

On Thu, 30 Nov 2006, Elliotte Harold wrote:
> 
> Given that fact of the installed base, I cannot accept that it is wrong 
> to serve XHTML as text/html, and I'm afraid any effort that depends 
> critically on that happening is doomed.

I don't really understand your logic, but for what it's worth, sending 
XHTML5 as text/html is non-conformant. You must send it as an XML MIME 
type. Of course, if you _do_ send XHTML5 as text/html, then it'll be 
treated as any other HTML, and all the various errors (like it being sent 
with the wrong MIME type) will be handled using the graceful error 
recovery rules of HTML5.
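
(In other words, the MIME type picks the syntax, not the markup itself. A 
minimal sketch of that dispatch, assuming a hypothetical helper syntax_for 
and a deliberately short list of XML types; a real UA does considerably more 
than this.)

```python
def syntax_for(content_type):
    """Map a Content-Type header value to 'html', 'xml' or 'unknown'."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime == "text/html":
        # Anything sent as text/html goes through the HTML parser,
        # including markup the author thinks of as XHTML.
        return "html"
    if mime in ("application/xml", "text/xml") or mime.endswith("+xml"):
        return "xml"
    return "unknown"

print(syntax_for("text/html; charset=utf-8"))
print(syntax_for("application/xhtml+xml"))
```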

On Thu, 30 Nov 2006, Simon Pieters wrote:
> 
> So now I'm starting to think that trailing slashes for void elements 
> should be allowed in HTML5.

Apparently this is the majority opinion.

Thanks everyone!

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'