[whatwg] several messages about HTML5

Ian Hickson ian at hixie.ch
Tue Feb 20 16:05:47 PST 2007

On Tue, 20 Feb 2007, Vlad Alexander (xhtml.com) wrote:
> 4. One of the biggest problems with HTML is that content authors can get 
> away with writing "tag soup".

Is it really a problem? Or is it the reason the Web is so wildly 
successful? Would the Web have taken off in the same way if it worked like 
most other systems, showing error messages whenever something was the 
least bit wrong?

> As a result, most content authors don't feel the need to write markup to 
> specification. When markup is not written to specification, CSS may not 
> get applied correctly, JavaScript may not execute and some user-agents 
> may not be able to process content as the author intended.

Having draconian error handling -- the term we use for just not allowing 
errors instead of having silent error recovery like HTML does -- is not 
the only solution for getting consistent behaviour between browsers. The 
approach that we have taken with HTML5 is to define what _any_ document 
means, even if it is invalid -- down to the last detail, so that every 
browser will handle every document in an equivalent way, whether the 
document is conformant or not. (It's the same technique CSS uses.)
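
To make the contrast concrete, here is a minimal sketch in Python. It is
only an illustration, not anything taken from the HTML5 spec: the sample
markup and the helper names (draconian_parse, lenient_text) are made up,
and Python's html.parser is just a tokenizer that does not implement the
HTML5 parsing algorithm. A strict XML parser stands in for draconian error
handling, and the tokenizer stands in for silent recovery.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: the <b> and <i> elements are mis-nested.
SAMPLE = "<p><b>bold <i>both</b> italic</i></p>"


def draconian_parse(markup):
    """Return the root tag, or None if the XML parser rejects the input."""
    try:
        return ET.fromstring(markup).tag
    except ET.ParseError:
        # Draconian handling: one error and the whole document is refused.
        return None


class TextCollector(HTMLParser):
    """Collects character data without ever rejecting the input."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def lenient_text(markup):
    """Return the text content, whatever the state of the markup."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


print(draconian_parse(SAMPLE))  # the strict parser gives up
print(lenient_text(SAMPLE))     # the lenient one still yields the text
```

The lenient path alone is not enough for interoperability, of course: it
only helps if every implementation recovers in the same way, which is what
defining the handling of every document is meant to guarantee.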

> Why not put an end to "tag soup" by requiring user-agents to only accept 
> markup written to specification?

There are literally dozens if not hundreds of billions of documents 
already on the Web. A study of a sample of several billion of those 
documents with a test implementation of the HTML5 Parser specification 
that I did at Google put a very conservative estimate of the fraction of 
those pages with markup errors at more than 78%. When I tweaked it a bit 
to look at a few more errors, the number was 93%. And those are only core 
syntax errors -- it didn't count misuse of HTML, like putting a <p> 
element inside an <ol> element.

If we required browsers to refuse those documents, then you couldn't 
browse over 90% of the Web.

But consider: if one browser showed error messages on half the Web, and 
another browser showed no errors and instead showed the Web roughly as the 
author intended, which browser would the average person use?

If we want to make HTML5 successful, we have to make sure the browser 
vendors pay attention to it. Any requirements that make their market share 
go down relative to browsers that aren't following the spec will 
immediately be ignored.

> 5. X/HTML 5 has a construct for adding additional semantics to existing 
> elements using predefined class names. Predefined class names could be 
> the most controversial part of X/HTML 5, because the implementation 
> overloads the class attribute. XHTML 2 provides similar functionality 
> using the role attribute. Which approach is better and why?

The proposal to have predefined class names is still very much in the air; 
we're mostly waiting for author and implementation feedback to see if it 
is workable. Currently the HTML5 spec leaves a number of things unanswered 
(like what happens if two classes on an element are contradictory), so 
it's definitely not finished.

I haven't been able to work out what the role="" attribute does or how it 
is supposed to be implemented. It has a spec, but that spec is really 
unclear. So I can't really comment on it.

> 6. The font element is a terrible construct, primarily because content 
> creators using authoring tools use the font element instead of semantic 
> markup. The X/HTML 5 spec supports the font element when content is 
> authored using WYSIWYG editors. What is the rationale for this? Why 
> would WYSIWYG editors get an exemption? And is this exemption going to 
> make the Web less accessible?

Again, the whole <font> element and WYSIWYG section is up in the air. The 
current text is just a strawman, and we've received lots of good feedback 
on it, which will need to be taken into consideration.

The main reason that WYSIWYG editors would be given an exception is that 
the state of the art in user interface today doesn't have a good solution 
for making a semantic editor usable by the average person. We could 
require editors to do this, but since nobody knows how to do it, it would 
be a stupid requirement. Again, we have to compromise on perfection to 
actually address real-world needs and constraints.

> 7. The XHTML 5 spec says that "generally speaking, authors are 
> discouraged from trying to use XML on the Web". Why write an XML spec 
> like XHTML 5 and then discourage authors from using it? Why not just 
> drop support for XML (XHTML 5)?

Some people are going to use XML with HTML5 whatever we do. It's a simple 
thing to do -- XML is a metalanguage for describing tree structures, HTML5 
is a tree structure, it's obvious that XML can be used to describe HTML5. 
The problem is that if we don't specify it, then everyone who thinks it is 
obvious and goes ahead and does it will do it in a slightly different way, 
and we'll have an interoperability nightmare. So instead we bite the 
bullet and define how it must work if people do it.

> 8. The chair of the HTML Working Group at W3C, Steven Pemberton, said 
> "HTML is a mess!" and "rather than being designed, HTML just grew, by 
> different people just adding stuff to it".

He's right! This continues to this day. We're trying to bring some level 
of sanity to the process, though!

> Since HTML is poorly designed, why is it worth preserving? Or is HTML 
> fixable? If so, how does X/HTML 5 fix it?

The original reason I got involved in this work is that I realised that 
the human race has written literally billions of electronic documents, but 
without ever actually saying how they should be processed. If, in a 
thousand years, someone found a trove of HTML documents and decided they
would write an HTML browser to view them, they couldn't do it! Even with 
the existing HTML specs -- HTML4, SGML, DOM2 HTML, etc -- a perfect 
implementation couldn't render the vast majority of documents as they were 
originally intended.

Every major Web browser vendor today spends at least half the resources 
allocated to developing the browser on reverse-engineering the _other_ 
browsers to work out how things should work. For example, if you have:

   <p> <b> Hello <i> Cruel </b> World! </i> </p>

...and then you run some JavaScript on that to show what elements there 
are and what their parents are, what should happen? It turns out that, 
before HTML5, there were no specs that defined this!
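
A small Python sketch shows where the ambiguity comes from, using markup of
the same shape as the example above. The NestingChecker class is
hypothetical, and html.parser only reports tags in source order; it builds
no tree. The sketch therefore shows only the point at which a parser has to
make a decision, not what HTML5 says the resulting tree should be.

```python
from html.parser import HTMLParser

MARKUP = "<p> <b> Hello <i> Cruel </b> World! </i> </p>"


class NestingChecker(HTMLParser):
    """Tracks open elements and records end tags that arrive out of order."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open element names
        self.misnested = []   # (end tag, open elements at that moment)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            # Well-nested: the end tag matches the innermost open element.
            self.open_tags.pop()
            return
        # Out of order: record it together with a snapshot of the stack.
        self.misnested.append((tag, list(self.open_tags)))
        if tag in self.open_tags:
            # Drop the most recently opened element with this name.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]


checker = NestingChecker()
checker.feed(MARKUP)
checker.close()
for tag, stack in checker.misnested:
    print(f"</{tag}> arrived while the open elements were {stack}")
```

When </b> arrives, <i> is still open inside it. Whether the text after that
point ends up inside a <b>, an <i>, both, or neither is exactly the kind of
question that each browser had to answer for itself.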

I decided that for the sake of our future generations we should document 
exactly how to process today's documents, so that when they look back, 
they can still reimplement HTML browsers and get our data back, even if 
they no longer have access to Microsoft Internet Explorer's source code. 
(Since IE is proprietary, it is unlikely that its source code will survive 
far into the future. We may have more luck with some of the other 
browsers, whose sources are more or less freely available.)

Once I got into actually documenting HTML for the future, I came to see 
that the effort could also have more immediate benefits: for example, 
today's browser vendors would be able to use one spec instead of 
reverse-engineering each other, and we could add new features for authors.

> 9. Supporters of X/HTML 5 call XHTML 2 radical. History has shown us 
> that radical technological change is often controversial, but in the end 
> is the best choice. For example, in the last 40 years, the technology 
> for delivering music has changed radically, from vinyl, to cassette, to 
> CD, to purely digital. Why should the Web shy away from a radical 
> technological change?

If we look at music we see several key factors. The move from vinyl to 
cassette came with radical new benefits like shock resistance and the 
read-write nature of the media. Also, notice how cassette tape didn't 
replace vinyl; they co-existed with different audiences for a long time. 
The introduction of digital optical media (CDs) also introduced radical 
new benefits: significantly higher quality and loss-less reproduction. CDs 
replaced Vinyl, because they did the same thing, but radically better. 
However, tapes continued to be used for years, since CD-RW was not widely 
available straight away. We're now seeing a move from CD collections to 
audio stored on digital magnetic media (like iPods), but if you look 
carefully you'll see that a very clear migration path exists from Audio 
CDs and tapes to portable media players: CDs can be ripped and stored on 
the new media, and the new media can play to tape adapters that plug into 
car stereos like old cassettes.

So we see that successful radical change requires two key features:

 * Radical new benefits to offset the pain of change
 * Backwards-compatibility with the old technology

I'm sure there will come a day when new technologies exist which do 
indeed shift the Web radically to new worlds, but I don't think we've seen 
them yet.

In a way, though, HTML5 itself is a radical change. Not for what it 
specifies, but for the way it's been created. It's the first really open, 
collaborative community that has taken on a task of this magnitude 
(nothing less than rewriting the book on the core Web language) -- 
hopefully, future technologies will also follow this model, instead of the 
more traditional behind-closed-doors, corporate-sponsored approach that 
most other technologies have used.

> 10. In the minds of most people, HTML is dead and X/HTML 5 is perceived 
> as an attempt to resurrect it. Given this perception, how can you 
> succeed in marketing HTML to consumers (those who build Web sites)?

According to some other research we did for the HTML5 effort, over 95% of 
the Web today is HTML, with the rest being mostly a smattering of PDF, 
Word, and plain text. Under what definition of the word could it be 
considered "dead"? Web designers never stopped writing HTML pages. I don't 
think we'll have any difficulty convincing them to continue doing so.

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
