[html5] 4.01 vs XHTML

Ian Hickson ian at hixie.ch
Mon Oct 15 10:02:42 PDT 2012

This entire discussion is pretty much why we've dropped the version number
altogether. There is no longer HTML4 vs HTML5, there's just HTML. :-)

On Sun, 14 Oct 2012, David Osborne wrote:
> If I am writing in Html 4.01... is this really bad...??
> When I put in the declaration of which type of HTML - 4.01 transitional
> show how old my software is...?
> What disadvantages am I exposing myself to...
> XHTML - what is the difference between it and html 4.01??

The choice of XML vs HTML is mostly one of personal preference. If you 
pick XML (meaning, you label your documents with an XML MIME type like 
application/xhtml+xml) then you have a more intuitive, if somewhat verbose, 
syntax, with the downside that if you make a typo in your syntax, the page 
will show an error message to the user. If you pick HTML (meaning, you 
label your documents with the text/html MIME type), then you have a more 
esoteric syntax but browsers will always display something, even if you 
make a typo.
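
As a rough illustration of that trade-off, here is a minimal Python sketch. It assumes the standard library's xml.etree.ElementTree as a stand-in for a strict XML parser and html.parser as a stand-in for a tolerant one. Neither is what a browser uses, html.parser does not implement the HTML standard's parsing algorithm, and the BROKEN sample and helper names are made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical markup with a typo: <b> is closed after </p>, so tags are mis-nested.
BROKEN = "<p>Hello <b>world</p></b>"


class TagCollector(HTMLParser):
    """Records every start tag the tolerant parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse_as_xml(markup):
    """Return 'ok' if the markup is well-formed XML, else the error text."""
    try:
        ET.fromstring(markup)
        return "ok"
    except ET.ParseError as err:
        return f"error: {err}"


def parse_as_html(markup):
    """Return the start tags seen; mis-nesting does not raise an exception."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


xml_result = parse_as_xml(BROKEN)    # the strict parser rejects the document
html_result = parse_as_html(BROKEN)  # the tolerant parser still reports tags
print(xml_result)
print(html_result)
```

The XML path stops at the first well-formedness error, which is roughly what a user would see as an error page, while the tolerant path carries on and produces something.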

In practice, pretty much everyone uses text/html. Some older browsers 
don't support XML.

Note that the DOCTYPE is irrelevant here. Whether you include one or not, 
and whether the one you include says "HTML" or "XHTML" or anything else, 
has no effect. In text/html, the DOCTYPE only affects some minor issues 
known as "quirks mode", and in XML, the DOCTYPE only affects what named 
character references you can use (like &nbsp;). The DOCTYPE doesn't select 
XML vs HTML, that's only done by the MIME type.
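
To make the point concrete, here is a small Python sketch of that decision, assuming only the Content-Type header value is available. The function name syntax_for and the "other" bucket are invented for the example, and real browsers apply more rules than this. The thing to notice is that the document text, DOCTYPE included, is never an input.

```python
def syntax_for(content_type):
    """Return 'html', 'xml', or 'other' for a Content-Type header value.

    The document body (and therefore its DOCTYPE) is deliberately not a
    parameter: only the MIME type is consulted.
    """
    # Drop parameters such as "; charset=utf-8" and normalise case.
    essence = content_type.split(";", 1)[0].strip().lower()
    if essence == "text/html":
        return "html"
    # XML MIME types: text/xml, application/xml, or any subtype ending in +xml.
    if essence in ("text/xml", "application/xml") or essence.endswith("+xml"):
        return "xml"
    return "other"


print(syntax_for("text/html; charset=utf-8"))
print(syntax_for("application/xhtml+xml"))
```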

To repeat just so we're clear: if you send this as text/html, it's HTML, 
it is not XHTML, despite what it says:

   <?xml version="1.0" encoding="UTF-8"?>
   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   <html xmlns="http://www.w3.org/1999/xhtml">
    <head> <title> Demo </title> </head>
    <body> This is HTML. </body>

(It's actually non-conforming HTML, for various reasons we won't go into.)

Similarly, if you send this as application/xhtml+xml, then it's XML, not 
HTML, regardless of what it looks like:

   <html> <p> Hello </p> </html>

(It doesn't have a namespace declaration, so it's just non-namespaced XML, 
and contains no HTML elements, but that's also something for another day.)
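
A quick way to see the namespace point is to parse both forms as XML and look at the root element's name. This is a sketch using Python's xml.etree.ElementTree, which reports namespaced names as "{namespace}local". The helper root_namespace is made up for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def root_namespace(markup):
    """Return the namespace URI of the root element, or None if it has none."""
    root = ET.fromstring(markup)
    # ElementTree writes namespaced tags as "{uri}localname".
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None


# The example from the text: well-formed XML, but not in any namespace.
plain = root_namespace("<html> <p> Hello </p> </html>")

# The same markup with the XHTML namespace declared on the root.
namespaced = root_namespace(
    '<html xmlns="http://www.w3.org/1999/xhtml"> <p> Hello </p> </html>'
)

print(plain)
print(namespaced == XHTML_NS)
```

Only the second document contains elements in the XHTML namespace. The first is just XML whose root element happens to be called "html".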

The term "XHTML" just means "HTML elements in XML syntax".

On Mon, 15 Oct 2012, Jukka K. Korpela wrote:
> You can in practice use an HTML 4.01 doctype even if you use the new 
> features of HTML5, even though HTML5 tells that you should not do that. 
> Support to new features does not depend on the doctype declaration.

The spec does indeed say authors "should not" use legacy DOCTYPE strings, 
but it does list six such strings that authors can nonetheless use without 
conformance errors. They're only "should not" because they're unnecessarily 
long compared to "<!DOCTYPE HTML>", which does the same thing.


> The practical problem is in validation. If you use new HTML5 features 
> and some features declared obsolete in HTML5, you cannot get a clean 
> validator report without making your own validator, or at least your own 
> DTD. But validation is not obligatory; it's a tool, not an end.

Of course, there's a reason they're obsolete, so ignoring the validation 
errors when you use obsolete features is risky.

> > > XHTML - what is the difference between it and html 4.01??
> > 
> > There are slight syntactical and extensibility differences. If you 
> > don't need the extensibility of XHTML (which you probably can't quite 
> > leverage anyway), you should probably stick with HTML. The WHATWG is 
> > essentially deprecating XHTML for most purposes.

For the record, XHTML isn't deprecated. It's just not expected to be used 
by most people. The HTML standard is agnostic about text/html vs XML. (It 
used to be more opinionated, but people made convincing arguments that 
this was unwarranted.)

On Mon, 15 Oct 2012, Prof. T.D. Wilson wrote:
> I'm not at all clear why you would choose to use html 4.01 instead of 
> html5?  True, the latter is not yet fully confirmed as a standard

Actually, it is. :-) It is a living standard.

> but most new versions of browsers appear to be able to cope with it, so 
> why stick with 4.01?

In practice, browsers don't support HTML4 or HTML5, they just support 
"HTML", with new features being added as time goes on. There's no need to 
categorise documents as being one version of HTML or another.

> The main virtue of xhtml to my mind is the discipline it creates in the 
> use of tags - requiring end tags, for example, and in the formal nesting 
> of tags - if your site or page validates as xhtml then you can be pretty 
> sure it is going to be readable by anything - and of course, you can 
> retain that discipline in switching to html5, although my understanding 
> is that you do not need to do so.

Now that the HTML standard defines how to parse HTML, you can in fact be 
sure that a text/html page will be processed consistently even if you 
don't get its syntax right, for what it's worth. So it's the same as XML 
in this regard, except that it will never fail to parse.

> Some of html5 is clumsy for some purposes - but I guess that is always 
> going to be the case. For example, the syntax of <article> and <section> 
> will vary depending upon the nature of the page.

The syntax is the same, it's just the way it's used that can differ.

> The analogy often used is a paper that has sections composed of 
> articles, but an electronic scholarly journal is going to have articles 
> composed of sections - each article is the basic unit of the journal, 
> whereas in a newspaper the basic units are sections; if a strict syntax 
> was introduced that said, in effect, 'sections' can only be used within 
> 'articles', the journal editor would not be happy!  So the lack of 
> formal syntax here is actually an advantage, given the wide range of 
> uses to which html is going to be put.

This is content models, not syntax. But yes, you can have both articles in 
sections and sections in articles. (And indeed, articles in sections in 
articles in sections, e.g. comments on a particular section of an article 
in a particular part of a newspaper.)

> Another point to be wary about in html5 is that it is touted as having 
> 'semantic' tags - it doesn't.  'Semantic' has to do with representing 
> meaning (i.e., what the enclosed text etc. is *about*) and the so-called 
> semantic tags say nothing about the meaning of what they enclose but 
> only about the kind of information one might find there. So, 'footer' is 
> not semantic - it only tells you the location of the information, but 
> not what kind of information is there, because you can place in the 
> footer whatever you wish; similarly, the 'header' and 'nav' tags are 
> only location indicators - you could use the 'nav' tag, for example, to 
> put any kind of information in, rather than navigation pointers.

This is incorrect. It would be non-conforming to use <nav> for information 
other than "links to other pages or to parts within the page". The HTML 
standard does in fact define semantics for the elements. (The word used in 
the standard is "represents".)

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
