[whatwg] Distinguishing XML and HTML by content sniffing

Sun Mar 4 04:19:53 PST 2007

On Sun, 04 Mar 2007 07:33:51 +0100, Michael Day <mikeday at yeslogic.com>  
wrote:

> For user agents like Prince that support XML and HTML content it is  
> sometimes necessary to distinguish whether a .html file is actually XML  
> or HTML in order for it to be processed correctly.
>
> I've written an article for XML.com explaining exactly how Prince  
> performs content sniffing to distinguish XML and HTML documents:
>
>      What Does XML Smell Like?
>      http://www.xml.com/pub/a/2007/02/28/what-does-xml-smell-like.html
>
> Any feedback would be greatly appreciated. No doubt at some point it  
> will be necessary to revise our heuristics for HTML5 :)

If you load a file from disk, then use any meta information the OS can  
provide. (I think Linux can store Content-Type information for files.) If  
the OS relies on file extensions (like Windows does) then use that.
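
To illustrate the order of precedence suggested above, here is a minimal Python sketch. The function name choose_parser, the extension table, and the os_content_type parameter are hypothetical and only for illustration; how a Content-Type would actually be obtained from the OS is left out, since that is platform-specific.

```python
import os

# Hypothetical extension table following the rule of thumb in this message:
# .htm/.html mean HTML; .xhtml/.xht/.xml mean XML.
EXTENSION_TO_PARSER = {
    ".htm": "html",
    ".html": "html",
    ".xhtml": "xml",
    ".xht": "xml",
    ".xml": "xml",
}

XML_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}


def choose_parser(path, os_content_type=None):
    """Return "html", "xml", or None when nothing can be determined.

    os_content_type is whatever Content-Type the OS recorded for the file,
    if any (assumption: the caller has already looked it up).
    """
    if os_content_type:
        # Drop parameters such as "; charset=utf-8" before comparing.
        mime = os_content_type.split(";", 1)[0].strip().lower()
        if mime == "text/html":
            return "html"
        if mime in XML_TYPES or mime.endswith("+xml"):
            return "xml"
    # No usable metadata: fall back to the file extension.
    ext = os.path.splitext(path)[1].lower()
    return EXTENSION_TO_PARSER.get(ext)
```

With this sketch, choose_parser("index.html") picks the HTML parser without looking at the content at all, while choose_parser("index.html", "application/xhtml+xml") lets explicit metadata win over the extension.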

.htm and .html are HTML. I know of lots of HTML documents that start with  
an "XML declaration" but are not well-formed if parsed as XML. (For  
starters, some version of DreamWeaver emitted XML declarations for  
documents, but did not ensure well-formedness and the result is often not  
well-formed.) Even if such a document happens to be well-formed, it
probably wasn't tested under XML conditions, so its style sheets and
scripts likely work correctly only under HTML conditions.
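
As a rough demonstration that an XML declaration says nothing about well-formedness, the following Python sketch feeds two made-up documents to the standard library's XML parser. The sample markup and the helper name is_well_formed are assumptions for illustration, not actual DreamWeaver output.

```python
import xml.etree.ElementTree as ET

# Made-up example: starts with an XML declaration, but uses an unclosed
# <br>, which is fine as HTML and fatal as XML.
TAG_SOUP = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b"<html><head><title>t</title></head>"
    b"<body><p>line one<br>line two</p></body></html>"
)

# The same document with the empty element closed properly.
WELL_FORMED = TAG_SOUP.replace(b"<br>", b"<br/>")


def is_well_formed(data):
    """Return True if data parses as XML, False on a well-formedness error."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


def has_xml_declaration(data):
    """Naive check: does the document start with "<?xml"?"""
    return data.lstrip().startswith(b"<?xml")
```

Both samples pass has_xml_declaration, but only WELL_FORMED passes is_well_formed, so sniffing for the declaration alone would send TAG_SOUP to an XML parser that rejects it.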

From the article:

| It is common for XHTML files to be given an extension of .html or .htm,
| as .xhtml is rather long and .xht is rather obscure. This means that a
| file with an extension of .html may actually be an XML document and
| require an XML parser.

This is completely bogus. Those "XHTML" files are most likely intended to  
be treated as HTML and not as XML. If an author wanted it to be treated as  
XML he/she would use .xhtml, .xht or .xml. Even if it would work correctly  
with an XML parser, it would likely also work correctly with an HTML  
parser (since all browsers would treat it as HTML, and authors mostly test  
their documents in some browser).

If an author wrote a document and tested it with Prince, finding that  
XML-only features work even with a .html file extension, then it is likely  
that that document would break in browsers (because XML-only features  
don't work in HTML).

HTML5 has specified content-sniffing rules, FWIW:  
http://www.whatwg.org/specs/web-apps/current-work/#content-type-sniffing

See also: http://www.w3.org/Bugs/Public/show_bug.cgi?id=1500

-- 
Simon Pieters