[whatwg] Distinguishing XML and HTML by content sniffing
zcorpan at gmail.com
Sun Mar 4 04:19:53 PST 2007
On Sun, 04 Mar 2007 07:33:51 +0100, Michael Day <mikeday at yeslogic.com> wrote:
> For user agents like Prince that support XML and HTML content it is
> sometimes necessary to distinguish whether a .html file is actually XML
> or HTML in order for it to be processed correctly.
> I've written an article for XML.com explaining exactly how Prince
> performs content sniffing to distinguish XML and HTML documents:
> What Does XML Smell Like?
> Any feedback would be greatly appreciated. No doubt at some point it
> will be necessary to revise our heuristics for HTML5 :)
If you load a file from disk, then use any meta information the OS can
provide. (I think Linux can store Content-Type information for files.) If
the OS relies on file extensions (like Windows does) then use those.
.htm and .html are HTML. I know of lots of HTML documents that start with
an "XML declaration" but are not well-formed when parsed as XML. (For
starters, some version of Dreamweaver emitted XML declarations for
documents without ensuring well-formedness, so the result is often not
well-formed.) Even if such a document were well-formed, it probably wasn't
tested under XML conditions, so its style sheets and scripts likely only
work correctly under HTML conditions.
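To make the point about XML declarations concrete, here is a rough sketch in Python using the standard library's xml.etree.ElementTree. The helper name is_well_formed_xml and both sample documents are made up for illustration; this is not how Prince or any browser does it. It only shows that a leading XML declaration says nothing about whether the rest of the file survives an XML parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text: str) -> bool:
    """Return True if `text` parses as well-formed XML (hypothetical helper)."""
    try:
        # Parse from bytes so the encoding named in the declaration is honoured.
        ET.fromstring(text.encode("utf-8"))
    except ET.ParseError:
        return False
    return True


# Hypothetical tag soup: starts with an XML declaration, but <br> is never closed.
tag_soup = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<html><body><p>Line one<br>Line two</p></body></html>"
)

# The same document with the empty element closed the XML way.
clean = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<html><body><p>Line one<br/>Line two</p></body></html>"
)

if __name__ == "__main__":
    print(is_well_formed_xml(tag_soup))
    print(is_well_formed_xml(clean))
```

An HTML parser accepts both documents; an XML parser rejects the first one outright, which is why the declaration alone is a poor signal.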
From the article:
| It is common for XHTML files to be given an extension of .html or .htm,
| as .xhtml is rather long and .xht is rather obscure. This means that a
| file with an extension of .html may actually be an XML document and
| require an XML parser.
This is completely bogus. Those "XHTML" files are most likely intended to
be treated as HTML and not as XML. If an author wanted it to be treated as
XML he/she would use .xhtml, .xht or .xml. Even if it would work correctly
with an XML parser, it would likely also work correctly with an HTML
parser (since all browsers would treat it as HTML, and authors mostly test
their documents in some browser).
If an author wrote a document and tested it with Prince, finding that
XML-only features work even with a .html file extension, then it is likely
that the document would break in browsers (because XML-only features
don't work in HTML).
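The approach I'm suggesting for local files could be sketched roughly as below, in Python. The function name choose_parser, its content_type parameter (standing in for whatever metadata the OS can provide) and the exact lists of media types are my assumptions for illustration, not something taken from Prince or from any spec.

```python
import os

# Extensions as discussed above: .htm/.html mean HTML; .xhtml/.xht/.xml mean XML.
HTML_EXTENSIONS = {".htm", ".html"}
XML_EXTENSIONS = {".xhtml", ".xht", ".xml"}

# Assumed media type lists, for illustration only.
HTML_TYPES = {"text/html"}
XML_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}


def choose_parser(path, content_type=None):
    """Return "html", "xml", or None if neither source of information decides.

    `content_type` is whatever metadata the OS provided, if any
    (hypothetical parameter). It takes precedence over the extension.
    """
    if content_type:
        # Drop parameters such as "; charset=utf-8" before comparing.
        essence = content_type.split(";", 1)[0].strip().lower()
        if essence in HTML_TYPES:
            return "html"
        if essence in XML_TYPES or essence.endswith("+xml"):
            return "xml"
    extension = os.path.splitext(path)[1].lower()
    if extension in HTML_EXTENSIONS:
        return "html"
    if extension in XML_EXTENSIONS:
        return "xml"
    return None


if __name__ == "__main__":
    for name in ("index.html", "page.xhtml", "notes.txt"):
        print(name, choose_parser(name))
```

Note that nothing here looks inside the file: a .html file is handed to the HTML parser even if it begins with an XML declaration.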
HTML5 has specified content-sniffing rules, FWIW:
See also: http://www.w3.org/Bugs/Public/show_bug.cgi?id=1500