[whatwg] Distinguishing XML and HTML by content sniffing

Mon Mar 5 15:25:24 PST 2007

Hi Simon,

> If you load a file from disk, then use any meta information the OS can 
> provide. (I think Linux can store Content-Type information for files.) 
> If the OS relies on file extensions (like Windows does) then use that.

Some Linux file systems can store extra metadata in extended
attributes, but in practice I haven't seen any Linux distribution
actually use them to store the content type of files. This basically
leaves us with file extensions, just like Windows.
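
To make the fallback concrete, here is a rough Python sketch of "ask the
file system first, then fall back to the extension". It is only an
illustration, not how Prince does it. The attribute name
"user.mime_type" is an assumption (it follows the freedesktop.org
convention), and local_content_type is a hypothetical helper name.

```python
import mimetypes
import os
from typing import Optional

# Assumption: the content type, if stored at all, lives in the
# "user.mime_type" extended attribute (freedesktop.org convention).
MIME_XATTR = "user.mime_type"


def local_content_type(path: str) -> Optional[str]:
    """Best-effort content type for a local file (hypothetical helper)."""
    # os.getxattr is only available on Linux, so check before calling it.
    if hasattr(os, "getxattr"):
        try:
            raw = os.getxattr(path, MIME_XATTR)
            value = raw.decode("ascii", "replace").strip()
        except OSError:
            # Attribute not set, file missing, or xattrs unsupported here.
            value = ""
        if value:
            return value
    # In practice we nearly always end up here: guess from the extension.
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed
```

On a typical system no such attribute is set, so the function falls
straight through to the extension lookup.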

> .htm and .html are HTML. I know of lots of HTML documents that start 
> with an "XML declaration" but are not well-formed if parsed as XML. (For 
> starters, some version of DreamWeaver emitted XML declarations for 
> documents, but did not ensure well-formedness and the result is often 
> not well-formed.) Even if it was well-formed, it probably wasn't tested 
> under XML conditions so it's likely that style sheets and scripts only 
> work correctly under HTML conditions.

Given that Prince serves a different niche than most user agents, our 
users tend to be more likely to use XML with embedded SVG etc., and less 
likely to run Prince on documents created by DreamWeaver. When Prince is 
run on a document retrieved over HTTP it obeys the Content-Type header, 
so that documents on the web will be parsed as HTML.

However, it is true that if a document looks like XML but is not 
actually well-formed, and it is downloaded and saved as a file, then 
there is no Content-Type header to go by, so Prince will sniff the 
content and try to load it as XML rather than HTML. The user will then 
receive error messages. In practice, this case does not seem to arise 
very often, but if it encourages people to either fix their XML and 
make it well-formed or stop pretending that their HTML is XML then that 
doesn't sound like such a bad thing :)
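
The decision order described above could be sketched roughly like this
in Python. It is a simplified illustration under stated assumptions, not
Prince's actual code: choose_parser is a hypothetical name, the sniff
only looks for an XML declaration at the start of the data, it only
handles ASCII-compatible encodings, and treating unrecognised media
types as HTML is my own simplification.

```python
from typing import Optional

UTF8_BOM = b"\xef\xbb\xbf"


def choose_parser(content_type: Optional[str], head: bytes) -> str:
    """Return "xml" or "html" (hypothetical helper).

    content_type is the Content-Type header value, or None when the
    document came from a local file. head is the first bytes of the
    document.
    """
    if content_type:
        # Metadata is available: obey it and never look at the content.
        media_type = content_type.split(";", 1)[0].strip().lower()
        xml_types = ("application/xml", "text/xml")
        if media_type in xml_types or media_type.endswith("+xml"):
            return "xml"
        # Assumption: anything else, including text/html, goes to HTML.
        return "html"

    # No metadata: sniff. Skip a UTF-8 byte order mark if there is one,
    # then look for an XML declaration at the very start.
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    if head.startswith(b"<?xml"):
        return "xml"
    return "html"
```

With this sketch, an HTML file that begins with an XML declaration is
parsed as HTML when served as text/html, but as XML when loaded from
disk, which is exactly the case under discussion.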

> If an author authored a document and tested it with Prince, finding 
> that XML-only features work even with a .html file extension, then it is 
> likely that that document would break in browsers (because XML-only 
> features don't work in HTML).

This comes back to the thorny issue of how MathML is supposed to work on 
the web. It seems to require that content be served up as XHTML, which 
no one does, or that HTML documents contain "XML islands", which is not 
well specified at all. It would be nice if HTML5 could tackle this in a 
way that makes sense.

> HTML5 has specified content-sniffing rules, FWIW: 
> http://www.whatwg.org/specs/web-apps/current-work/#content-type-sniffing

Yes, these rules never seem to identify a document as being XML, though.

> See also: http://www.w3.org/Bugs/Public/show_bug.cgi?id=1500

Prince always respects the Content-Type header, and only sniffs document 
content when no such metadata is available.

Best regards,

Michael

-- 
Print XML with Prince!
http://www.princexml.com