[whatwg] Distinguishing XML and HTML by content sniffing
Michael Day
mikeday at yeslogic.com
Mon Mar 5 15:25:24 PST 2007
Hi Simon,
> If you load a file from disk, then use any meta information the OS can
> provide. (I think Linux can store Content-Type information for files.)
> If the OS relies on file extensions (like Windows does) then use that.
Some Linux file systems can store extra metadata in extended
attributes, but in practice I haven't seen any Linux distribution
actually use this functionality for storing the content type of files.
This basically leaves us with file extensions, just like Windows.
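As a rough sketch of that fallback (not Prince's actual code), a lookup
for local files could try an extended attribute first and then the file
extension. The function name local_content_type is hypothetical, and
the attribute name "user.mime_type" is assumed here as a naming
convention; as noted above, it is rarely set in practice.

```python
import mimetypes
import os
from typing import Optional


def local_content_type(path: str) -> Optional[str]:
    """Best-effort content type for a file on disk (hypothetical helper)."""
    # os.getxattr only exists on Linux, so look it up defensively.
    getxattr = getattr(os, "getxattr", None)
    if getxattr is not None:
        try:
            # Assumed attribute name; most files will not carry it.
            raw = getxattr(path, "user.mime_type")
        except OSError:
            # Attribute missing, file missing, or xattrs unsupported.
            raw = b""
        value = raw.decode("ascii", "replace").strip()
        if value:
            return value
    # Fall back to the file extension, just like Windows.
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed
```

For a file named page.html with no extended attribute set, this falls
through to the extension and reports text/html.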
> .htm and .html are HTML. I know of lots of HTML documents that start
> with an "XML declaration" but are not well-formed if parsed as XML. (For
> starters, some version of DreamWeaver emitted XML declarations for
> documents, but did not ensure well-formedness and the result is often
> not well-formed.) Even if it was well-formed, it probably wasn't tested
> under XML conditions so it's likely that style sheets and scripts only
> work correctly under HTML conditions.
Given that Prince serves a different niche than most user agents, our
users tend to be more likely to use XML with embedded SVG etc., and less
likely to run Prince on documents created by DreamWeaver. When Prince is
run on a document retrieved over HTTP it obeys the Content-Type header,
so that documents on the web will be parsed as HTML.
However, it is true that if a document that appears to be XML but
actually isn't is downloaded and saved as a file, the Content-Type
header is lost. Prince then sniffs the content and tries to load it as
XML rather than HTML, and the user receives error messages if the
document is not well-formed. In practice, this case does not seem to
arise very often, but if it encourages people to either fix their XML
and make it well-formed or stop pretending that their HTML is XML then
that doesn't sound like such a bad thing :)
> If an author authored a document and tested it with Prince, finding
> that XML-only features work even with a .html file extension, then it is
> likely that that document would break in browsers (because XML-only
> features don't work in HTML).
This comes back to the thorny issue of how MathML is supposed to work on
the web. It seems to require that content be served up as XHTML, which
no one does, or that HTML documents contain "XML islands", which is not
well specified at all. It would be nice if HTML5 could tackle this in a
way that makes sense.
> HTML5 has specified content-sniffing rules, FWIW:
> http://www.whatwg.org/specs/web-apps/current-work/#content-type-sniffing
Yes, these rules never seem to identify a document as being XML, though.
> See also: http://www.w3.org/Bugs/Public/show_bug.cgi?id=1500
Prince always respects the Content-Type header, and only sniffs document
content when no such metadata is available.
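The order of precedence can be sketched as follows. This is only an
illustration, not Prince's implementation: the name choose_parser is
hypothetical, and treating a leading XML declaration as the sniffing
signal is an assumption made for the example. It also only
distinguishes XML from HTML and ignores every other media type.

```python
import re
from typing import Optional

# Assumed sniffing signal: an XML declaration at the very start of the
# document, optionally preceded by a UTF-8 byte order mark.
_XML_DECL = re.compile(rb"^(?:\xef\xbb\xbf)?<\?xml\s")

_XML_TYPES = ("application/xhtml+xml", "application/xml", "text/xml")


def choose_parser(content_type: Optional[str], body: bytes) -> str:
    """Return "xml" or "html" for a document (hypothetical helper)."""
    if content_type:
        # Metadata is available: obey it and never look at the content.
        mime = content_type.split(";", 1)[0].strip().lower()
        if mime in _XML_TYPES or mime.endswith("+xml"):
            return "xml"
        return "html"
    # No metadata (e.g. a file saved to disk): sniff the leading bytes.
    return "xml" if _XML_DECL.match(body) else "html"
```

With this ordering, a document served as text/html is parsed as HTML
even if it starts with an XML declaration, while the same bytes saved
to disk without metadata are treated as XML.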
Best regards,
Michael
--
Print XML with Prince!
http://www.princexml.com