[whatwg] Distinguishing XML and HTML by content sniffing
mikeday at yeslogic.com
Sun Mar 4 03:14:19 PST 2007
> What, except efficiency, prevents you from parsing the whole file with
> an XML parser? If it parses, it is XML. Otherwise it isn't.
This approach would suffer from the opposite problem: documents that the
author intended to be treated as XML would be treated as HTML if there
was a single well-formedness error anywhere in the document.
The resulting behaviour would be quite confusing for users, as an XHTML
file containing SVG and MathML content would suddenly stop working if a
tag was left unclosed. However, since the file would probably still
parse correctly as HTML, especially if the unclosed tag was something
like <img> or <br>, the user might not get any error messages relating
to the well-formedness error. Instead, they could get error messages
relating to the unknown SVG and MathML tags in their "HTML" document.
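
To make the failure mode concrete, here is a minimal sketch of the "parse the whole file" policy, using Python's standard xml.etree.ElementTree. The function names and sample documents are hypothetical illustrations, not anything taken from Prince or from a spec.

```python
import xml.etree.ElementTree as ET

def is_well_formed_xml(text: str) -> bool:
    """Return True if the whole document parses as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

def classify_by_full_parse(text: str) -> str:
    """The quoted proposal: if it parses, it is XML; otherwise HTML."""
    return "xml" if is_well_formed_xml(text) else "html"

# Hypothetical XHTML document with embedded SVG.
GOOD = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p>Area:<br/></p>'
    '<svg xmlns="http://www.w3.org/2000/svg"><circle r="10"/></svg>'
    '</body></html>'
)

# The same document with a single unclosed <br>.
BAD = GOOD.replace("<br/>", "<br>")

print(classify_by_full_parse(GOOD))
print(classify_by_full_parse(BAD))
```

Under this policy the second document is silently handed to the HTML parser, so the author would see complaints about the svg and circle elements rather than about the unclosed tag.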
Our heuristics are an attempt to guess the intentions of users.
Specifying an XML declaration or other XML-specific content is an
indication that the document should be treated as XML. In the absence of
any XML-specific signs, a .html file really has to be treated like an
HTML document, even if it would potentially be successfully parsed by an
XML parser. Any other policy would appear to lead to very confusing
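
As a rough sketch of the kind of heuristic described above, the following Python function looks for XML-specific signs in a .html file and falls back to HTML when none are found. It is only an illustration: the function name, the size of the inspected prefix, and the choice of an xmlns attribute as the "other XML-specific content" are all assumptions, not the actual rules used by Prince.

```python
import re

# Assumption: an xmlns (or xmlns:prefix) attribute on a start tag counts
# as "XML-specific content". The real heuristics may differ.
XMLNS_ATTR = re.compile(r"<[A-Za-z][^>]*\sxmlns(?::[\w.-]+)?\s*=")

def sniff_html_file(text: str, head_chars: int = 1024) -> str:
    """Guess whether the content of a .html file was meant as XML or HTML."""
    # Only inspect a prefix of the document (the prefix size is an assumption).
    head = text.lstrip("\ufeff")[:head_chars]
    if head.startswith("<?xml"):
        return "xml"   # XML declaration: a clear sign of XML intent
    if XMLNS_ATTR.search(head):
        return "xml"   # namespace declaration (assumed XML-specific sign)
    return "html"      # no XML-specific signs: treat as HTML

print(sniff_html_file('<?xml version="1.0"?><html/>'))
print(sniff_html_file('<html xmlns="http://www.w3.org/1999/xhtml"><body/></html>'))
print(sniff_html_file('<html><body><p>hi</p></body></html>'))
```

Note that the last document is well-formed XML, yet it is still classified as HTML because nothing in it signals that the author intended XML.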