[whatwg] Distinguishing XML and HTML by content sniffing

Sun Mar 4 03:14:19 PST 2007

Hi Julian,

> What, except efficiency, prevents you from parsing the whole file with 
> an XML parser? If it parses, it is XML. Otherwise it isn't.

This approach would suffer from the opposite problem: documents that the 
author intended to be treated as XML would be treated as HTML if there 
was a single well-formedness error anywhere in the document.

The resulting behaviour would be quite confusing for users, as an XHTML 
file containing SVG and MathML content would suddenly stop working if a 
tag was left unclosed. However, since the file would probably still 
parse correctly as HTML, especially if the unclosed tag was something 
like <img> or <br>, the user might not get any error messages relating 
to the well-formedness error. Instead, they could get error messages 
relating to the unknown SVG and MathML tags in their "HTML" document.
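
To make the failure mode concrete, here is a minimal sketch of the 
"if it parses, it is XML" rule. It uses Python's standard xml.etree 
parser as a stand-in; the function name and the sample documents are 
hypothetical and are not taken from any real product.

```python
import xml.etree.ElementTree as ET


def sniff_by_parsing(text: str) -> str:
    """Classify as "xml" only if the whole document is well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # Any single well-formedness error silently demotes the file to HTML.
        return "html"
    return "xml"


# Hypothetical XHTML document with embedded SVG.
GOOD = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<svg xmlns="http://www.w3.org/2000/svg"><circle r="5"/></svg>'
    '<br/></body></html>'
)
# The same document with one unclosed <br> tag.
BAD = GOOD.replace("<br/>", "<br>")

print(sniff_by_parsing(GOOD))  # xml
print(sniff_by_parsing(BAD))   # html
```

The second document differs by a single character, yet it would be 
handed to the HTML parser, where the SVG elements are unknown tags.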

Our heuristics are an attempt to guess the intentions of users. 
Specifying an XML declaration or other XML-specific content is an 
indication that the document should be treated as XML. In the absence of 
any XML-specific signs, a .html file really has to be treated like an 
HTML document, even if it could be parsed successfully by an 
XML parser. Any other policy would appear to lead to very confusing 
behaviour.
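
As a rough illustration of sniffing by intent rather than by 
well-formedness, the sketch below checks only for a leading XML 
declaration. This is an assumption made for the example: the exact set 
of XML-specific signs is not listed here, and the function name is 
hypothetical.

```python
def sniff_by_signs(text: str) -> str:
    """Guess the author's intent from XML-specific signs, not from validity."""
    # Assumption: a leading XML declaration is the only sign checked here.
    # A byte order mark, if present, is skipped first.
    if text.lstrip("\ufeff").startswith("<?xml"):
        return "xml"
    # No XML-specific sign: treat as HTML, even if it happens to be
    # well-formed XML.
    return "html"


# Declared as XML but malformed: still XML, so the XML parser can
# report the unclosed <br> to the author.
print(sniff_by_signs('<?xml version="1.0"?><html><body><br></body></html>'))  # xml

# Well-formed, but nothing marks it as XML: HTML.
print(sniff_by_signs('<html><body><p>Hello</p></body></html>'))  # html
```

With this kind of rule, a well-formedness error in a document that 
declares itself as XML produces an XML error message instead of a 
silent fallback to HTML.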

Best regards,

Michael

-- 
Print XML with Prince!
http://www.princexml.com