[whatwg] Distinguishing XML and HTML by content sniffing

Julian Reschke julian.reschke at gmx.de
Sun Mar 4 02:47:35 PST 2007

Michael Day wrote:
> ...
> I think that approach could easily misidentify valid HTML documents as 
> being XML. It would be easy to parse the first 8 KB of many HTML 
> documents with an XML parser, as unclosed tags like <link> and <meta> 
> would not trigger any well-formedness errors unless you parsed all the 
> way to the end of the document -- not just the first 8 KB -- and found 
> that they were never closed.
> On a more pragmatic level, I think it would also be slightly more 
> difficult to implement this approach with libxml2, as you would have to 
> carefully feed the parser only 8 KB (or some other amount) and then stop 
> it before it hits the end of the buffer and complains about all the 
> unclosed tags. However, the misidentification problem is a more serious 
> issue affecting this approach.
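
The prefix problem described in the quoted text can be reproduced with a
small sketch. It uses Python's expat binding rather than libxml2, which is
a substitution made here for brevity; the function name and the sample
markup are hypothetical. Feeding only a prefix to an incremental parser,
without signalling end of input, raises no error for tags that are still
open:

```python
import xml.parsers.expat


def prefix_parses_as_xml(prefix: bytes) -> bool:
    """Feed only a prefix to an incremental XML parser (hypothetical helper).

    The second argument False tells expat that more data may follow,
    so elements that are still open are not reported as errors.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(prefix, False)
    except xml.parsers.expat.ExpatError:
        return False
    return True


# Illustrative HTML prefix: <link> and <meta> are never closed.
html_prefix = (
    b'<html><head><link rel="stylesheet" href="a.css">'
    b'<meta name="x" content="y">'
)
```

Under these assumptions the prefix is accepted, so a sniffer that stops
here would presumably label the document as XML.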


What, except efficiency, prevents you from parsing the whole file with 
an XML parser? If it parses, it is XML. Otherwise it isn't.
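
A minimal sketch of that whole-file check, assuming Python's expat
binding as the XML parser (any conforming parser should behave the same
way for well-formedness); the function name is hypothetical:

```python
import xml.parsers.expat


def is_well_formed_xml(data: bytes) -> bool:
    """Parse the complete input; True only if it is well-formed XML."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # True marks this as the final chunk, so elements that are
        # still open at the end are reported as errors.
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True
```

With this check, typical HTML with an unclosed <link> or <meta> is
rejected as soon as the enclosing element is closed or the input ends,
at the cost of reading the entire file.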

Best regards, Julian

