[whatwg] Content type sniffing

Adam Barth whatwg at adambarth.com
Sun Jan 11 23:54:18 PST 2009

On Sun, Jan 11, 2009 at 6:41 PM, Boris Zbarsky <bzbarsky at mit.edu> wrote:
> I just noticed that section 2.7.1 of HTML5 says:
>  Extensions must not be used for determining resource types
>  for resources fetched over HTTP.

Extensions are bad news for content sniffing because they can often be
chosen by the attacker.  For example, suppose user-uploaded content
can be downloaded at:


In most PHP configurations, the attacker can choose whatever file
extension he likes by directing the user's browser to:


And the PHP script will happily run.
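
A minimal sketch of the problem, under stated assumptions: the
example.com URLs and the extension_from_url helper are hypothetical
and are not the URLs from the original message.  It only shows that a
sniffer keyed on the last path segment's extension sees whatever
suffix the attacker appends after the script name.

```python
import posixpath
from urllib.parse import urlsplit


def extension_from_url(url: str) -> str:
    """Return the extension of the last path segment, as a naive
    extension-based sniffer might compute it (hypothetical helper)."""
    path = urlsplit(url).path          # drops the query string and fragment
    return posixpath.splitext(path)[1].lower()


# Hypothetical URLs, for illustration only.
honest = "http://example.com/download.php?id=42"
crafted = "http://example.com/download.php/evil.html?id=42"

print(extension_from_url(honest))   # extension of the script itself
print(extension_from_url(crafted))  # extension chosen by the attacker
```

Both URLs reach the same script in the configurations described
above, yet the second one presents an ".html" extension to anything
that trusts the URL.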

> Now this use case (no content-type at all) was pretty common when the
> unknown type sniffer in Gecko was written, but that was years ago.  Do we
> have any data on how common it is now?

Yes.  We do have lots of data from opt-in user metrics from Chrome.
Here is a somewhat recent summary:


To address your particular concern, <body occurs 6899 times less often
than <script on Web content that lacks a Content-Type (or has a bogus
Content-Type like */*), assuming I did my arithmetic correctly.

> P.S.  Of course at the moment the sniffer in Gecko is used for more than
> just HTTP, and it looks like we'll need separate modes for things like HTTP
> and things like file://.  I can live with that, though.  For the file://
> case detection of HTML in documents with no doctype/<html>/<head> is a must.

I'm sympathetic to adding more HTML tags to the list, but I'm not sure
how far down the tail we should go.  In Chrome, we went for 99.999%
compatibility, which might be a bit far down the tail.  You can see
the algorithm here:


Using that figure, we went down to <p (which is two tags less common
than <body).
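
A rough sketch of what "going down the tail" means in a sniffer,
assuming a simple prefix match against a short tag list.  The list
below contains only tags named in this thread and is an assumption
for illustration; it is not Chrome's actual list or the HTML5
algorithm, and the 512-byte window is likewise an assumed value.

```python
# Hypothetical tag list, ordered roughly from common to less common.
# Extending it further down the tail trades safety for compatibility.
HTML_PREFIXES = (
    b"<!doctype html",
    b"<html",
    b"<head",
    b"<script",
    b"<body",
    b"<p",
)

TAG_TERMINATORS = b" \t\r\n>"


def looks_like_html(data: bytes) -> bool:
    """Return True if the payload starts with one of the listed tags."""
    # Inspect only the start of the payload and ignore leading whitespace.
    head = data[:512].lstrip(b" \t\r\n\x0c").lower()
    for prefix in HTML_PREFIXES:
        if head.startswith(prefix):
            nxt = head[len(prefix):len(prefix) + 1]
            # Require a terminator so "<p" does not match "<pre".
            if nxt != b"" and nxt in TAG_TERMINATORS:
                return True
    return False
```

With this list, a payload that begins with "<p>" is treated as HTML,
while one that begins with "<pre>" or with plain text is not.  Each
tag added to the list widens what an attacker-supplied upload can be
sniffed as, which is the trade-off discussed above.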

