[whatwg] Content type sniffing

Mon Jan 12 07:54:15 PST 2009

Adam Barth wrote:
> Extensions are bad news for content sniffing because they can often be
> chosen by the attacker.  For example, suppose user-uploaded content
> can be downloaded at:
> 
> http://example.com/download.php
> 
> In most PHP configurations, the attacker can choose whatever file
> extension he likes by directing the user's browser to:
> 
> http://example.com/download.php/whatever.foo
> 
> And the PHP script will happily run.

Right, I understand that.
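
For anyone following along, here is a minimal sketch of the problem in
Python (not PHP).  It assumes a sniffer that naively keys off the
extension of the last URL path segment; url_extension is a hypothetical
helper, not something taken from Chrome or Gecko:

```python
import posixpath
from urllib.parse import urlsplit


def url_extension(url: str) -> str:
    """Return the extension of the last path segment of a URL.

    Hypothetical helper: this is what a naive extension-based sniffer
    might look at, not what any real browser does.
    """
    path = urlsplit(url).path
    return posixpath.splitext(path)[1]


# Same script on the server, but the trailing path info is chosen by
# whoever builds the link, so the "extension" is attacker-controlled.
for url in (
    "http://example.com/download.php",
    "http://example.com/download.php/whatever.foo",
):
    print(url, "->", url_extension(url))
```

The second URL yields ".foo" even though the same download.php script
serves the response.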

> Yes.  We do have lots of data from opt-in user metrics from Chrome.
> Here is a somewhat recent summary:
> 
> https://crypto.stanford.edu/~abarth/research/html5/content-sniffing/

I'm not quite sure what to make of this, actually.  Specifically, where 
is the "22.19%" number for "HTML Tags" coming from?  22.19% of what? 
The magic numbers stuff actually adds up to 100%, but of what?

> To address your particular concern, <body occurs 6899 times less often
> than <script on Web content that lacks a Content-Type (or has a bogus
> Content-Type like */*), assuming I did my arithmetic correctly.

OK, that's good to know.

> I'm sympathetic to adding more HTML tags to the list, but I'm not sure
> how far down the tail we should go.  In Chrome, we went for 99.999%
> compatibility, which might be a bit far down the tail.

Doesn't seem that way to me, given the number of web pages out there.

> http://src.chromium.org/viewvc/chrome/trunk/src/net/base/mime_sniffer.cc?view=markup

Ah, ok.  The relevant Gecko code is 
<http://hg.mozilla.org/mozilla-central/annotate/9f82199fdb9c/netwerk/streamconv/converters/nsUnknownDecoder.cpp#l477>. 
I'd probably be fine with trimming that list down a bit, but I'm not 
quite sure what the downsides of having more tags in it are here.
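
To make the tradeoff concrete, here is a rough sketch of tag-based
sniffing in Python.  The tag list, the 512-byte window, and the name
looks_like_html are all assumptions for illustration; this is not the
actual list or logic from mime_sniffer.cc or nsUnknownDecoder.cpp:

```python
# Illustrative tag list only; the real Chrome and Gecko lists differ.
HTML_TAGS = (b"<!doctype html", b"<html", b"<head", b"<script", b"<body")


def looks_like_html(data: bytes, tags=HTML_TAGS) -> bool:
    """Guess whether untyped content is HTML from its leading bytes.

    Skips leading whitespace, then does a case-insensitive prefix match
    against the tag list.  The 512-byte window is an assumed limit.
    """
    head = data[:512].lstrip(b" \t\r\n\x0c").lower()
    return head.startswith(tags)


# With <body in the list, this content is treated as HTML...
print(looks_like_html(b"  <BODY bgcolor=white>hi"))
# ...and with a shorter list, the same content is not.
print(looks_like_html(b"  <BODY bgcolor=white>hi", tags=(b"<script",)))
```

Every tag added to the list widens the set of untyped responses that
get treated as HTML, in both the intended and the unintended cases.
Note that this sketch does not check that the tag name is followed by
whitespace or ">".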

-Boris