[whatwg] Content type sniffing

Mon Jan 12 00:02:31 PST 2009

I should say that these figures are weighted by the number of page
loads, so if sniffing for a particular tag is needed for the digg.com
home page, it will show up as a large number.  If you don't weight by
traffic, you get similar results, but with slightly different numbers.

Adam

On Sun, Jan 11, 2009 at 11:54 PM, Adam Barth <whatwg at adambarth.com> wrote:
> On Sun, Jan 11, 2009 at 6:41 PM, Boris Zbarsky <bzbarsky at mit.edu> wrote:
>> I just noticed that section 2.7.1 of HTML5 says:
>>
>>  Extensions must not be used for determining resource types
>>  for resources fetched over HTTP.
>
> Extensions are bad news for content sniffing because they can often be
> chosen by the attacker.  For example, suppose user-uploaded content
> can be downloaded at:
>
> http://example.com/download.php
>
> In most PHP configurations, the attacker can choose whatever file
> extension he likes by directing the user's browser to:
>
> http://example.com/download.php/whatever.foo
>
> And the PHP script will happily run.
>
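A minimal sketch of the point above, assuming a sniffer that naively takes
the "extension" from the last path segment of the URL. The helper name
url_extension is hypothetical and is not taken from any browser or from
the HTML5 draft; it only illustrates that the trailing segment, and so the
apparent extension, is under the control of whoever builds the link.

```python
import posixpath
from urllib.parse import urlsplit


def url_extension(url: str) -> str:
    """Return the extension of the URL path, as a naive sniffer might.

    Hypothetical helper for illustration only.
    """
    path = urlsplit(url).path
    # splitext only looks at the final path segment, so extra path info
    # appended after the script name decides the result.
    return posixpath.splitext(path)[1]


# Same script in both cases; only the attacker-chosen suffix differs.
honest = url_extension("http://example.com/download.php")
forged = url_extension("http://example.com/download.php/whatever.foo")
```

Here honest is ".php" while forged is ".foo", even though both URLs reach
the same script in the configuration described above.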
>> Now this use case (no content-type at all) was pretty common when the
>> unknown type sniffer in Gecko was written, but that was years ago.  Do we
>> have any data on how common it is now?
>
> Yes.  We do have lots of data from opt-in user metrics from Chrome.
> Here is a somewhat recent summary:
>
> https://crypto.stanford.edu/~abarth/research/html5/content-sniffing/
>
> To address your particular concern, <body occurs 6899 times less often
> than <script on Web content that lacks a Content-Type (or has a bogus
> Content-Type like */*), assuming I did my arithmetic correctly.
>
>> P.S.  Of course at the moment the sniffer in Gecko is used for more than
>> just HTTP, and it looks like we'll need separate modes for things like HTTP
>> and things like file://.  I can live with that, though.  For the file://
>> case detection of HTML in documents with no doctype/<html>/<head> is a must.
>
> I'm sympathetic to adding more HTML tags to the list, but I'm not sure
> how far down the tail we should go.  In Chrome, we went for 99.999%
> compatibility, which might be a bit far down the tail.  You can see
> the algorithm here:
>
> http://src.chromium.org/viewvc/chrome/trunk/src/net/base/mime_sniffer.cc?view=markup
>
> Using that figure, we went down to <p (which is two tags less common
> than <body).
>
> Adam
>
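
For readers who want a concrete picture of tag-based sniffing as discussed
above, here is a rough sketch. It is not the Chrome algorithm linked above
and not the HTML5 one: the tag list, the 512-byte window, and the rule that
a tag must be followed by a space or ">" are all assumptions made for
illustration, and looks_like_html is a hypothetical name.

```python
# Hypothetical tag list for illustration; real sniffers use their own lists,
# and how far down the tail to go is exactly the open question in this thread.
HYPOTHETICAL_TAGS = (
    b"<!doctype html",
    b"<html",
    b"<head",
    b"<script",
    b"<body",
    b"<p",
)


def looks_like_html(data: bytes) -> bool:
    """Guess whether a body with no usable Content-Type is HTML.

    Assumption: only the first 512 bytes are examined, leading whitespace
    is skipped, and matching is case-insensitive.
    """
    head = data[:512].lstrip(b" \t\r\n\x0c").lower()
    for tag in HYPOTHETICAL_TAGS:
        if head.startswith(tag):
            # Require a terminator so that "<p" does not match "<pre".
            terminator = head[len(tag):len(tag) + 1]
            if terminator in (b" ", b">"):
                return True
    return False
```

With this sketch, looks_like_html(b"  <BODY bgcolor=white>") and
looks_like_html(b"<p>hello") are both true, while
looks_like_html(b"<pre>text</pre>") and looks_like_html(b"plain text") are
false. Each tag added to the list trades a little more compatibility for a
larger set of non-HTML content that might be treated as HTML.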