[whatwg] Caching of identical files from different URLs using checksums

Sat Feb 18 05:45:36 PST 2012

On 17.02.12 19:05, Julian Reschke wrote:
> On 2012-02-17 09:42, Sven Neuhaus wrote:
>> Hello,
>>
>> as of 2012, some websites are including popular JavaScript libraries
>> from CDNs, like Google's. The benefits are:
>>
>> * Traffic savings for the site operator, because the JavaScript
>>   libraries are downloaded from the CDN and not from the site that
>>   uses them.
>> * If enough sites refer to the same external file, the browser will
>>   have the file cached, so even on a first visit the (potentially
>>   large) JavaScript file will not have to be downloaded.
>>
>> There are however some drawbacks to this approach:
>>
>> * Security: The site operator is trusting an external site. If the
>>   CDN serves a malicious file, it will directly lead to code
>>   execution in browsers under the domain settings of the site
>>   including it (a form of cross-site scripting).
>> * Availability: The site depends on the CDN being available. If the
>>   CDN is down, the site may not be available at all.
>> * Privacy: The CDN will see requests for the file, with HTTP Referer
>>   headers, for every visitor of the site.
>> * An extra DNS lookup if the file is not already cached.
>> * An extra HTTP connection (a persistent connection can't be reused
>>   because it's a different site) if the file is not cached.
>>
>> I am proposing a solution that will solve all these problems, keep the
>> benefits and offer some extra advantages:
>>
>> 1. The site stores a copy of the library file(s) on its own site.
>> 2. The web page includes the library from the site itself instead of
>>    from the CDN.
>> 3. The script tag specifies a checksum calculated using a
>>    cryptographic hash function.
>>
>> With this solution, whenever a browser downloads a file and stores it
>> in the local cache, it calculates its checksum. The browser can check
>> its cache for an (identical) file with the same checksum (no matter
>> what URL it was retrieved from) and use it instead of downloading the
>> file again.
>>
>> This suggestion has previously been discussed here (
>> http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2006-November/thread.html#7825
>> ), however for a different purpose (file integrity instead of caching
>> identical files from different sites) and I don't feel the points
>> raised back then apply.
>>
>> If a library is popular, chances are that many sites are including the
>> identical file and it will already be in the browser's cache. No
>> network access is necessary to use it, improving the users' privacy.
>> It doesn't matter if the sites store the library file at a different
>> URL. It will always be identified by its checksum. The cached file can
>> be used more often.
>>
>> The syntax used to specify the checksum uses the fragment identifier
>> component of a URI (RFC 3986 section 3.5).
>> ...
> 
> Stop here. That's not what the fragment identifier is for.
> 
> Instead, you could specify the hash as a separate attribute on the
> containing element.

The relevant section from RFC 3986 reads:

  "The fragment identifier component of a URI allows indirect
   identification of a secondary resource by reference to a primary
   resource and additional identifying information.  The identified
   secondary resource may be some portion or subset of the primary
   resource, some view on representations of the primary resource, or
   some other resource defined or described by those representations."

This description does not contradict the use of checksums as fragment
identifiers. They are "additional identifying information."

However, if there is a consensus that checksums shouldn't be stored in
the fragment part of the URL, a new attribute would be a good alternative.
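
Whichever way the checksum ends up being transmitted, the cache lookup
from the original proposal could be sketched roughly as follows. This is
only an illustration in Python, not browser code: the class name
ChecksumCache, the choice of SHA-256, the fetch callback and the decision
to reject a download whose checksum does not match are all assumptions
made for the example, not part of the proposal.

```python
import hashlib
from typing import Callable, Dict


class ChecksumCache:
    """Toy cache keyed by content checksum instead of URL (hypothetical)."""

    def __init__(self) -> None:
        self._by_digest: Dict[str, bytes] = {}

    @staticmethod
    def digest(body: bytes) -> str:
        # SHA-256 is an assumption; the proposal only asks for
        # "a cryptographic hash function".
        return hashlib.sha256(body).hexdigest()

    def load(self, url: str, expected: str,
             fetch: Callable[[str], bytes]) -> bytes:
        """Return the file for `url`, reusing any cached copy whose
        checksum equals `expected`, no matter which URL it came from."""
        cached = self._by_digest.get(expected)
        if cached is not None:
            return cached  # cache hit: no network access needed

        body = fetch(url)  # cache miss: download from the given URL
        actual = self.digest(body)
        if actual != expected:
            # Assumption: a mismatching download is rejected rather
            # than used or cached.
            raise ValueError("checksum mismatch for " + url)
        self._by_digest[actual] = body
        return body
```

With such a cache, a page on one site that loads its own copy of a
library fills the entry, and a later page on a different site that
declares the same checksum for its own copy gets the cached bytes
without fetch being called again.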

Regards,
-Sven Neuhaus