[whatwg] Caching of identical files from different URLs using checksums

Fri Feb 17 10:05:17 PST 2012

On 2012-02-17 09:42, Sven Neuhaus wrote:
> Hello,
>
> as of 2012, some websites include popular JavaScript libraries from CDNs, such as
> Google's. The benefits are:
>
> * Traffic savings for the site operator because the JavaScript libraries are downloaded from
>    the CDN and not from the site that uses them
> * If enough sites refer to the same external file, the browser will already have it cached, so
>    even on a first visit the (potentially large) JavaScript file does not have to be downloaded.
>
> There are however some drawbacks to this approach:
>
> * Security: The site operator is trusting an external site.  If the CDN serves a malicious file
>    it will directly lead to code execution in browsers under the domain settings of the site
>    including it (a form of cross site scripting).
> * Availability: The site depends on the CDN to be available. If the CDN is down the site may not
>    be available at all.
> * Privacy: The CDN will see requests for the file with HTTP referer headers for every visitor
>    of the site.
> * Extra DNS lookup if file is not already cached
> * Extra HTTP connection (can't use persistent connection because it's a different site) if file is not cached
>
> I am proposing a solution that will solve all these problems, keep the benefits and offer
> some extra advantages:
>
> 1. The site stores a copy of the library file(s) on its own site.
> 2. The web page includes the library from the site itself instead of from the CDN
> 3. The script tag specifies a checksum calculated using a cryptographic hash function.
>
> With this solution, whenever a browser downloads a file and stores it in the local cache, it calculates
> its checksum. The browser can check its cache for an (identical) file with the same checksum
> (no matter what URL it was retrieved from) and use it instead of downloading the file again.
>
> This suggestion has previously been discussed here
> ( http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2006-November/thread.html#7825 ),
> but for a different purpose (file integrity instead of caching identical files from
> different sites), and I don't feel the points raised back then apply.
>
> If a library is popular, chances are that many sites are including the identical file and it will
> already be in the browser's cache. No network access is necessary to use it, improving the users'
> privacy. It doesn't matter if the sites store the library file at a different URL. It will always
> be identified by its checksum. The cached file can be used more often.
>
> The checksum is specified using the fragment identifier component of a URI
> (RFC 3986 section 3.5).
> ...

Stop here. That's not what the fragment identifier is for.

Instead, you could specify the hash as a separate attribute on the 
containing element.
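
As a rough illustration only, here is a minimal Python sketch of a cache
that is looked up by checksum rather than by URL, with the expected hash
passed separately from the URL (as it would be with its own attribute).
The names ChecksumCache, load_script, expected_hash and fetch, the
example.org URLs, and the choice of SHA-256 are all assumptions made for
the sketch; none of them come from the proposal or from any browser.

```python
import hashlib
from typing import Callable, Dict, Optional


class ChecksumCache:
    """Toy cache that indexes stored bodies by SHA-256 digest, not by URL."""

    def __init__(self) -> None:
        self._by_digest: Dict[str, bytes] = {}

    def store(self, body: bytes) -> str:
        # Hash the body and file it under its own digest.
        digest = hashlib.sha256(body).hexdigest()
        self._by_digest[digest] = body
        return digest

    def lookup(self, digest: str) -> Optional[bytes]:
        return self._by_digest.get(digest.lower())


def load_script(
    src: str,
    expected_hash: Optional[str],
    cache: ChecksumCache,
    fetch: Callable[[str], bytes],
) -> bytes:
    """Return the script body, reusing any cached copy with the same hash.

    expected_hash stands in for a hypothetical separate attribute on the
    script element; src is left untouched (no fragment identifier tricks).
    """
    if expected_hash is not None:
        cached = cache.lookup(expected_hash)
        if cached is not None:
            return cached  # same content seen before, whatever its URL was
    body = fetch(src)
    digest = hashlib.sha256(body).hexdigest()
    if expected_hash is not None and digest != expected_hash.lower():
        raise ValueError("checksum mismatch for " + src)
    cache.store(body)
    return body


if __name__ == "__main__":
    library = b"/* some popular library */"
    digest = hashlib.sha256(library).hexdigest()
    requests = []

    def fake_fetch(url: str) -> bytes:
        # Stand-in for a network request; records which URLs were hit.
        requests.append(url)
        return library

    cache = ChecksumCache()
    load_script("http://site-a.example.org/js/lib.js", digest, cache, fake_fetch)
    load_script("http://site-b.example.org/static/lib.min.js", digest, cache, fake_fetch)
    print(len(requests))  # only the first URL was actually fetched
```

Keeping the hash out of the URL means the URL still identifies the
resource in the usual way, and a client that ignores the extra value
simply fetches the file as before.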

Best regards, Julian