[whatwg] Zip archives as first-class citizens

Wed Aug 28 07:36:14 PDT 2013

On 8/28/13 9:32 AM, Anne van Kesteren wrote:
> We have thought of three approaches for zip URL design thus far:
>
> * Using a sub-scheme (zip) with a zip-path (after !):
> zip:http://www.example.org/zip!image.gif
> * Introducing a zip-path (after %!): http://www.example.org/zip%!image.gif
> * Using media fragments: http://www.example.org/zip#path=image.gif
>
> High-level drawbacks:
>
> * Sub-scheme: requires changing the URL syntax with both sub-scheme
> and zip-path.
> * Zip-path: requires changing the URL syntax.
> * Fragments: fail to work well for URLs relative to a zip archive.
>
> Fragments are conceptually the cleanest as the only part of a URL
> that's supposed to depend on the Content-Type is the fragment.
> However, if you want to link to an ID inside an HTML resource you'd
> have to do #path=test.html&id=test which would require adding
> knowledge to the HTML resource that it is contained in a zip archive
> and have special processing based on that. And not just HTML, same
> goes for CSS or JavaScript.
>
> I'm not sure we need to consider sub-scheme if zip-path can work as
> it's more complex and not very well thought out. E.g. imagine
> view-source:zip:http://www.example.org/zip!test.html. (I hope we never
> need to standardize view-source and that it can be restricted to the
> address bar in browsers.)
>
> zip-path makes zip archive packaging by far the easiest. If we use %!
> as separator that would cause a network error in some existing
> browsers (due to an illegal %), which means it's extensible there,
> though not backwards compatible.
>
> We'd adjust the URL parser to build a zip-path once %! is encountered.
> And relative URLs would first look if there's a zip-path and work
> against that, and use path otherwise.
>
> Fetching would always use the path. If there's a zip-path and the
> returned resource is not a zip archive it would cause a network error.
>
> As for nested zip archives. Andrea suggested we should support this,
> but that would require zip-path to be a sequence of paths. I think we
> never want to allow relative URLs to escape the top-most zip archive.
> But I suppose we could support it in a way that
>
>    %!test.zip!test.html
>
> goes one level deeper. And "../image.gif" in test.html looks in the
> enclosing zip. And "../../image.gif" in test.html looks in the
> enclosing zip as well because it cannot ever be relative to the path,
> only the zip-path.
>
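
To check that I am reading the relative-URL rule correctly, here is a
minimal Python sketch of it. It rests on my own simplifying assumption
that a zip-path can be modelled as a flat list of segments in which a
nested archive behaves like a directory; the function name
resolve_zip_path is hypothetical and not part of any proposal.

```python
def resolve_zip_path(base, relative):
    """Resolve `relative` against the zip-path segments in `base`.

    `base` is the zip-path of the current document as a list of
    segments, e.g. ["test.zip", "test.html"] for %!test.zip!test.html.
    Assumption: a nested archive is treated like a directory.
    """
    # Drop the document itself; what remains is its containing "directory".
    segments = base[:-1]
    for part in relative.split("/"):
        if part in ("", "."):
            continue
        if part == "..":
            # Never escape the top-most archive: ".." at the root is a no-op.
            if segments:
                segments.pop()
            continue
        segments.append(part)
    return segments


base = ["test.zip", "test.html"]
print(resolve_zip_path(base, "../image.gif"))
print(resolve_zip_path(base, "../../image.gif"))
```

Under that model both calls end up at image.gif in the top-most archive,
which matches the behaviour described in the quoted text.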

As the following URLs suggest, the %! separator (or %-anything) will
likely not work for ZIP files generated by a script that takes its
parameters from the query portion of the URL. The zip-path is simply
subsumed into the last query value, without causing a network error:

http://whatwg.gphemsley.org/url_test.php?file=test.zip&spacer=1%!example.png
http://whatwg.gphemsley.org/url_test.php?file=test.zip&spacer=1%/example.png
http://whatwg.gphemsley.org/url_test.php?file=test.zip&spacer=1?example.png

(And feel free to use that script to try out any other combos.)
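
The same effect can be reproduced offline. The sketch below uses
Python's urllib.parse as a stand-in for whatever the server-side script
does with the query string (that equivalence is an assumption on my
part); last_query_value is just an illustrative helper name.

```python
from urllib.parse import parse_qsl, urlsplit


def last_query_value(url):
    """Return the (name, value) pair of the last query parameter."""
    query = urlsplit(url).query
    # keep_blank_values so that an empty trailing parameter is not dropped.
    return parse_qsl(query, keep_blank_values=True)[-1]


base = "http://whatwg.gphemsley.org/url_test.php?file=test.zip&spacer=1"
for suffix in ("%!example.png", "%/example.png", "?example.png"):
    # The intended zip-path ends up inside the value of "spacer".
    print(last_query_value(base + suffix))
```

In each case the intended zip-path comes back as part of the value of
the "spacer" parameter rather than as a separate component.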

However, since fragments (i.e. anything beginning with '#') are already 
not sent to the server, what if you modified the URL parser to use a 
special hash-prefix combo that indicates the path? Then you could avoid 
the problem of having to make documents aware of the fact that they're 
in a ZIP because the hash-prefix combo would come before the plain hash 
which holds the ID.

So, for example:

http://whatwg.gphemsley.org/url_test.php?file=test.zip&spacer=1#/example.html#middle

Then you could also take the opportunity to spec the #! prefix (and 
other hash-combo prefixes) that is used by a lot of sites nowadays.
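
As a rough illustration of the #/ idea, here is a minimal Python sketch
of how a parser might separate the zip-path from the ordinary fragment.
The "/" marker and the split on the second "#" follow the example URL
above; the function name split_zip_fragment and the exact return shape
are my own assumptions, not a worked-out design.

```python
from urllib.parse import urlsplit


def split_zip_fragment(url):
    """Split a fragment such as "/example.html#middle".

    Returns (zip_path, element_id). zip_path is None when the fragment
    does not start with the hypothetical "/" prefix.
    """
    # urlsplit cuts at the first "#", so a second "#" stays in the fragment.
    fragment = urlsplit(url).fragment
    if not fragment.startswith("/"):
        return None, fragment
    zip_path, _, element_id = fragment.partition("#")
    return zip_path, element_id


url = ("http://whatwg.gphemsley.org/url_test.php"
       "?file=test.zip&spacer=1#/example.html#middle")
print(split_zip_fragment(url))
```

Since none of this is sent to the server, the query string reaches the
script untouched, and the document inside the archive only ever sees the
plain ID.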

-- 
Gordon P. Hemsley
me at gphemsley.org
http://gphemsley.org/