[whatwg] Archive API - proposal

Jonas Sicking jonas at sicking.cc
Wed Aug 15 20:10:08 PDT 2012


On Tue, Aug 14, 2012 at 1:20 PM, Glenn Maynard <glenn at zewt.org> wrote:
> (I've reordered my responses to give a more logical progression.)
>
> On Tue, Jul 17, 2012 at 9:23 PM, Andrea Marchesini <baku at mozilla.com> wrote:
>
>> // The getFilenames handler receives a list of DOMString:
>> var handle = this.reader.getFile(this.result[i]);
>>
>
> This interface is problematic.  Since ZIP files don't have a standard
> encoding, filenames in ZIPs are often garbage.  This API requires that
> filenames round-trip uniquely, or else files aren't accessible at all.  For
> example, if you have two filenames in CP932, "日" and "本", but the encoding
> isn't determined correctly, you may end up with two files both with a
> filename of "??".  Either you can't open either file, or you can only open
> one of them.  This isn't theoretical; I hit ZIP files like this in the wild
> regularly.
>
> Instead, I'd recommend that the primary API simply returns File objects
> directly from the ZIP.  For example:
>
> var reader = archive.getFiles();
> reader.onsuccess = function(result) {
>     // result = [File, File, File, File...];
>
>     console.log(result[0].name);
>     // read the file
>     var fileReader = new FileReader();
>     fileReader.readAsArrayBuffer(result[0]);
> }
>
> This allows opening files without any dependency on the filename.  Since
> File objects are by design lightweight--no decompression should happen
> until you actually read from the file--this isn't expensive and won't
> perform any extra I/O.  All the information you need to expose a File
> object is in the central directory (filename, mtime, decompressed size).

This is a good idea. It neatly sidesteps the problem by not relying
on filenames as keys.
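Glenn's observation that everything needed lives in the central
directory can be made concrete. Below is an illustrative sketch (not
part of the proposal) of locating the ZIP End of Central Directory
record, which holds the entry count and the offset of the central
directory:

```javascript
// Illustrative sketch: find and parse the ZIP End of Central Directory
// (EOCD) record. This is the only structure an implementation must read
// before it can expose lightweight File objects for every entry.
function parseEOCD(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // The EOCD sits at the end of the file, possibly followed by an
  // archive comment of up to 65535 bytes, so scan backwards for its
  // signature (0x06054b50).
  for (let i = bytes.length - 22; i >= 0; i--) {
    if (view.getUint32(i, true) === 0x06054b50) {
      return {
        totalEntries: view.getUint16(i + 10, true),
        centralDirSize: view.getUint32(i + 12, true),
        centralDirOffset: view.getUint32(i + 16, true),
      };
    }
  }
  throw new Error("EOCD record not found: not a ZIP file?");
}
```

From centralDirOffset an implementation can then read the per-entry
records (filename, mtime, sizes, CRC-32) in one pass, without touching
the compressed data itself.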

Though I still think that we should support reading out specific files
using a filename as a key. I think a common use case for ArchiveReader
is going to be web developers wanting to download a set of resources
from their own website, using a .zip file as a way to get compression
and packaging. In that case they can easily either stick to ASCII
filenames or encode the names as UTF-8.
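For that controlled-archive case, name-keyed lookup is easy to build
on top of a raw-bytes listing. A sketch, assuming UTF-8 names; the
`entries` shape here is the hypothetical rawName listing discussed
below, not a shipped API:

```javascript
// Sketch: build a filename -> entry index from raw name bytes,
// assuming the site that produced the archive encoded names as UTF-8.
function buildNameIndex(entries) {
  // Non-fatal decoding: undecodable bytes become U+FFFD rather than
  // throwing, mirroring the "garbage filenames" problem Glenn raises.
  const decoder = new TextDecoder("utf-8", { fatal: false });
  const index = new Map();
  for (const entry of entries) {
    index.set(decoder.decode(entry.rawName), entry);
  }
  return index;
}
```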

By allowing them to download a .zip file, they can also store that
.zip in compressed form in IndexedDB or the FileSystem API in order to
use less space on the user's device. (Additionally, IO often gets
faster with .zip files because the time saved by doing less IO
outweighs the time spent decompressing. Obviously this is very
dependent on what data is being stored.)

>> . Do you think it can be useful?
>> . Do you see any limitation, any feature missing?
>
> It should be possible to get the CRC32 of files, which ZIP stores in the
> central directory.  This both allows the user to perform checksum
> verification himself if wanted, and all the other variously useful things
> about being able to get a file's checksum without having to read the whole
> file.

One way we could support this would be to have a method that returns a
list of metadata about each entry, probably together with the File
object itself. So we could return an array of objects like:

[ {
    rawName: <UInt8Array>,
    file: <File object>,
    crc32: <UInt8Array>
  },
  {
    rawName: <UInt8Array>,
    file: <File object>,
    crc32: <UInt8Array>
  },
  ...
]

That way we can also leave out the CRC for archive types that don't
support it.

Though I'm not convinced that CRCs are important enough to put them in
the first iteration of the API.
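For reference, the checksum ZIP stores in the central directory is the
standard CRC-32; a bitwise sketch a caller could use to verify an
extracted file against the crc32 field above:

```javascript
// Sketch of CRC-32 as used by ZIP (reflected, polynomial 0xEDB88320).
// Computed bit by bit for clarity; real implementations use a table.
function crc32(bytes) {
  let crc = 0xffffffff;
  for (let i = 0; i < bytes.length; i++) {
    crc ^= bytes[i];
    for (let bit = 0; bit < 8; bit++) {
      // XOR in the polynomial whenever the low bit is set.
      crc = (crc >>> 1) ^ (0xedb88320 & -(crc & 1));
    }
  }
  // Final complement; >>> 0 keeps the result an unsigned 32-bit value.
  return (crc ^ 0xffffffff) >>> 0;
}
```

The standard check value is crc32("123456789") === 0xCBF43926, which
is a handy sanity test for any implementation.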

> (I don't think CRC32 checks should be performed automatically, since it's
> too hard for that to make sense when random access is involved.)

I agree with this.

>>   // The ArchiveReader object works with Blob objects:
>>   var archiveReader = new ArchiveReader(file);
>>
>>   // Any request is asynchronous:
>>
>
> The only operation that needs to be asynchronous is creating the
> ArchiveReader itself.  It should parse the ZIP central record before
> returning a result.  Once you've done that you can do the rest
> synchronously, because no further I/O is necessary until you actually read
> data from a file.

This is definitely an interesting idea. The current API is designed
around doing the IO when each individual operation is performed. You
are proposing to do all IO up front, which allows all subsequent
operations to be synchronous.
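The shape Glenn describes might look something like the following
sketch, where a hypothetical in-memory Map stands in for the parsed
central directory (none of these names are from the proposal):

```javascript
// Sketch: one asynchronous "open" step that reads the archive index,
// after which lookups are synchronous because no further I/O is needed.
class SyncArchiveReader {
  constructor(index) {
    this.index = index; // Map of name -> entry, built during open()
  }
  static async open(loadIndex) {
    // The only async step: read and parse the central directory.
    const index = await loadIndex();
    return new SyncArchiveReader(index);
  }
  getFile(name) {
    // Synchronous: the index is already in memory.
    const entry = this.index.get(name);
    if (entry === undefined) throw new Error("no such entry: " + name);
    return entry;
  }
}
```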

I suspect that doing the IO "lazily" can provide better performance
for some types of operations, such as only wanting to extract a single
resource from an archive. But maybe the difference wouldn't be that
big in most cases.

But I like this approach a lot if we can make it work. The main thing
I'd be worried about, apart from the IO performance above, is whether
we can make it work for a larger set of archive formats. For example,
can we make it work for .tar and .tar.gz? I suspect we couldn't, but
we would need to verify.
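For plain .tar the answer looks tractable: each entry begins with a
512-byte header giving the name and size, so one pass over the headers
(seeking past each body) builds a complete index. .tar.gz is the hard
case, since gzip is a single stream with no random access, so reaching
the last entry means decompressing everything before it. A sketch of
reading one tar (ustar-style) header:

```javascript
// Sketch: parse the name and size out of one 512-byte tar header block.
function parseTarHeader(block) {
  // name: bytes 0-99, NUL-padded ASCII
  let end = 0;
  while (end < 100 && block[end] !== 0) end++;
  const name = String.fromCharCode(...block.subarray(0, end));
  // size: bytes 124-135, octal ASCII, NUL/space terminated
  const sizeField = String.fromCharCode(...block.subarray(124, 136))
    .replace(/[\0 ]+$/, "");
  const size = parseInt(sizeField, 8);
  return { name, size };
}
```

An indexer would read a header, skip `size` bytes (rounded up to the
next 512-byte boundary), and repeat, which is cheap on a seekable Blob.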

/ Jonas


