[whatwg] Archive API - proposal

Tue Aug 14 13:20:48 PDT 2012

(I've reordered my responses to give a more logical progression.)

On Tue, Jul 17, 2012 at 9:23 PM, Andrea Marchesini <baku at mozilla.com> wrote:

> // The getFilenames handler receives a list of DOMString:
> var handle = this.reader.getFile(this.result[i]);
>

This interface is problematic.  Since ZIP files don't have a standard
encoding, filenames in ZIPs are often garbage.  This API requires that
filenames round-trip uniquely, or else files aren't accessible at all.  For
example, if you have two filenames in CP932, "日" and "本", but the encoding
isn't determined correctly, you may end up with two files both with a
filename of "??".  Either you can't open either file, or you can only open
one of them.  This isn't theoretical; I hit ZIP files like this in the wild
regularly.
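
As a rough illustration of the collision (the entry list below is made up
for the example, not taken from a real archive), keying by filename loses
an entry as soon as two names decode to the same string, while positional
access keeps both:

```javascript
// Hypothetical listing, as it might look after a failed encoding guess:
// two distinct entries whose names both decoded to "??".
const entries = [
    { name: "??", size: 120 },
    { name: "??", size: 4096 },
];

// Name-keyed access: the second entry overwrites the first.
const byName = new Map();
for (const entry of entries) {
    byName.set(entry.name, entry);
}

console.log(byName.size);    // 1: only one entry is reachable by name
console.log(entries.length); // 2: both entries remain reachable by index
```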

Instead, I'd recommend that the primary API simply returns File objects
directly from the ZIP.  For example:

var reader = archive.getFiles();
reader.onsuccess = function(result) {
    // result = [File, File, File, File...];

    console.log(result[0].name);
    // read the file (FileReader takes no constructor arguments)
    var fileReader = new FileReader();
    fileReader.readAsArrayBuffer(result[0]);
}

This allows opening files without any dependency on the filename.  Since
File objects are by design lightweight--no decompression should happen
until you actually read from the file--this isn't expensive and won't
perform any extra I/O.  All the information you need to expose a File
object is in the central directory (filename, mtime, decompressed size).
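
To make that concrete, here is a minimal sketch of reading one central
directory file header with a DataView.  The function name and the shape of
the returned object are my own, not part of any proposal, and it assumes a
plain (non-ZIP64) archive.  Note that the name is returned as raw bytes,
since the encoding isn't reliably known:

```javascript
// Parse one ZIP central directory file header starting at `offset`.
// All multi-byte fields are little-endian; the fixed part is 46 bytes.
function parseCentralDirectoryEntry(view, offset) {
    if (view.getUint32(offset, true) !== 0x02014b50) {
        throw new Error("not a central directory file header");
    }
    const nameLength = view.getUint16(offset + 28, true);
    const extraLength = view.getUint16(offset + 30, true);
    const commentLength = view.getUint16(offset + 32, true);
    return {
        // Raw filename bytes; decoding is left to the caller.
        nameBytes: new Uint8Array(view.buffer, view.byteOffset + offset + 46, nameLength),
        dosTime: view.getUint16(offset + 12, true),
        dosDate: view.getUint16(offset + 14, true),
        crc32: view.getUint32(offset + 16, true),
        compressedSize: view.getUint32(offset + 20, true),
        uncompressedSize: view.getUint32(offset + 24, true),
        localHeaderOffset: view.getUint32(offset + 42, true),
        // Where the next header begins, if there is one.
        nextOffset: offset + 46 + nameLength + extraLength + commentLength,
    };
}
```

Nothing here touches the compressed data, which is why handing out one
object per entry doesn't cost any extra I/O.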

> I would like to receive feedback about this. In particular:
> . Do you think it can be useful?
> . Do you see any limitation, any feature missing?
>

It should be possible to get the CRC32 of files, which ZIP stores in the
central directory.  This lets users perform checksum verification
themselves if wanted, and enables the other things a checksum is useful
for, without having to read the whole file.

(I don't think CRC32 checks should be performed automatically, since it's
too hard for that to make sense when random access is involved.)
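
If the stored CRC is exposed, user-side verification could look roughly
like the sketch below.  crc32() and matchesExpectedCRC() are hypothetical
helpers written for this example; the checksum is the standard reflected
CRC-32 (polynomial 0xEDB88320) that ZIP uses:

```javascript
// Precompute the byte-wise lookup table for the reflected CRC-32.
const CRC_TABLE = new Uint32Array(256);
for (let n = 0; n < 256; n++) {
    let c = n;
    for (let k = 0; k < 8; k++) {
        c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
    }
    CRC_TABLE[n] = c >>> 0;
}

// Compute the CRC-32 of an array-like of bytes (e.g. a Uint8Array).
function crc32(bytes) {
    let crc = 0xFFFFFFFF;
    for (let i = 0; i < bytes.length; i++) {
        crc = CRC_TABLE[(crc ^ bytes[i]) & 0xFF] ^ (crc >>> 8);
    }
    return (crc ^ 0xFFFFFFFF) >>> 0;
}

// Compare fully-read file contents against the CRC from the archive.
function matchesExpectedCRC(bytes, expectedCRC) {
    return crc32(bytes) === (expectedCRC >>> 0);
}
```

The caller decides when it has read the complete file and wants to check
it, which sidesteps the random access problem.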

>   // The ArchiveReader object works with Blob objects:
>   var archiveReader = new ArchiveReader(file);
>
>   // Any request is asynchronous:
>

The only operation that needs to be asynchronous is creating the
ArchiveReader itself.  It should parse the ZIP central directory before
returning a result.  Once you've done that you can do the rest
synchronously, because no further I/O is necessary until you actually read
data from a file.

This gives the following, simpler interface:

var opener = new ZipOpener(file);
opener.onerror = function() { console.error("Loading failed"); }
opener.onsuccess = function(zipFile)
{
    // .files is a FileList, representing each file in the archive.
    if(zipFile.files.length == 0) { console.error("ZIP file is empty"); return; }

    var example_file = zipFile.files[0];
    console.log("The first filename is", example_file.name,
                "with an expected CRC of", example_file.expectedCRC);

    // Read from the file:
    var reader = new FileReader();
    reader.readAsArrayBuffer(example_file);

    // For convenience, add "getter File? (DOMString name)" to FileList, to
    // find a file by name.  This is equivalent to iterating through files[]
    // and comparing .name.  If no match is found, return null.  This could
    // be a function instead of a getter.
    var example_file2 = zipFile.files["file.txt"];
    if(example_file2 == null) { console.error("file.txt not found in ZIP"); return; }
}

(To fit expectedCRC in there, it would actually need to use a subclass of
File, not File itself.)
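
The by-name lookup described in the comment above amounts to something
like this.  findFileByName is a hypothetical stand-alone version of the
getter, and the plain objects stand in for File objects:

```javascript
// Return the first entry whose .name matches, or null if none does.
// Works on anything with .length and indexed access, like a FileList.
function findFileByName(files, name) {
    for (let i = 0; i < files.length; i++) {
        if (files[i].name === name) {
            return files[i];
        }
    }
    return null;
}

// Stand-in data for the example.
const sampleFiles = [{ name: "readme.txt" }, { name: "file.txt" }];
console.log(findFileByName(sampleFiles, "file.txt").name);  // file.txt
console.log(findFileByName(sampleFiles, "missing.txt"));    // null
```

Since it's only a convenience on top of the list, files with unusable
names are still reachable by index.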

This also eliminates an error condition (no getFile error callback), and
since .files looks just like HTMLInputElement.files, it can be used
directly with code written for it.  For example, if you have a function
"uploadAllFiles(files)", you can pass in either an <input type=file
multiple>'s .files or a zipFile.files, and they'll both work.
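
A small sketch of why that works: a consumer written against .length and
indexed access doesn't care where the list came from.  collectNames and
both sample lists are made up for illustration; real code would receive
FileList objects:

```javascript
// Hypothetical consumer that only relies on .length and indexed access,
// the way code written for HTMLInputElement.files typically does.
function collectNames(files) {
    const names = [];
    for (let i = 0; i < files.length; i++) {
        names.push(files[i].name);
    }
    return names;
}

// Stand-in for an <input type=file multiple>'s .files.
const fromInput = { length: 2, 0: { name: "a.txt" }, 1: { name: "b.txt" } };
// Stand-in for zipFile.files, with the extra expectedCRC field.
const fromZip = [{ name: "c.txt", expectedCRC: 0x12345678 }];

console.log(collectNames(fromInput)); // [ 'a.txt', 'b.txt' ]
console.log(collectNames(fromZip));   // [ 'c.txt' ]
```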

-- 
Glenn Maynard