[whatwg] Archive API - proposal

Wed Aug 15 23:22:41 PDT 2012

On Wed, Aug 15, 2012 at 9:38 PM, Glenn Maynard <glenn at zewt.org> wrote:
> On Wed, Aug 15, 2012 at 10:10 PM, Jonas Sicking <jonas at sicking.cc> wrote:
>>
>> Though I still think that we should support reading out specific files
>> using a filename as a key. I think a common use-case for ArchiveReader
>> is going to be web developers wanting to download a set of resources
>> from their own website and wanting to use a .zip file as a way to get
>> compression and packaging. In that case they can easily either stick
>> to ASCII filenames, or encode the names in UTF-8.
>
>
> That's what this was for:
>
>
>     // For convenience, add "getter File? (DOMString name)" to FileList, to
>     // find a file by name.  This is equivalent to iterating through files[]
>     // and comparing .name.  If no match is found, return null.  This could
>     // be a function instead of a getter.
>     var example_file2 = zipFile.files["file.txt"];
>     if(example_file2 == null) { console.error("file.txt not found in ZIP"); return; }
>
> I suppose a named getter isn't a great idea--you might have a filename
> "length"--so a "zipFile.files.find('file.txt')" function is probably better.

I definitely wouldn't want to use a getter. That runs into all sorts
of problems and the syntactical wins are pretty small.
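
To make the "length" problem concrete, here is a minimal sketch. It uses a
plain array of stand-in objects instead of a real FileList, and findFile is
a hypothetical helper name, not part of any proposal:

```javascript
// Hypothetical helper: look up an entry by exact name, or return null.
function findFile(files, name) {
  for (const file of files) {
    if (file.name === name) return file;
  }
  return null;
}

// Stand-ins for File objects; only the .name property matters here.
const files = [{ name: "file.txt" }, { name: "length" }];

// An explicit function finds the entry that happens to be called "length".
const byFind = findFile(files, "length");

// Property access resolves to the list's own length property instead.
const byGetter = files["length"];

console.log(byFind.name, byGetter);
```

With property-style access the archive entry named "length" is shadowed by
the list's own property, which is the kind of problem a named getter would
run into.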

>> One way we could support this would be to have a method which allows
>> getting a list of meta-data about each entry. Probably together with
>> the File object itself. So we could return an array of objects like:
>>
>> [ {
>>     rawName: <Uint8Array>,
>>     file: <File object>,
>>     crc32: <Uint8Array>
>>   },
>>   {
>>     rawName: <Uint8Array>,
>>     file: <File object>,
>>     crc32: <Uint8Array>
>>   },
>>   ...
>> ]
>>
>> That way we can also leave out the crc from archive types that don't
>> support it.
>
> This means exposing two objects per file.  I'd prefer a single File-subclass
> object per file, with any extra metadata put on the subclass.

First of all, we'd be talking about 5 vs. 6 objects per file entry:
two ArrayBuffers, two ArrayBufferViews, one File and potentially one
JS-object. Actually, in Gecko it's more like 8 vs. 9 objects once you
start counting the C++ objects and their JS-wrappers.

Second, at least in the Gecko engine, allocating the first 5 objects
takes about three orders of magnitude more time than allocating the
JS-object.

I'm also not a fan of sticking the crc32 on the File object itself
since we don't actually know that that's the correct crc32 value.
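
As a rough sketch of what "not known to be correct" means in practice: a
caller holding the per-entry metadata could recompute the checksum over the
extracted bytes and compare. The entry shape below is an assumption made
for illustration (crc32 as a plain number, bytes as an already-extracted
Uint8Array), and checkEntry is a hypothetical helper:

```javascript
// Bitwise CRC-32 (reflected polynomial 0xEDB88320), the variant ZIP uses.
function crc32(bytes) {
  let crc = 0xFFFFFFFF;
  for (let i = 0; i < bytes.length; i++) {
    crc ^= bytes[i];
    for (let k = 0; k < 8; k++) {
      crc = (crc & 1) ? (crc >>> 1) ^ 0xEDB88320 : crc >>> 1;
    }
  }
  return (crc ^ 0xFFFFFFFF) >>> 0; // force an unsigned 32-bit result
}

// Hypothetical check: true/false if the archive stored a CRC, null if the
// archive type has none and the field was left out of the entry.
function checkEntry(entry) {
  if (entry.crc32 === undefined) return null;
  return crc32(entry.bytes) === entry.crc32;
}

const bytes = Uint8Array.from("123456789", (c) => c.charCodeAt(0));
const good = { bytes, crc32: 0xCBF43926 }; // standard CRC-32 check value
const stale = { bytes, crc32: 0 };         // stored value does not match
const noCrc = { bytes };                   // e.g. a format without CRCs

console.log(checkEntry(good), checkEntry(stale), checkEntry(noCrc));
```

Keeping the stored value in the metadata object makes it clearer that it is
what the archive claims, rather than a verified property of the File.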

>> But I like this approach a lot if we can make it work. The main thing
>> I'd be worried about, apart from the IO performance above, is if we
>> can make it work for a larger set of archive formats. Like, can we
>> make it work for .tar and .tar.gz? I think we couldn't but we would
>> need to verify.
>
> It wouldn't handle it very well, but the original API wouldn't, either.  In
> both cases, the only way to find filenames in a TAR--whether it's to search
> for one or to construct a list--is to scan through the whole file (and
> decompress it all, for .tgz).  Simply retrieving a list of filenames from a
> large .tgz would thrash the user's disk and chew CPU.
>
> I don't think there's much use in supporting .tar, anyway.  Even if you want
> true streaming (which would be a different API anyway, since we're reading
> from a Blob here), ZIP can do that too, by using the local file headers
> instead of the central directory.

The main argument that I could see is that the initial proposal
allowed extracting files from a .tar.gz while only extracting up to
the point of finding the file-to-be-extracted, as long as
.getFileNames wasn't called. I'll grant that isn't a huge benefit.
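
For illustration, here is a minimal sketch of that early-exit scan over an
uncompressed tar image held in a Uint8Array. It reads only the name and
size fields of each 512-byte header and ignores checksums, the ustar
prefix field and long-name extensions. For a .tar.gz the same loop would
need to sit on top of an incremental decompressor, which is not shown.
findInTar, tarEntry and concat are hypothetical helper names:

```javascript
// Read a NUL-terminated ASCII field from a tar header block.
function readField(block, offset, length) {
  let end = offset;
  while (end < offset + length && block[end] !== 0) end++;
  return String.fromCharCode(...block.subarray(offset, end));
}

// Scan headers in order and stop at the first entry named `wanted`.
function findInTar(tar, wanted) {
  let pos = 0;
  let headersRead = 0;
  while (pos + 512 <= tar.length) {
    const header = tar.subarray(pos, pos + 512);
    if (header.every((b) => b === 0)) break; // end-of-archive marker
    headersRead++;
    const name = readField(header, 0, 100);
    const size = parseInt(readField(header, 124, 12).trim(), 8) || 0;
    const dataStart = pos + 512;
    if (name === wanted) {
      return { data: tar.subarray(dataStart, dataStart + size), headersRead };
    }
    pos = dataStart + Math.ceil(size / 512) * 512; // data is padded to 512
  }
  return null;
}

// Demo-only helpers that build a tiny tar image in memory (no checksums).
function tarEntry(name, text) {
  const body = Uint8Array.from(text, (c) => c.charCodeAt(0));
  const out = new Uint8Array(512 + Math.ceil(body.length / 512) * 512);
  for (let i = 0; i < name.length; i++) out[i] = name.charCodeAt(i);
  const size = body.length.toString(8).padStart(11, "0");
  for (let i = 0; i < size.length; i++) out[124 + i] = size.charCodeAt(i);
  out.set(body, 512);
  return out;
}

function concat(parts) {
  const out = new Uint8Array(parts.reduce((n, p) => n + p.length, 0));
  let at = 0;
  for (const p of parts) { out.set(p, at); at += p.length; }
  return out;
}

const tar = concat([
  tarEntry("a.txt", "first"),
  tarEntry("b.txt", "second"),
  tarEntry("c.txt", "third"),
  new Uint8Array(1024), // two zero blocks terminate the archive
]);

const hit = findInTar(tar, "b.txt");
console.log(hit.headersRead, String.fromCharCode(...hit.data));
```

The scan never looks at the header for c.txt, whereas producing a full
list of filenames would have to walk every header in the archive.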

/ Jonas