[whatwg] Archive API - proposal
Glenn Maynard
glenn at zewt.org
Wed Aug 15 21:38:56 PDT 2012
On Wed, Aug 15, 2012 at 10:10 PM, Jonas Sicking <jonas at sicking.cc> wrote:
> Though I still think that we should support reading out specific files
> using a filename as a key. I think a common use-case for ArchiveReader
> is going to be web developers wanting to download a set of resources
> from their own website and wanting to use a .zip file as a way to get
> compression and packaging. In that case they can easily either ensure
> to stick with ASCII filenames, or encode the names in UTF8.
>
That's what this was for:
// For convenience, add "getter File? (DOMString name)" to FileList, to
// find a file by name. This is equivalent to iterating through files[]
// and comparing .name. If no match is found, return null. This could be
// a function instead of a getter.
var example_file2 = zipFile.files["file.txt"];
if (example_file2 == null) { console.error("file.txt not found in ZIP"); return; }
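As a rough sketch of the equivalence described in that comment, the lookup can be written as a linear scan. The array of plain objects below is only a stand-in for a FileList, and the function name find is an assumption, not part of any existing API:

```javascript
// Hypothetical stand-in for a FileList: an array of objects with a .name.
const files = [
  { name: "file.txt", size: 12 },
  { name: "length", size: 3 },
];

// Scan the list and compare .name; return null if nothing matches.
function find(fileList, name) {
  for (let i = 0; i < fileList.length; i++) {
    if (fileList[i].name === name) {
      return fileList[i];
    }
  }
  return null;
}

// Bracket lookup on an array-like object hits built-in properties first:
// this is the number of entries, not the entry named "length".
const shadowed = files["length"];

// The explicit function returns the entry itself.
const byFunction = find(files, "length");
```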
I suppose a named getter isn't a great idea--you might have a filename
"length"--so a "zipFile.files.find('file.txt')" function is probably better.
> By allowing them to download a .zip file, they can also store that
> .zip in compressed form in IndexedDB or the FileSystem API in order to
> use less space on the user's device. (Additionally many times IO gets
> faster by using .zip files because the time saved in doing less IO is
> larger than the time spent decompressing. Obviously very dependent on
> what data is being stored).
>
There's also the question of when decompression happens--you don't want to
decompress the whole thing in advance if you can avoid it, since if the
user isn't doing random access you can stream the decompression--but that's
just QoI (quality of implementation), of course.
> One way we could support this would be to have a method which allows
> getting a list of meta-data about each entry. Probably together with
> the File object itself. So we could return an array of objects like:
>
> [ {
> rawName: <UInt8Array>,
> file: <File object>,
> crc32: <UInt8Array>
> },
> {
> rawName: <UInt8Array>,
> file: <File object>,
> crc32: <UInt8Array>
> },
> ...
> ]
>
> That way we can also leave out the crc from archive types that doesn't
> support it.
>
This means exposing two objects per file. I'd prefer a single
File-subclass object per file, with any extra metadata put on the subclass.
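A minimal sketch of what that could look like, assuming a hypothetical ArchiveFile subclass. FileLike is only a stand-in for the platform File interface so that the example runs on its own, and storing the CRC as a number rather than a Uint8Array is my choice here, not something settled in this thread:

```javascript
// Stand-in for the platform File interface.
class FileLike {
  constructor(name, size) {
    this.name = name;
    this.size = size;
  }
}

// Hypothetical subclass: one object per archive entry, metadata included.
class ArchiveFile extends FileLike {
  constructor(name, size, rawName, crc32) {
    super(name, size);
    this.rawName = rawName; // undecoded filename bytes (Uint8Array)
    this.crc32 = crc32; // null for archive formats that carry no CRC
  }
}

const entry = new ArchiveFile(
  "file.txt",
  12,
  new TextEncoder().encode("file.txt"),
  0x3610a686 // arbitrary example value, not a real checksum
);
```

Code that only knows about File keeps working with such an object, and archive-aware code reads the extra fields from the same object.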
>
> This is definitely an interesting idea. The current API is designed
> around doing the IO when each individual operation is done. You are
> proposing to do all IO up front which allows all operations to be
> synchronous.
>
> I suspect that doing the IO "lazily" can provide better performance
> for some types of operations, such as only wanting to extract a single
> resource from an archive. But maybe the difference wouldn't be that
> big in most cases.
>
I'd expect the I/O savings to be negligible, since ZIP has a central
directory at the end, allowing the whole thing to be read very quickly.
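To illustrate why this is cheap: the ZIP format ends with an "end of central directory" record that gives the offset and size of the directory, so a reader only has to look at the tail of the file. The following is a sketch only. The function name is made up, it takes the bytes as a Uint8Array, and it ignores ZIP64 and multi-disk archives:

```javascript
const EOCD_SIGNATURE = 0x06054b50; // "PK\x05\x06" read as little-endian
const EOCD_MIN_SIZE = 22; // fixed part, without the trailing comment

// Hypothetical helper: locate the end-of-central-directory record.
function findEndOfCentralDirectory(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // The record may be followed by a comment of up to 65535 bytes,
  // so scan backward from the last position where it could start.
  const lowest = Math.max(0, bytes.byteLength - EOCD_MIN_SIZE - 0xffff);
  for (let pos = bytes.byteLength - EOCD_MIN_SIZE; pos >= lowest; pos--) {
    if (view.getUint32(pos, true) === EOCD_SIGNATURE) {
      return {
        entryCount: view.getUint16(pos + 10, true),
        directorySize: view.getUint32(pos + 12, true),
        directoryOffset: view.getUint32(pos + 16, true),
      };
    }
  }
  return null; // no record found: not a ZIP, or truncated
}
```

With directoryOffset and directorySize in hand, one more read fetches every filename without touching any file data.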
I hope creating an array of File objects (even thousands of them) isn't too
expensive. Even if it is, though, this could be refactored to still give a
synchronous interface: store the file directory natively (in a non-File,
non-GC'd way), and allow looking up and iterating that list in a way that
only instantiates one File object at a time. (This would lose FileList
API compatibility with <input type=file>, though, and I think that
compatibility is a nice plus.)
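Roughly, that refactoring could look like the sketch below. Everything in it is a stand-in: the directory array plays the role of the natively stored file directory, and makeFile stands for whatever would create a real File backed by a slice of the archive:

```javascript
// Stand-in for a natively stored directory: plain records, no File objects.
const directory = [
  { name: "a.txt", offset: 0, size: 10 },
  { name: "b.txt", offset: 10, size: 20 },
  { name: "c.txt", offset: 30, size: 5 },
];

// Counts how many file objects have been instantiated so far.
let instantiated = 0;

// Stand-in for creating a File backed by a slice of the archive.
function makeFile(record) {
  instantiated++;
  return { name: record.name, size: record.size };
}

// Look up one entry by name; only the match becomes a file object.
function findFile(name) {
  const record = directory.find((r) => r.name === name);
  return record ? makeFile(record) : null;
}

// Iterate all entries, creating each file object only when requested.
function* iterateFiles() {
  for (const record of directory) {
    yield makeFile(record);
  }
}
```

Both operations stay synchronous, and the number of live file objects is bounded by what the caller actually asks for.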
> But I like this approach a lot if we can make it work. The main thing
> I'd be worried about, apart from the IO performance above, is if we
> can make it work for a larger set of archive formats. Like, can we
> make it work for .tar and .tar.gz? I think we couldn't but we would
> need to verify.
>
It wouldn't handle it very well, but the original API wouldn't, either. In
both cases, the only way to find filenames in a TAR--whether it's to search
for one or to construct a list--is to scan through the whole file (and
decompress it all, for .tgz). Simply retrieving a list of filenames from a
large .tgz would thrash the user's disk and chew CPU.
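To make the cost concrete, here is a sketch of listing names in an uncompressed TAR. Each entry is a 512-byte header followed by its data, so the reader has to hop from header to header across the entire file. The function names are hypothetical, and the sketch assumes ASCII names and ignores the ustar prefix field and GNU long-name extensions:

```javascript
const TAR_BLOCK = 512;

// Read a NUL-terminated ASCII field from a tar header.
function readField(bytes, start, length) {
  let end = start;
  while (end < start + length && bytes[end] !== 0) {
    end++;
  }
  return String.fromCharCode(...bytes.subarray(start, end));
}

// Hypothetical helper: walk every header to collect the filenames.
function listTarNames(bytes) {
  const names = [];
  let pos = 0;
  while (pos + TAR_BLOCK <= bytes.length) {
    const name = readField(bytes, pos, 100);
    if (name === "") {
      break; // an all-zero block marks the end of the archive
    }
    // The size is octal ASCII in the 12 bytes at offset 124.
    const size = parseInt(readField(bytes, pos + 124, 12).trim(), 8) || 0;
    names.push(name);
    // Skip the header plus the data, padded to a whole number of blocks.
    pos += TAR_BLOCK + Math.ceil(size / TAR_BLOCK) * TAR_BLOCK;
  }
  return names;
}
```

For a .tgz the same walk applies to the decompressed stream, so every byte before the last header has to be inflated first.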
I don't think there's much use in supporting .tar, anyway. Even if you
want true streaming (which would be a different API anyway, since we're
reading from a Blob here), ZIP can do that too, by using the local file
headers instead of the central directory.
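A sketch of that front-to-back read, using the 30-byte local file header that precedes each entry in a ZIP. The function name is made up, names are assumed to be ASCII, and it operates on a Uint8Array for simplicity where a real streaming reader would consume chunks as they arrive:

```javascript
const LOCAL_HEADER_SIGNATURE = 0x04034b50; // "PK\x03\x04" read as little-endian
const LOCAL_HEADER_SIZE = 30; // fixed part, before the name and extra field

// Hypothetical helper: list names by walking local headers in order.
function listNamesFromLocalHeaders(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const names = [];
  let pos = 0;
  while (
    pos + LOCAL_HEADER_SIZE <= bytes.byteLength &&
    view.getUint32(pos, true) === LOCAL_HEADER_SIGNATURE
  ) {
    const flags = view.getUint16(pos + 6, true);
    const compressedSize = view.getUint32(pos + 18, true);
    const nameLength = view.getUint16(pos + 26, true);
    const extraLength = view.getUint16(pos + 28, true);
    const nameStart = pos + LOCAL_HEADER_SIZE;
    names.push(
      String.fromCharCode(...bytes.subarray(nameStart, nameStart + nameLength))
    );
    // Bit 3 means the sizes follow the data in a data descriptor, so the
    // header alone cannot say where the next entry starts. Stop here.
    if (flags & 0x0008) {
      break;
    }
    pos = nameStart + nameLength + extraLength + compressedSize;
  }
  return names;
}
```

The bit 3 case is the awkward one: a true streaming reader would have to decompress that entry to find where it ends.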
--
Glenn Maynard