[whatwg] Archive API - proposal

Wed Aug 15 17:15:32 PDT 2012

On Wed, Aug 15, 2012 at 6:14 AM, Henri Sivonen <hsivonen at iki.fi> wrote:

> As for the filenames, after an off-list discussion, I think the best
> solution is that UTF-8 is tried first but the ArchiveReader
> constructor takes an optional second argument that names a character
> encoding from the Encoding Standard. This will be known as the
> fallback encoding. If no fallback encoding is provided by the caller
> of the constructor, "Windows-1252" is set as the fallback encoding.
> When ArchiveReader processes a filename from the zip archive, it
> first tests if the byte string is a valid UTF-8 string. If it is, the
> byte string is interpreted as UTF-8 when converting to UTF-16. If the
> filename is not a valid UTF-8 string, it is decoded into UTF-16 using
> the fallback encoding.
>

This would misinterpret filenames as UTF-8.  For example, "黴雨.jpg" in a
CP932 (SJIS) ZIP is also legal UTF-8.  This would happen even though the
user explicitly specified an encoding, and even though UTF-8 is
exceptionally rare in ZIPs (all Windows ZIP software outputs filenames in
the user's ACP, and many don't support UTF-8 at all).
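
To make the ambiguity concrete, here is a minimal sketch of the proposed
"try UTF-8, then fall back" rule.  It assumes the TextDecoder interface from
the Encoding spec is available; decodeFilename is a hypothetical helper for
illustration, not part of any ArchiveReader proposal, and the byte values
are made up.

```javascript
// Hypothetical helper sketching the proposed rule: try strict UTF-8 first,
// then fall back to the caller-supplied encoding.
function decodeFilename(bytes, fallbackEncoding = "windows-1252") {
  try {
    // fatal: true makes decode() throw a TypeError on invalid UTF-8.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch (e) {
    // Not valid UTF-8: decode with the fallback encoding instead.
    return new TextDecoder(fallbackEncoding).decode(bytes);
  }
}

// 0xE9 on its own is not valid UTF-8, so the fallback is used: "é.jpg".
const legacy = decodeFilename(new Uint8Array([0xE9, 0x2E, 0x6A, 0x70, 0x67]));

// The windows-1252 bytes for "Ã©" (0xC3 0xA9) are also a valid UTF-8
// sequence, so they decode as "é" even though the caller explicitly
// asked for windows-1252.
const misread = decodeFilename(new Uint8Array([0xC3, 0xA9]), "windows-1252");
```

The second call is the failure mode: the caller's encoding is silently
ignored whenever the bytes happen to form valid UTF-8.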

On Wed, Aug 15, 2012 at 6:17 AM, Andrea Marchesini
<amarchesini at mozilla.com> wrote:

> I agree. I was thinking that the default encoding for filenames is:
> UTF-8. If filename is not a valid UTF-8 string we can use the
> caller-supplied encoding:
>

I hate to argue against defaulting to UTF-8, but very few ZIPs are actually
UTF-8.  CP1252 as a default will at least often be correct, but UTF-8 will
almost never be.  (The only straightforward way I know to create a ZIP with
UTF-8 filenames is with a *nix commandline client, and most Windows
software won't understand it.)

> var reader = new ArchiveReader(blob, "Windows-1252");
>
> If this fails, this filename/file will be excluded from the results.
>

There's no need.  Decode with proper error handling, as specified in the
Encoding spec: http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html.
This will give placeholder characters (U+FFFD); even if the whole filename
comes out unreadable, the file can still be read, selected from a list,
shown in a thumbnail view, and so on.  Lots of uses aren't dependent on
filenames.
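
As a small illustration, assuming the TextDecoder interface from the
Encoding spec: in its default (non-fatal) mode, malformed input produces
U+FFFD instead of an error, so an entry never has to be dropped.  The byte
values here are made up for the example.

```javascript
// Default (non-fatal) decoding: invalid bytes become U+FFFD, nothing throws.
// Bytes are "f", an invalid 0xFF, then "o.jpg".
const rawName = new Uint8Array([0x66, 0xFF, 0x6F, 0x2E, 0x6A, 0x70, 0x67]);
const decodedName = new TextDecoder("utf-8").decode(rawName);
// decodedName is "f\uFFFDo.jpg": partly unreadable, but the entry still
// has a name and can be listed and opened.
```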

> > It should be possible to get the CRC32 of files, which ZIP stores in
> > the central directory. This both allows the user to perform checksum
> > verification himself if wanted, and all the other variously useful
> > things about being able to get a file's checksum without having to
> > read the whole file.
>
> can we have 'generic' archive API supporting CRC32?
>

Do you actually have any concrete plans for other archive formats?  The
only others commonly used are TAR and RAR.  TAR is unsuitable for
non-archive use (you have to scan the whole file to construct a file list),
and RAR is proprietary.

You could design a checksum API that uses the algorithm for a particular
format, but that's severe overdesign if it never supports anything but
ZIP.  I wouldn't worry about this.
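
For what it's worth, exposing ZIP's stored checksum needs very little: the
CRC-32 is a little-endian 32-bit field 16 bytes into each central directory
file header.  A rough sketch follows; readCentralDirectoryCrc32 is a
hypothetical helper, not a proposed API, and it assumes the caller has
already located the header (finding the central directory through the
end-of-central-directory record is left out).

```javascript
// Hypothetical helper: read the stored CRC-32 from one ZIP central directory
// file header that starts at `offset` within `buffer` (an ArrayBuffer).
function readCentralDirectoryCrc32(buffer, offset) {
  const view = new DataView(buffer);
  // Central directory file header signature "PK\x01\x02", read little-endian.
  if (view.getUint32(offset, true) !== 0x02014b50) {
    throw new Error("not a central directory file header");
  }
  // The CRC-32 field sits 16 bytes into the header, little-endian.
  return view.getUint32(offset + 16, true);
}
```

Nothing here touches the file data itself, which is the point of the
original request: the checksum is available without reading the whole file.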

-- 
Glenn Maynard