[whatwg] Archive API - proposal
glenn at zewt.org
Wed Aug 15 17:15:32 PDT 2012
On Wed, Aug 15, 2012 at 6:14 AM, Henri Sivonen <hsivonen at iki.fi> wrote:
> As for the filenames, after an off-list discussion, I think the best
> solution is that UTF-8 is tried first but the ArchiveReader
> constructor takes an optional second argument that names a character
> encoding from the Encoding Standard. This will be known as the
> fallback encoding. If no fallback encoding is provided by the caller
> of the constructor, "Windows-1252" is set as the fallback encoding.
> When ArchiveReader processes a filename from the zip archive, it
> first tests if the byte string is a valid UTF-8 string. If it is, the
> byte string is interpreted as UTF-8 when converting to UTF-16. If the
> filename is not a valid UTF-8 string, it is decoded into UTF-16 using
> the fallback encoding.
This would misinterpret filenames as UTF-8. For example, "黴雨.jpg" in a
CP932 (SJIS) ZIP is also legal UTF-8. This would happen even though the
user explicitly specified an encoding, and even though UTF-8 is
exceptionally rare in ZIPs (all Windows ZIP software outputs filenames in
the user's ACP, and many don't support UTF-8 at all).
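To make the collision concrete, here is a minimal sketch. It assumes the TextDecoder API from the Encoding spec is available; isValidUtf8 and decodeUtf8First are hypothetical helpers written for illustration, not part of any ArchiveReader proposal. The byte array is the name "黴雨.jpg" as a CP932 tool would store it.

```javascript
// "黴雨.jpg" encoded in CP932 (Shift_JIS): 黴 = EA 80, 雨 = 89 4A.
const sjisName = new Uint8Array([0xEA, 0x80, 0x89, 0x4A, 0x2E, 0x6A, 0x70, 0x67]);

// Hypothetical helper: true when the bytes are well-formed UTF-8.
function isValidUtf8(bytes) {
  try {
    // fatal: true makes decode() throw on malformed input.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}

// Hypothetical sketch of the quoted "UTF-8 first, then fallback" rule.
function decodeUtf8First(bytes, fallbackLabel) {
  return isValidUtf8(bytes)
    ? new TextDecoder("utf-8").decode(bytes)
    : new TextDecoder(fallbackLabel).decode(bytes);
}

// EA 80 89 happens to be a well-formed three-byte UTF-8 sequence (U+A009),
// so the UTF-8 test passes and the caller's "shift_jis" is never consulted.
const misread = decodeUtf8First(sjisName, "shift_jis");
console.log(isValidUtf8(sjisName), misread === "\uA009J.jpg");
```

The caller asked for Shift_JIS and still got a wrong name back, with no error to signal it.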
On Wed, Aug 15, 2012 at 6:17 AM, Andrea Marchesini
<amarchesini at mozilla.com> wrote:
> I agree. I was thinking that the default encoding for filenames is:
> UTF-8. If filename is not a valid UTF-8 string we can use the
> caller-supplied encoding:
I hate to argue against defaulting to UTF-8, but very few ZIPs are actually
UTF-8. CP1252 as a default will at least often be correct, but UTF-8 will
almost never be. (The only straightforward way I know to create a ZIP with
UTF-8 filenames is with a *nix command-line client, and most Windows
software won't understand it.)
> var reader = new ArchiveReader(blob, "Windows-1252");
> If this fails, this filename/file will be excluded from the results.
There's no need. Decode with proper error handling, as specified in the
Encoding spec: http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html.
This will give placeholder characters (U+FFFD); even if the whole filename
comes out unreadable, the file can still be read, selected from a list,
shown in a thumbnail view, and so on. Lots of uses aren't dependent on the filename.
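As a rough illustration of decoding with replacement instead of dropping the entry, here is a sketch that assumes the TextDecoder API from the Encoding spec. decodeFilename and the sample bytes are hypothetical, and UTF-8 is used only because it makes malformed input easy to construct.

```javascript
// Hypothetical helper: decode a raw filename with the given encoding label.
// TextDecoder is non-fatal by default, so malformed input becomes U+FFFD
// instead of raising an error.
function decodeFilename(bytes, label) {
  return new TextDecoder(label).decode(bytes);
}

// "fo", an invalid byte (0xFF can never appear in UTF-8), then "o.jpg".
const damaged = new Uint8Array([0x66, 0x6F, 0xFF, 0x6F, 0x2E, 0x6A, 0x70, 0x67]);
const decodedName = decodeFilename(damaged, "utf-8");

// The entry keeps a usable (if imperfect) name and stays in the listing.
console.log(decodedName === "fo\uFFFDo.jpg");
```

The entry can still be listed, selected and read; only the one undecodable byte is lost.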
> > It should be possible to get the CRC32 of files, which ZIP stores in
> > the central directory. This both allows the user to perform checksum
> > verification himself if wanted, and all the other variously useful
> > things about being able to get a file's checksum without having to
> > read the whole file.
> can we have 'generic' archive API supporting CRC32?
Do you actually have any concrete plans for other archive formats? The
only others commonly used are TAR and RAR. TAR is unsuitable for
non-archive use (you have to scan the whole file to construct a file list),
and RAR is proprietary.
You could design a checksum API that uses the algorithm for a particular
format, but that's severe overdesign if it never supports anything but
ZIP. I wouldn't worry about this.
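For reference, a sketch of what exposing the stored checksum amounts to for ZIP. It assumes the standard central directory file header layout (signature 0x02014B50, CRC-32 as a little-endian 32-bit field at offset 16); readStoredCrc32 and crc32 are hypothetical helper names, not a proposed API.

```javascript
// Lookup table for the CRC-32 used by ZIP (reflected polynomial 0xEDB88320).
const CRC_TABLE = new Uint32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
  }
  CRC_TABLE[n] = c >>> 0;
}

// Hypothetical helper: compute CRC-32 over file contents, for verification.
function crc32(bytes) {
  let crc = 0xFFFFFFFF;
  for (const b of bytes) {
    crc = CRC_TABLE[(crc ^ b) & 0xFF] ^ (crc >>> 8);
  }
  return (crc ^ 0xFFFFFFFF) >>> 0;
}

// Hypothetical helper: read the stored CRC-32 from a central directory
// file header without touching the file data itself.
function readStoredCrc32(view, headerOffset) {
  if (view.getUint32(headerOffset, true) !== 0x02014B50) {
    throw new Error("not a central directory file header");
  }
  return view.getUint32(headerOffset + 16, true);
}
```

A caller that wants verification would compare readStoredCrc32 against crc32 of the extracted bytes; a caller that only wants a cheap checksum would use the stored value alone.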