[whatwg] HTML resource packages

Wed Aug 4 13:01:51 PDT 2010

> Brett Zamir <brettz9 at yahoo.com> wrote:
> 1) I think it would be nice to see explicit confirmation in the spec that this works with offline caching.

Yes.  I'll do that.

> 2) Could data files such as .txt, .json, or .xml files be used as part of
> such a package as well?

> 3) Can XMLHttpRequest be made to reference such files and get them from the
> cache, and if so, when referencing only a zip in the packages attribute, can
> XMLHttpRequest access files in the zip not spelled out by a tag like <link/>?
> I think this would be quite powerful/avoid duplication, even if it adds
> functionality (like other HTML5 features) which would not be available to
> older browsers.

This is tricky.  The problem is: If you have an <img> on a page which might be
able to be served from a resource package, we'll block the download of the
image until we can either serve the request from a resource package or be sure
that no package contains the image.

I can imagine this behavior being confusing with XMLHttpRequests.  On the other
hand, it could certainly be powerful when used correctly.

I think the natural thing is to go ahead and treat things requested by an
XMLHttpRequest the same as anything else on a page and retrieve them from
packages when possible.  If you really don't want your XMLHttpRequest to block
on a resource package, you can always use a POST.  But I need to investigate
more to determine whether this makes sense.
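
To make the blocking rule above concrete, here is a rough Python sketch of the
decision for a single request.  The Package class, its fields, and decide()
are hypothetical names made up for illustration; none of this is spec text,
and the POST bypass is only the idea floated above, not settled behavior.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Package:
    # Hypothetical model of one resource package; not taken from the spec.
    listed: Optional[Set[str]] = None  # None: the package may contain any file
    received: Set[str] = field(default_factory=set)  # entries extracted so far
    complete: bool = False  # True once the whole zip has arrived


def decide(url: str, method: str, packages: List[Package]) -> str:
    """Return 'package', 'wait', or 'network' for one request."""
    if method != "GET":
        return "network"  # assumed: e.g. a POST never blocks on a package
    for pkg in packages:
        if url in pkg.received:
            return "package"  # the entry is already available
    for pkg in packages:
        might_contain = pkg.listed is None or url in pkg.listed
        if might_contain and not pkg.complete:
            return "wait"  # this package cannot be ruled out yet
    return "network"  # no package has it or can still deliver it
```

A request only falls through to the network once every package that could
contain the URL has fully arrived without providing it.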

> 4) Could such a protocol also be made to accommodate profiles of packages,
> e.g., by a namespace being allowable somewhere for each package?

This sounds way outside the scope of what we're trying to do with resource
packages.  I'm all for designing for the future, but I don't think we want to
introduce the complexity even of these namespaces unless we intend to use them
immediately.

> Maciej Stachowiak <mjs at apple.com> wrote:
>
> Have you done any performance testing of this feature, and if so can you share any of that data?

There's a document (PDF) with some rough performance numbers in the bug:

    https://bugzilla.mozilla.org/attachment.cgi?id=455820

Although the results are preliminary, I think doing much more than this on a
simulated network for a test page might be going a bit overboard.  Results from
real pages over real networks would be much more meaningful at this point.

> Separately, I am curious to hear how http headers are handled; it's a TODO in
> the spec, and what the TODO says seems poor for the Content-Type header in
> particular. It would make it hard to use package resources in any context
> that looks at the MIME type rather than always sniffing. Any thoughts on
> this?

The intent is for UAs to sniff the content-type of anything coming from a
resource package, so I think that TODO needs to be turned on its head: The UA
shouldn't apply any of the response headers from the resource package to its
elements.
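
As a rough illustration of what sniffing means here, the Python sketch below
guesses a type from the leading bytes of an extracted entry and ignores any
headers.  sniff_image_type() is a hypothetical helper that covers only three
well-known image signatures; it is not the sniffing algorithm a UA would
actually use.

```python
def sniff_image_type(data: bytes) -> str:
    """Guess a MIME type from leading bytes (illustrative signatures only)."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    # Unknown content: fall back to a generic binary type.
    return "application/octet-stream"
```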

> Christoph Päper <christoph.paeper at crissov.de> wrote:
>> A page indicates in its <html> element that it uses one or more resource packages (…).
>
> Why do you want to put this on the HTML level (exclusively), not the HTTP level?
> ...
> Images might be referenced from within HTML or CSS files.

If you reference an image from a CSS file and include that CSS file in an HTML
file which uses resource packages, the image can be loaded from the resource
package.

> Why did you decide against <link rel="resource-package"
> href="pkg1.zip#files='img1.png,…'"/> or something like that? (The hash part
> is just guesswork.)

We actually originally spec'ed resource packages with the <link> tag, but we
encountered some difficulties with this.  For example, it led to confusing
behavior when a resource package was defined after a <link rel='javascript'>.
Do we load the script from the network, or do we wait until we've received the
whole <head> before loading any scripts?

Resource packages as a <link> also interacted poorly with Mozilla's speculative
parsing algorithm, which tries to download resources before we run the page's
scripts.  We probably could have come up with semantics which didn't run into
problems with our own speculative parsing implementation, but we realized it
would be difficult to spec it in such a way that we didn't make things very
difficult for *someone*.

> * Argument: What about incremental rendering?

The spec (and our implementation in Firefox) cares deeply about incremental
rendering.  Although the zip format isn't strictly suitable for incremental
extraction, I defined alternate semantics in the spec which should work.

Zip is better than tar-gz for this kind of thing for two reasons:

 * Zip file headers are uncompressed, so you don't have to extract the whole
   file in order to tell what's inside.

 * Entries in a zip file are individually compressed.  Although this might
   cause you to compress less effectively, you can compress all your files
   ahead of time and construct a zip file on the fly pretty cheaply.
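
Both properties are easy to see with Python's standard zipfile module.  The
sketch below builds a small archive in memory, lists its entries without
decompressing any data, and then extracts a single entry on its own.  The file
names and contents are made up.  Note that Python reads the central directory
at the end of the file, whereas an incremental extractor would have to rely on
the per-entry headers as they arrive.

```python
import io
import zipfile


def build_demo_zip() -> bytes:
    """Create a two-entry zip in memory (made-up names and contents)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("img1.png", b"\x00" * 1000)
        zf.writestr("style.css", b"body { margin: 0 }" * 50)
    return buf.getvalue()


def inspect_zip(zip_bytes: bytes):
    """List entries from the uncompressed metadata, then read one entry."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = zf.namelist()  # no entry data is decompressed here
        sizes = {
            info.filename: (info.compress_size, info.file_size)
            for info in zf.infolist()
        }
        css = zf.read("style.css")  # decompresses only this entry
    return names, sizes, css


names, sizes, css = inspect_zip(build_demo_zip())
```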

> Philip Taylor <excors+whatwg at gmail.com> wrote:
> It seems a bit surprising that [pkg.zip img1.png img2.png] provides
> more files than [pkg.zip img1.png] but *fewer* files than [pkg.zip]
> (which includes all files). I can imagine people would write code
> like:
>
>  print "<html packages='[cached-image-thumbnails.zip " . (join " ",
> @thumbnails_which_are_not_out_of_date) . "]'>";
>
> (intending the package to be updated infrequently, and used only for
> images that haven't been modified since the last package update), and
> they would get completely the wrong behaviour when the list is empty.
> So maybe "[pkg.zip]" should mean no files (vs "pkg.zip" which still
> means all files).

I think this is a good idea.  I'll change the spec.
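
In other words, the three forms would be read roughly as in the Python sketch
below.  parse_package_entry() is a hypothetical helper that handles a single
entry written in the bracket notation from Philip's example; it is not the
parsing algorithm from the spec.

```python
from typing import List, Optional, Tuple


def parse_package_entry(entry: str) -> Tuple[str, Optional[List[str]]]:
    """Return (href, files); files is None when the package covers all files."""
    entry = entry.strip()
    if entry.startswith("[") and entry.endswith("]"):
        parts = entry[1:-1].split()
        if not parts:
            raise ValueError("bracketed entry needs a package href")
        # An explicit list, possibly empty: "[pkg.zip]" means no files.
        return parts[0], parts[1:]
    # A bare href: the package may supply every file it contains.
    return entry, None
```

With this reading, a generated list that happens to be empty yields an empty
file list rather than silently switching to "all files".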

> Filenames in zips are byte-strings, not Unicode-character-strings.
> What should happen with non-ASCII in the zip's list of contents?
> People will use standard zip programs and frequently end up with
> various random character encodings in their file - would browsers
> guess or decode as CP437 or decode as UTF-8 or fail? would they look
> at the zip header's language encoding flag? etc.

I guess we need something in the resource packages spec like what timeless
pointed to in the web widgets spec.
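
For reference, the flag Philip mentions is bit 11 (0x800) of each entry's
general purpose flags, which marks the filename as UTF-8.  Without it, the zip
format's traditional default is CP437.  The Python sketch below reports which
of the two each entry declares.  name_encodings() and the file names are made
up for illustration, and whether a browser should trust the flag is exactly
the open question.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: filename is UTF-8


def name_encodings(zip_bytes: bytes) -> dict:
    """Map each entry name to the encoding its header flag declares."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {
            info.filename: "utf-8" if info.flag_bits & UTF8_FLAG else "cp437"
            for info in zf.infolist()
        }


def build_demo_zip() -> bytes:
    """Create a zip with one ASCII and one non-ASCII name (made-up names)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("logo.png", b"x")
        zf.writestr("caf\u00e9.png", b"y")  # Python sets the UTF-8 flag here
    return buf.getvalue()


encodings = name_encodings(build_demo_zip())
```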

> What happens if the document contains multiple <html> elements (not
> all the root element)? (e.g. if it's XHTML, or the elements are added
> by scripts). The packages spec seems to assume there is only ever one.

The packages attribute should work like the manifest attribute currently works.
I don't see language in the cache manifest section of HTML5 (6.6) specifying
what happens when there are multiple <html> elements, so I hope I don't need to
specify this either.  :)

> The note at the end of 4.1 seems to be about avoiding problems like
> http://evil.com/ saying:
>
>    <html packages="eviloverride.zip"> <!-- gets downloaded from evil.com -->
>    <base href="http://bank.com/">
>    <img src="http://bank.com/logo.png"> <!-- this shouldn't be
> allowed to come from the .zip -->
>
> Why is this particular example an important problem?

This was mostly a matter of hygiene: A page shouldn't be able to claim that
it's loaded one domain's resource when it's actually loaded another's.  This
would be significant if the user inquired as to the origin of the resource.
But it certainly wouldn't be the end of the world.

> If the attacker
> wants to insert their own files into their own pages, they can just do
> it directly without using packages. Since this is (I assume) only used
> for resources like images and scripts and stylesheets, and not for <a
> href>s or <iframe href>s, I don't see how it would let the attacker
> circumvent any same-origin restrictions or do anything else dangerous.

> The opposite way seems more dangerous, where evil.com says:
>
>    <html packages="http://evil.com/redirect.cgi?http://secret-bank-intranet-server/packages.zip">
>    <img src="http://evil.com/logo.png">
>    <!-- now use canvas to read the pixel data of the secret logo,
> since it was loaded from the evil.com origin -->

Ah, this is much more incisive.  I guess we'd need to outlaw cross-origin
redirects in package hrefs.
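
A minimal Python sketch of such a check follows, assuming we compare the
(scheme, host, port) of the package href against that of each redirect target.
origin() and redirect_allowed() are hypothetical helpers; a real UA would use
its existing same-origin machinery instead.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str):
    """Return the (scheme, host, port) triple used for the comparison."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, (parts.hostname or "").lower(), port)


def redirect_allowed(package_href: str, redirect_target: str) -> bool:
    """Allow a redirect only if it stays on the package href's origin."""
    return origin(package_href) == origin(redirect_target)
```

Under this rule the redirect.cgi example above would be refused, because the
redirect target lives on a different host than the package href.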

> In 4.3 step 2: What is pkg-url initialised to? (The package href of p?)

Ouch.  Yes, it should be p's package href.

-Justin