[whatwg] Proposal for improved handling of '#' inside of data URIs

Sat Sep 10 16:44:00 PDT 2011

It seems like a bad idea to require look-ahead to parse data URLs.  Is
there some reason we can't just treat the whole payload as part of the
document?  That's almost certainly what authors want.

Adam

On Sat, Sep 10, 2011 at 2:15 PM, Daniel Holbert <dholbert at mozilla.com> wrote:
> Hi whatwg,
>
> I'm writing with a proposal to improve the handling of "#" in data URIs. I'm
> particularly looking for feedback from other browser vendors, but of course
> feedback from others is welcome as well.
>
> SUMMARY:
> ========
> Browsers handle the "#" character in data URIs very differently, and the
> arguably "correct" behavior is probably not what authors actually want in
> many cases.
>
> This could be more intuitive/do-what-I-mean if we restricted the cases under
> which "#" is treated as a fragment-ID delimiter inside of data URIs.  In
> particular: when a "#" character is followed by ">" or "<" in a data URI, I
> propose that we *don't* treat the "#" as a delimiter, and instead just treat
> it as part of the encoded document.
>
> Now, a set of tests, to which I'll refer below:
>  http://people.mozilla.org/~dholbert/dataURIHashTests/tests_v1.xhtml
>
> PROBLEM:
> ========
> When an author writes a data URI for a document that contains a "#"
> character, she may unintentionally end up with broken results (or at least
> inconsistently-handled results), because the "#" may be treated as the end
> of the document & the beginning of the URI's fragment identifier.
>
> (I believe this to be the _technically_ correct (albeit unintuitive)
> behavior per the URI RFC [1] -- it's the behavior we've implemented in
> Firefox 6 [2], and it's what I've described as "Correct" in my testcase,
> with quotes to indicate unintuitiveness.)
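
To make the strict reading concrete, here is a small Python sketch that runs
the regular expression from appendix B of RFC 2396 over a data URI containing
an unescaped "#". The sample URI is made up for illustration and is not taken
from the linked tests.

```python
import re

# Regular expression from RFC 2396, appendix B. Per the RFC, group 2 is the
# scheme, group 5 the path, and group 9 the fragment.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

# Hypothetical data URI with an unescaped "#" inside an attribute value.
uri = 'data:text/html,<body bgcolor="#f00">hello</body>'

match = URI_RE.match(uri)
scheme = match.group(2)
path = match.group(5)      # what a strict parser keeps as the document
fragment = match.group(9)  # what a strict parser treats as the fragment ID

print(scheme)
print(path)
print(fragment)
```

Under this reading the document is cut off at the first "#", and the rest of
the markup ends up in the fragment identifier.
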
>
> Technically, the author *really* should encode the "#" character as "%23",
> if she doesn't want it to be a delimiter.
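
As an illustration of the escaping described above, a minimal Python sketch.
The helper name encode_data_uri and the sample SVG are assumptions for
illustration only.

```python
from urllib.parse import quote, unquote


def encode_data_uri(media_type: str, document: str) -> str:
    """Build a data URI whose payload is fully percent-encoded."""
    # safe="" makes quote() escape every reserved character,
    # so "#" becomes "%23" and can no longer act as a delimiter.
    return "data:" + media_type + "," + quote(document, safe="")


# Hypothetical document containing a "#" in a color value.
doc = ('<svg xmlns="http://www.w3.org/2000/svg">'
       '<rect width="9" height="9" fill="#f00"/></svg>')

encoded_uri = encode_data_uri("image/svg+xml", doc)

# Decoding the payload gives back the original document.
payload = encoded_uri.split(",", 1)[1]
round_trip = unquote(payload)
```

Encoding the whole payload is more than strictly necessary, but it avoids
having to reason about which characters are safe where.
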
>
> However, this gotcha is easy to overlook -- especially because Opera &
> WebKit are less strict than Firefox in this respect and will gladly accept
> "#" inside data URIs under some circumstances.
>
> THE PROPOSAL & HOW IT HELPS:
> ============================
> We can help out the author by relaxing our fragment-ID-parsing rules a bit
> here.
>
> Note that in cases where an author *accidentally* includes "#" inside their
> data URI (e.g. <body background="#f00">), there almost certainly will be
> more content following it -- in particular, there will be an </html>, or an
> </svg>, or at least a ">" (if it's inside the final tag) still to come.
>
> So we can proactively check for >/< characters anywhere after the "#", and
> if we find them, then we can pretty safely assume that the author intended
> for the "#" to be part of the document, rather than a fragment-ID delimiter.
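
One way to read the proposed check, sketched in Python next to the strict
behavior. The function names are hypothetical, and the sketch assumes that
only literal "<" and ">" characters count (percent-encoded %3C / %3E are
ignored), which the proposal does not spell out.

```python
from typing import Optional, Tuple


def split_fragment_strict(uri: str) -> Tuple[str, Optional[str]]:
    """Treat the first "#" as the fragment delimiter."""
    body, sep, fragment = uri.partition("#")
    return body, (fragment if sep else None)


def split_fragment_relaxed(uri: str) -> Tuple[str, Optional[str]]:
    """Ignore any "#" that still has a "<" or ">" somewhere after it."""
    # Position of the last angle bracket, or -1 if there is none.
    last_angle = max(uri.rfind("<"), uri.rfind(">"))
    # The delimiter, if any, is the first "#" after that position.
    hash_pos = uri.find("#", last_angle + 1)
    if hash_pos == -1:
        return uri, None
    return uri[:hash_pos], uri[hash_pos + 1:]


# Hypothetical inputs: an accidental "#" and an intentional "#target" suffix.
accidental = 'data:text/html,<body bgcolor="#f00">hi</body>'
intentional = 'data:text/html,<p id="target">x</p>#target'

print(split_fragment_strict(accidental))
print(split_fragment_relaxed(accidental))
print(split_fragment_relaxed(intentional))
```

Note that this requires scanning to the end of the URI before deciding what
any given "#" means.
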
>
> OVERVIEW OF BROWSERS' CURRENT HANDLING OF "#" IN DATA URIs:
> ===========================================================
> url: http://people.mozilla.org/~dholbert/dataURIHashTests/tests_v1.xhtml
>
>  * Firefox 6+ breaks the author's expectations in my tests A & B due to URI
> parsing strictness. (But if we were to implement the above proposal, we'd
> match the author's expectations.)  We pass test C due to correctly trimming
> "#target" off of the end and scrolling to the referenced element.  And we
> fail test D only due to a bug with over-enforcing same-origin checks.[3]
>
>  * WebKit matches the author's expectations on A & B -- however, that's only
> because they don't seem to support "#ref" suffixes on the ends of data URIs
> at all, so they _always_ include "#" in the document.  (They *do* apparently
> support _relative_ references within data URI documents, e.g.
> xlink:href='#greenRect' as used in test B.)  So, WebKit ends up failing test
> C because they don't strip off the "#target" suffix (resulting in broken
> XML).  They fail test D presumably for the same reason.  (They also have
> some zooming issues on the <img> examples, but I'm ignoring those for the
> purposes of this post.)
>
>  * Opera is interesting -- it can exhibit either the Firefox or WebKit
> behaviors in tests A/B/C, depending on whether you view the data URI as an
> embedded element (via iframe/img) or view it directly.  When you view it as an
> embedded element (in my testcase), Opera matches WebKit on A/B/C (including
> the XML parse error on C).  However, if you *directly view* the data URIs
> (right-click on iframe, Frame|Open, focus URLbar & hit enter), then Opera
> matches Firefox.  Also, Opera passes test D.
>
> (I don't have results for IE -- I briefly tried to support it in the test,
> but I had issues getting data URIs to work there at all.)
>
> CONCLUSION:
> ===========
> So, to sum up the test results above: WebKit doesn't give "#" any special
> delimiter status in data URIs, which is a bug, but probably matches what
> authors intend a lot of the time; Opera sometimes behaves like WebKit and
> sometimes not; and Firefox parses fragment identifiers strictly, potentially
> giving authors headaches and truncating content that renders fine in
> Opera/WebKit.
>
> With my proposal here -- relaxing the situations under which "#" should be
> treated as a delimiter in a data URI -- I think we'd better match author
> expectations and improve the browser-compatibility picture.
>
> Thoughts?
>
> Thanks,
> Daniel Holbert
> Mozilla Corporation
>
> P.S. Thanks to Robert O'Callahan for coming up with this proposal a week or
> so back.
>
> P.P.S. Browser versions that I tested (on Ubuntu 11.04 x86):
>  Firefox 6.0.2
>  Opera 11.51
>  Chromium 14.0.835.126 (Developer Build 99097 Linux)
>
> [1] https://www.ietf.org/rfc/rfc2396.txt (See section 4.1 & appendix B,
> "Parsing a URI Reference with a Regular Expression", which shows that "#"
> is technically disallowed up until the #reference at the end.)
>
> [2] https://bugzilla.mozilla.org/show_bug.cgi?id=308590
>
> [3] https://bugzilla.mozilla.org/show_bug.cgi?id=686013
>