[whatwg] Proposal for improved handling of '#' inside of data URIs

Sat Sep 10 14:15:09 PDT 2011

Hi whatwg,

I'm writing with a proposal to improve the handling of "#" in data URIs. 
I'm particularly looking for feedback from other browser vendors, but of 
course feedback from others is welcome as well.

SUMMARY:
========
Browsers handle the "#" character in data URIs very differently, and the 
arguably "correct" behavior is probably not what authors actually want in 
many cases.

Handling could be more intuitive (do-what-I-mean) if we restricted the 
cases under which "#" is treated as a fragment-ID delimiter inside of data 
URIs. In particular: when a "#" character is followed, anywhere later in 
the data URI, by ">" or "<", I propose that we *don't* treat the "#" as a 
delimiter, and instead just treat it as part of the encoded document.

Now, a set of tests, to which I'll refer below:
   http://people.mozilla.org/~dholbert/dataURIHashTests/tests_v1.xhtml

PROBLEM:
========
When an author writes a data URI for a document that contains a "#" 
character, she may unintentionally end up with broken results (or at least 
inconsistently-handled results), because the "#" may be treated as the end 
of the document & the beginning of the URI's fragment identifier.

(I believe this to be the _technically_ correct, albeit unintuitive, 
behavior per the URI RFC [1]. It's the behavior we've implemented in 
Firefox 6 [2], and it's what I've labeled "Correct" in my testcase, with 
the quotes there to indicate the unintuitiveness.)
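
To make the strict reading concrete, here is a minimal Python sketch that 
applies the regular expression from appendix B of the RFC [1] to a data 
URI. The example URI and the helper name strict_split are made up for 
illustration; they are not taken from the testcase.

```python
import re

# Regular expression from RFC 2396, appendix B
# ("Parsing a URI Reference with a Regular Expression").
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def strict_split(uri):
    """Return (scheme, path, fragment) as the appendix-B regex sees them."""
    m = URI_RE.match(uri)
    return m.group(2), m.group(5), m.group(9)

# Hypothetical data URI with an unescaped "#" inside an attribute value.
uri = "data:text/html,<body background='#f00'>hi</body>"
scheme, path, fragment = strict_split(uri)
print(scheme)    # data
print(path)      # text/html,<body background='
print(fragment)  # f00'>hi</body>
```

Everything from the first "#" onward lands in the fragment, so the 
document part is cut off in the middle of the attribute.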

Technically, the author *really* should encode the "#" character as "%23", 
if she doesn't want it to be a delimiter.
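
As a rough sketch of that workaround in Python (the helper name 
escape_hashes and the sample document are assumptions for illustration), 
escaping every "#" in the document before building the URI leaves no 
accidental delimiter behind. A real encoder would likely escape other 
reserved characters as well; this only handles "#".

```python
from urllib.parse import unquote

def escape_hashes(document):
    """Percent-encode every '#' so none can be read as a fragment delimiter."""
    return document.replace("#", "%23")

# Hypothetical document containing an unescaped "#".
document = "<body background='#f00'>hi</body>"
uri = "data:text/html," + escape_hashes(document)
print(uri)  # data:text/html,<body background='%23f00'>hi</body>

# Decoding the payload gives back the original document.
payload = uri[len("data:text/html,"):]
print(unquote(payload) == document)  # True
```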

However, this gotcha is easy to overlook -- especially because Opera & 
WebKit are less strict than Firefox in this respect and will gladly accept 
"#" inside data URIs under some circumstances.

THE PROPOSAL & HOW IT HELPS:
============================
We can help out the author by relaxing our fragment-ID-parsing rules a bit 
here.

Note that in cases where an author *accidentally* includes "#" inside 
their data URI (e.g. <body background="#f00">), there almost certainly 
will be more content following it -- in particular, there will be an 
</html>, or an </svg>, or at least a ">" (if it's inside the final tag) 
still to come.

So we can proactively check for >/< characters anywhere after the "#", and 
if we find them, then we can pretty safely assume that the author intended 
for the "#" to be part of the document, rather than a fragment-ID delimiter.
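
The check described above could be sketched in Python roughly as follows. 
This is only one possible reading of the proposal: the function name 
split_fragment_relaxed is hypothetical, and the sketch looks only at 
literal "<" and ">" characters (whether percent-encoded %3C/%3E should 
count is not addressed here).

```python
def split_fragment_relaxed(uri):
    """Split off the fragment, ignoring any '#' that has '<' or '>' after it.

    Returns (uri_without_fragment, fragment), with fragment None if absent.
    """
    # Index of the last literal '<' or '>' (-1 if there is none).
    last_angle = max(uri.rfind("<"), uri.rfind(">"))
    # The first '#' after that point has no angle bracket following it,
    # so it is the only '#' that can still act as a delimiter.
    hash_pos = uri.find("#", last_angle + 1)
    if hash_pos == -1:
        return uri, None
    return uri[:hash_pos], uri[hash_pos + 1:]

# Hypothetical URIs for illustration.
doc = "data:text/html,<body background='#f00'>hi</body>"
print(split_fragment_relaxed(doc))              # whole URI kept, fragment None
print(split_fragment_relaxed(doc + "#target"))  # fragment is 'target'
```

A URI with no angle brackets at all falls back to the strict behavior, 
since the first "#" is then the delimiter.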

OVERVIEW OF BROWSERS' CURRENT HANDLING OF "#" IN DATA URIs:
===========================================================
url: http://people.mozilla.org/~dholbert/dataURIHashTests/tests_v1.xhtml

  * Firefox 6+ breaks the author's expectations in my tests A & B due to 
URI parsing strictness. (But if we were to implement the above proposal, 
we'd match the author's expectations.)  We pass test C due to correctly 
trimming "#target" off of the end and scrolling to the referenced element. 
And we fail test D only due to a bug with over-enforcing same-origin 
checks.[3]

  * WebKit matches the author's expectations on A & B -- however, that's 
only because they don't seem to support "#ref" suffixes on the ends of 
data URIs at all, so they _always_ include "#" in the document.  (They 
*do* apparently support _relative_ references within data URI documents, 
e.g. xlink:href='#greenRect' as used in test B.)  So, WebKit ends up 
failing test C because they don't strip off the "#target" suffix 
(resulting in broken XML).  They fail test D presumably for the same 
reason.  (They also have some zooming issues on the <img> examples, but 
I'm ignoring those for the purposes of this post.)

  * Opera is interesting -- it can exhibit either the Firefox or WebKit 
behaviors in tests A/B/C, depending on whether you view the data URI as 
an embedded element (via iframe/img) or view it directly.  When you view 
it as an embedded element (in my testcase), Opera matches WebKit on A/B/C 
(including the XML parse error on C).  However, if you *directly view* the 
data URIs (right-click on iframe, Frame|Open, focus URLbar & hit enter), 
then Opera matches Firefox.  Also, Opera passes test D.

(I don't have results for IE -- I briefly tried to support it in the test, 
but I had issues getting data URIs to work there at all.)
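
For a side-by-side view, here is a small Python sketch of the three 
fragment-splitting strategies discussed in this post. The two URIs are 
only loosely modeled on how tests A and C are described; they are not the 
actual test contents, and the function names are made up. Each function 
models just the "#" handling, not anything else the browsers do.

```python
def strict(uri):
    """Firefox 6 as described above: the first '#' starts the fragment."""
    head, sep, frag = uri.partition("#")
    return head, (frag if sep else None)

def never_split(uri):
    """WebKit as described above: '#' is always part of the document."""
    return uri, None

def relaxed(uri):
    """The proposal: a '#' followed later by '<' or '>' is not a delimiter."""
    last_angle = max(uri.rfind("<"), uri.rfind(">"))
    pos = uri.find("#", last_angle + 1)
    return (uri, None) if pos == -1 else (uri[:pos], uri[pos + 1:])

# Hypothetical URIs: a "#" inside the markup, and a trailing "#target".
a_like = "data:image/svg+xml,<svg><rect fill='#0f0'/></svg>"
c_like = "data:text/html,<p id='target'>hi</p>#target"

for name, split in (("strict", strict), ("never_split", never_split),
                    ("relaxed", relaxed)):
    print(name, split(a_like), split(c_like))
```

Only the relaxed variant both keeps the first document intact and strips 
"#target" from the second.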

CONCLUSION:
===========
So, to sum up the test results above: WebKit doesn't give "#" any special 
delimiter status in data URIs, which is a bug, but probably matches what 
authors intend a lot of the time; Opera sometimes behaves like WebKit and 
sometimes like Firefox; and Firefox parses fragment identifiers strictly, 
potentially giving authors headaches and truncating content that renders 
fine in Opera/WebKit.

With my proposal here -- relaxing the situations under which "#" should be 
treated as a delimiter in a data URI -- I think we'd better match author 
expectations and improve the browser-compatibility picture.

Thoughts?

Thanks,
Daniel Holbert
Mozilla Corporation

P.S. Thanks to Robert O'Callahan for coming up with this proposal a week 
or so back.

P.P.S. Browser versions that I tested (on Ubuntu 11.04 x86):
  Firefox 6.0.2
  Opera 11.51
  Chromium 14.0.835.126 (Developer Build 99097 Linux)

[1] https://www.ietf.org/rfc/rfc2396.txt (See section 4.1 & appendix "B", 
"Parsing a URI Reference with a Regular Expression", which shows that "#" 
is technically disallowed up until the #reference at the end.)

[2] https://bugzilla.mozilla.org/show_bug.cgi?id=308590

[3] https://bugzilla.mozilla.org/show_bug.cgi?id=686013