[whatwg] Proposal for improved handling of '#' inside of data URIs
dholbert at mozilla.com
Sat Sep 10 14:15:09 PDT 2011
I'm writing with a proposal to improve the handling of "#" in data URIs.
I'm particularly looking for feedback from other browser vendors, but of
course feedback from others is welcome as well.
Browsers handle the "#" character in data URIs very differently, and the
arguably "correct" behavior is probably not what authors actually want.
This could be more intuitive/do-what-I-mean if we restricted the cases
under which "#" is treated as a fragment-ID delimiter inside of data URIs.
In particular: when a "#" character is followed by ">" or "<" in a data
URI, I propose that we *don't* treat the "#" as a delimiter, and instead
just treat it as part of the encoded document.
Now, a set of tests, to which I'll refer below:
When an author writes a data URI for a document that contains a "#"
character, she may unintentionally end up with broken results (or at least
inconsistently-handled results), because the "#" may be treated as the end
of the document & the beginning of the URI's fragment identifier.
(I believe this to be the _technically_ correct (albeit unintuitive)
behavior per URI RFC 2396 -- it's the behavior we've implemented in
Firefox 6, and it's what I've described as "Correct" in my testcase,
with quotes to indicate unintuitiveness.)
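To make the truncation concrete, here is a minimal Python sketch of the strict rule (split at the first "#"). The helper name split_strict and the sample URI are made up for illustration; this is not any browser's actual parsing code.

```python
from typing import Optional, Tuple


def split_strict(uri: str) -> Tuple[str, Optional[str]]:
    """Split a URI at its first '#', the way a strict parser would.

    Returns (everything before the '#', fragment), with fragment set to
    None when the URI contains no '#' at all.
    """
    body, sep, fragment = uri.partition("#")
    return body, (fragment if sep else None)


# Hypothetical data URI whose document contains an unescaped '#'.
uri = 'data:text/html,<body background="#f00">hello</body>'
body, fragment = split_strict(uri)
print(body)      # the document is cut off inside the attribute value
print(fragment)  # the rest of the markup lands in the fragment
```

Under this rule the document proper ends at `background="`, and the remaining markup is handed to the fragment-ID machinery instead of the HTML parser.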
Technically, the author *really* should encode the "#" character as "%23",
if she doesn't want it to be a delimiter.
However, this gotcha is easy to overlook -- especially because Opera &
WebKit are less strict than Firefox in this respect and will gladly accept
"#" inside data URIs under some circumstances.
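For authors who generate data URIs programmatically, the "%23" escaping can be done mechanically. The following is a small Python sketch using the standard library's urllib.parse.quote; the sample markup is an assumption chosen for illustration.

```python
from urllib.parse import quote, unquote

# Hypothetical document containing a literal '#'.
markup = '<body background="#f00">hello</body>'

# safe="" asks quote() to percent-encode every reserved character,
# including '#' (which becomes "%23") and '/'.
data_uri = "data:text/html," + quote(markup, safe="")
print(data_uri)

# Decoding the payload gives back the original markup.
payload = data_uri.split(",", 1)[1]
print(unquote(payload) == markup)
```

This encodes more than strictly necessary (spaces, quotes, and angle brackets are escaped too), but it guarantees that no bare "#" is left for a strict parser to treat as a delimiter.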
THE PROPOSAL & HOW IT HELPS:
We can help out the author by relaxing our fragment-ID-parsing rules a bit.
Note that in cases where an author *accidentally* includes "#" inside
their data URI (e.g. <body background="#f00">), there almost certainly
will be more content following it -- in particular, there will be an
</html>, or an </svg>, or at least a ">" (if it's inside the final tag)
still to come.
So we can proactively check for >/< characters anywhere after the "#", and
if we find them, then we can pretty safely assume that the author intended
for the "#" to be part of the document, rather than a fragment-ID delimiter.
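As a rough illustration of the check described above, here is a Python sketch in which a "#" only counts as the fragment delimiter if no "<" or ">" appears anywhere after it. The function name split_relaxed is hypothetical, and the sketch assumes only literal (not percent-encoded) angle brackets are considered, which the proposal does not specify.

```python
from typing import Optional, Tuple


def split_relaxed(uri: str) -> Tuple[str, Optional[str]]:
    """Split a data URI into (document part, fragment) under the relaxed rule.

    A '#' is treated as the delimiter only when no '<' or '>' follows it.
    Assumption: only literal angle brackets count, not %3C / %3E.
    """
    # Index of the last literal angle bracket, or -1 if there is none.
    last_angle = max(uri.rfind("<"), uri.rfind(">"))
    # The first '#' past that point is the only possible delimiter.
    hash_pos = uri.find("#", last_angle + 1)
    if hash_pos == -1:
        return uri, None
    return uri[:hash_pos], uri[hash_pos + 1:]


# '#' inside markup: kept as part of the document.
print(split_relaxed('data:text/html,<body background="#f00">hi</body>'))
# Trailing '#target' after the closing tag: still a fragment.
print(split_relaxed('data:image/svg+xml,<svg><rect id="target"/></svg>#target'))
```

With no angle brackets at all (for example a text/plain payload), the sketch falls back to splitting at the first "#", which matches the strict behavior.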
OVERVIEW OF BROWSERS' CURRENT HANDLING OF "#" IN DATA URIs:
* Firefox 6+ breaks the author's expectations in my tests A & B due to
URI parsing strictness. (But if we were to implement the above proposal,
we'd match the author's expectations.) We pass test C due to correctly
trimming "#target" off of the end and scrolling to the referenced element.
And we fail test D only due to a bug with over-enforcing same-origin
restrictions.
* WebKit matches the author's expectations on A & B -- however, that's
only because they don't seem to support "#ref" suffixes on the ends of
data URIs at all, so they _always_ include "#" in the document. (They
*do* apparently support _relative_ references within data URI documents,
e.g. xlink:href='#greenRect' as used in test B.) So, WebKit ends up
failing test C because they don't strip off the "#target" suffix
(resulting in broken XML). They fail test D presumably for the same
reason. (They also have some zooming issues on the <img> examples, but
I'm ignoring those for the purposes of this post.)
* Opera is interesting -- it can exhibit either the Firefox or WebKit
behaviors in tests A/B/C, depending on whether you view the data URI as an
embedded element (via iframe/img) or view it directly. When you view it as an
embedded element (in my testcase), Opera matches WebKit on A/B/C
(including the XML parse error on C). However, if you *directly view* the
data URIs (right-click on iframe, Frame|Open, focus URLbar & hit enter),
then Opera matches Firefox. Also, Opera passes test D.
(I don't have results for IE -- I briefly tried to support it in the test,
but I had issues getting data URIs to work there at all.)
So, to sum up the test results above: WebKit doesn't give "#" any special
delimiter status in data URIs, which is a bug, but probably matches what
authors intend a lot of the time; Opera sometimes behaves like WebKit and
sometimes like Firefox; and Firefox parses fragment identifiers strictly,
potentially giving authors headaches and truncating content that renders
fine in Opera/WebKit.
With my proposal here -- relaxing the situations under which "#" should be
treated as a delimiter in a data URI -- I think we'd better match author
expectations and improve the browser-compatibility picture.
P.S. Thanks to Robert O'Callahan for coming up with this proposal a week
or so back.
P.P.S. Browser versions that I tested (on Ubuntu 11.04 x86):
Chromium 14.0.835.126 (Developer Build 99097 Linux)
https://www.ietf.org/rfc/rfc2396.txt (See section 4.1 & appendix "B",
"Parsing a URI Reference with a Regular Expression", which shows that "#"
is technically disallowed up until the #reference at the end.)