[whatwg] Proposal for improved handling of '#' inside of data URIs

Sun Sep 11 10:54:01 PDT 2011

On Sun, 11 Sep 2011 11:30:07 -0400, Daniel Holbert <dholbert at mozilla.com>  
wrote:

> On 09/11/2011 07:21 AM, Michael A. Puls II wrote:
>> Not only must "#" be "%23" if you don't want it as a frag id, but ">"
>> and "<" should be "%3E" and "%3C".
> [...]
>> Of course, if you can percent-encode everything needed as you type, you
>> can hand-author the URI data. But, who wants to do that,
>
> As I noted in a response to Nils earlier in this thread,  
> Firefox/Webkit/Opera don't actually require authors to percent-encode  
> brackets and spaces in data URIs. (not sure whether that's correct per  
> spec or not).
>
> For example
>    data:text/html,<i>here is some italic text<i>
> works just fine in all three.
>
> So that makes it quite easy to hand-author data URIs, in fact. (aside  
> from this "#" gotcha)

Yes, but it's important to know that the browser still percent-decodes
everything after the ",". It's just that in this case, there are no %HH
sequences to decode. You have to be careful here and know that the
data/markup is still not literal. For example, if you want a literal
"%5E", you have to use %255E. If you include a URI with a bunch of %HH
sequences, you have to escape all those "%". So, while typing, if you have
no problem typing %25, you should have no problem typing %23.
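
To make the decoding point concrete, here is a rough Python sketch. The
helper names (make_data_uri, payload_of) are made up for illustration, and
payload_of only models the "split off the frag id, then percent-decode"
behavior described above; it ignores ";base64" and charset parameters.

```python
from urllib.parse import quote, unquote


def make_data_uri(markup: str, mime: str = "text/html") -> str:
    """Build a data URI with the payload fully percent-encoded.

    safe="" makes quote() escape "#", "%", "<", ">", "/" and spaces too.
    """
    return f"data:{mime}," + quote(markup, safe="")


def payload_of(data_uri: str) -> str:
    """Return the data a consumer would see (hypothetical, simplified model)."""
    # Everything after the first "," is the data part.
    _, _, rest = data_uri.partition(",")
    # An unescaped "#" starts the frag id, so it never reaches the payload.
    data, _, _fragment = rest.partition("#")
    # The remainder is percent-decoded, even if it "looks" literal.
    return unquote(data)


if __name__ == "__main__":
    print(payload_of("data:text/plain,%5E"))    # decoded, not literal
    print(payload_of("data:text/plain,%255E"))  # the way to get a literal %5E
    print(payload_of("data:text/plain,a#b"))    # "#b" is treated as a frag id
    print(make_data_uri("<p>#1 100%</p>"))
```

Encoding with make_data_uri and then reading back with payload_of should
round-trip the markup unchanged, "#" and "%" included.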

Are you saying that data URI authors know that they have to escape "%",  
but don't know that they have to escape "#"? Or, are you saying that the  
problem is more serious and data URI authors think the data is  
*completely* literal? If the latter, we definitely shouldn't be  
encouraging anything but properly-encoded data.

FWIW, I asked for advice on "#" in mailto URIs (since mailto URI handlers  
don't make use of frag ids for mailto and frag ids are not specified for  
mailto) at  
<http://lists.w3.org/Archives/Public/public-iri/2009Oct/0030.html> and  
wanted to propose that '#' be allowed as-is when authoring without having  
to percent-encode it. But, that didn't go over too well.

>    data:text/html,<i>here is some italic text<i>

I don't really like that though as it's not portable. If I wanted to copy  
that from the address field and paste it into a plain-text document, it'd  
look funny like this:

<data:text/html,<i>here is some italic text<i>>

And, for mail clients that linkify links in plain-text messages, I can see  
that going wrong with the link (the clickable, underlined part and href)  
ending up as only "data:text/html,<i".
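
As a rough illustration of how that could happen (real mail clients use
their own, different rules; the pattern and the first_link name below are
assumptions of mine), a naive linkifier that treats "<...>" as URI
delimiters and stops at the first ">" would behave like this:

```python
import re

# Hypothetical linkifier rule: "<scheme:...>" where the URI runs up to the
# first ">" character. Not taken from any particular mail client.
ANGLE_LINK = re.compile(r"<([a-z][a-z0-9+.-]*:[^>]*)>", re.IGNORECASE)


def first_link(text: str):
    """Return the first URI found between angle brackets, or None."""
    match = ANGLE_LINK.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    raw = "<data:text/html,<i>here is some italic text<i>>"
    encoded = "<data:text/html,%3Ci%3Ehere%20is%20some%20italic%20text%3Ci%3E>"
    print(first_link(raw))      # cut off at the first ">" inside the markup
    print(first_link(encoded))  # survives intact
```

With the percent-encoded form there is no stray ">" inside the URI, so the
same rule picks up the whole thing.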

> So we can proactively check for >/< characters anywhere after the "#",  
> and if we find them, then we can pretty safely assume that the author  
> intended for the "#" to be part of the document, rather than a  
> fragment-ID delimiter.

I still don't like it personally as it further encourages authors to not  
encode their data and is not portable. But, if this is to happen, it  
should definitely be limited to MIME types that contain markup. It
wouldn't be useful for data:text/plain (how would you differentiate in
that case?). And, for text/javascript, text/css, etc., some other type
of lookahead character(s) would have to be used.
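
For reference, this is roughly how I read the proposed lookahead, restricted
to markup types. It is only a sketch: split_fragment and the MARKUP_TYPES
list are my own assumptions, not anything a browser implements, and it
assumes the input starts with "data:".

```python
from typing import Optional, Tuple

# Assumed set of "markup" MIME types; the real list would need to be agreed on.
MARKUP_TYPES = {"text/html", "application/xhtml+xml", "image/svg+xml"}


def split_fragment(data_uri: str) -> Tuple[str, Optional[str]]:
    """Split a data URI into (uri_without_fragment, fragment_or_None).

    Heuristic: for markup types, if "<" or ">" appears anywhere after the
    first "#", assume the "#" was meant as data, not as a frag id delimiter.
    """
    head, comma, rest = data_uri.partition(",")
    if not comma:
        return data_uri, None
    # Media type is whatever follows "data:" up to the first ";" parameter.
    mime = head[len("data:"):].split(";", 1)[0].strip().lower() or "text/plain"
    data, hash_sign, after_hash = rest.partition("#")
    if not hash_sign:
        return data_uri, None
    if mime in MARKUP_TYPES and ("<" in after_hash or ">" in after_hash):
        return data_uri, None  # keep the "#" as part of the document
    return head + comma + data, after_hash


if __name__ == "__main__":
    print(split_fragment("data:text/html,<a href='#x'>x</a>"))
    print(split_fragment("data:text/html,<p>hi</p>#top"))
    print(split_fragment("data:text/plain,a#b<c"))
```

Note that the text/plain case still splits at the "#", since there is
nothing to look ahead for there.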

-- 
Michael