[whatwg] New URL Standard

Mon Sep 24 07:06:14 PDT 2012

On 9/24/12 4:58 AM, Anne van Kesteren wrote:
> Say you have <a href="data:test"/>; the concern is what e.g.
> a.protocol and a.pathname would return here. For invalid URLs they
> would return ":" and "" respectively. If we treat this as a valid URL
> you would get "data:" and "test". In Gecko I get "http:" and "". If I
> make that <a href="data:text/html,test"/> Gecko will give meaningful
> answers (well pathname is still "", maybe that is okay and pathname
> should only work for hierarchical URLs).

Ah, I see.

So what happens here is that Gecko treats this as an invalid URL (more 
precisely, it cannot create an internal "URI" object from this string). 
I guess that's what you were getting at: data: URLs have a concept of 
"invalid" in Gecko.  This is true, in general, for all schemes Gecko 
supports.  For example, "http://something or other" (with the spaces) 
will do the same thing.

For an invalid URI, .protocol currently returns "http:" in Gecko.  I 
have no idea why, offhand.  It could just as easily return ":".

As for .pathname, Gecko does exactly what you say: .pathname only works 
on hierarchical schemes.

> More general, what I want is that for *any* given input in <a
> href="..."/>, xhr.open("GET", ...), new URL(...), etc. I want to be
> able to tell what the various URL components are going to be. The kind
> of predictability we have for the HTML parser, I want to have for the
> URL parser as well.

Yes, absolutely agreed.

> (If that means handling data URLs at the layer of the URL parser
> rather than a separate parser that goes over the path, as Gecko
> appears to be doing, so be it.)

We could change Gecko's handling here, for what it's worth.  One reason 
for the current handling is that right now we don't even make <a> into a 
link unless its href is a valid URI as far as Gecko is concerned.  But 
I'm considering changing that anyway, since no one else bothers with 
such niceties and they complicate implementation a bit...

>> If you want constructive advice, it would be interesting to get a full list
>> of all the weird stuff that UAs do here so we can evaluate which parts of it
>> are needed and why.  I can try to produce such a list for Gecko, if there
>> seems to be motion on the general idea.
>
> I think that would be a great start. I'm happy to start out with
> Gecko's behavior and iterate over time as feedback comes in from other
> browsers.

Hmm.  So here goes at least a partial list:

1)  On Windows and OS/2, Gecko replaces '\\' with '/' in file:// URI 
strings before doing anything else with the string when parsing a new 
URL.  That includes relative URI strings being resolved against a 
file:// base.

2)  file:// URIs are parsed as a "no authority" URL in Gecko.  Quoting 
the IDL comment:

35     /**
36      * blah:foo/bar    => blah:///foo/bar
37      * blah:/foo/bar   => blah:///foo/bar
38      * blah://foo/bar  => blah://foo/bar
39      * blah:///foo/bar => blah:///foo/bar
40      */

where the thing on the left is the input string and the thing on the 
right is the normalized form that the parser produces from it.  Note 
that this is different from how HTTP URIs are parsed, for all except the 
item on line number 38 there.

3)  Gecko does not allow setting a username, password, hostname, or port 
on an existing "no authority" URL object, including file://.  Attempts 
to do that throw internally; I believe for web stuff it just becomes a 
no-op.

4)  For "no authority" URLs, including file://, on Windows and OS/2 
only, if what would be the authority section looks like a drive letter, 
it's treated as part of the path.  For example, "file://c:/" is treated 
as the filename "c:\".  "Looks like a drive letter" is defined as "ASCII 
letter (any case), followed by a ':' or '|' and then followed by end of 
string or '/' or '\\'".  I'm not sure why this is checking for '\\' 
again, honestly.  ;)

5)  When parsing a "no authority" URL (including file://), and when item 
4 above does not apply, it looks like Gecko skips everything after 
"file://" up until the next '/', '?', or '#' char before parsing the path.

6)  On Windows and OS/2, when dynamically parsing a path for a "no 
authority" URL (not sure whether this is actually web-exposed, fwiw...) 
Gecko will do something involving looking for a path that's only an 
ASCII letter followed by ':' or '|' followed by end of string.  I'm not 
quite sure what that part is about...  It might have to do with the fact 
that URI objects in Gecko can have concepts of "directory", "filename", 
"extension" or something like that.

7)  When doing URI equality comparisons, if two file:// URIs only differ 
in their directory/filename/extension (so the actual file path), then an 
equality comparison is done on the underlying file path objects.  What 
this means depends on the OS.  On "Unix" this is just a straight-up byte 
by byte compare of file paths.  I think OS X now follows the "Unix" code 
path as do most other supported platforms.  But note that "file path" in 
this case is normalized in various ways.  Specifically: trailing '/' 
characters are stripped, and on OS X some sort of normalization of HFS 
paths (possibly with a volume name) to POSIX paths is done.  One result 
of the latter is 
that file:///Users%2fbzbarsky ends up seeing my home directory, which is 
... slightly surprising.  On "Unix", the path bytes are treated as UTF-8 
if they're valid UTF-8, else treated as whatever the current locale 
charset is, I think.  Oh, and there is some sort of escaping going on 
for directory names, filenames, extensions.  Not sure what that's about, 
if anything.  The URI-escaping code is black magic, but I'm happy to run 
some black-box tests on it if someone wants to provide test strings.

The things that don't go through the "Unix" code for this stuff are 
Windows and OS/2.  I'm not going to dig through the OS/2 stuff, but on 
Windows if the filename contains a nonempty directory name and the 
second char is '|' that's converted to a ':'.  Again, escaping for 
directory names and file names and extensions.  Again, things that look 
like UTF-8 are treated thus and other stuff uses the current codepage. 
After all that, the actual equality comparison is done via _wcsicmp on 
the return value of GetShortPathNameW.  So whatever things that 
combination considers equal are equal.

8)  When actually resolving a file:// URL, the underlying file path 
object as described above is used to get the data.  Plus there's a bit 
of weirdness about symlinks, I think...  Mostly affects what's shown in 
the url bar when pointing the browser to a symlink.
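
To make items 1, 2, 4 and 5 above easier to test against, here is a rough 
Python sketch of the rules as described.  This is not Gecko's code: all 
function names and the "windows" flag are made up for illustration, and 
the behavior is only what the descriptions above say, so treat it as a 
starting point for black-box tests rather than a reference.

```python
import re

# Item 4: ASCII letter (any case), then ':' or '|', then end of string,
# '/' or '\'.  The pattern is a transcription of that description only.
_DRIVE_RE = re.compile(r"[A-Za-z][:|](?:[/\\]|\Z)")


def looks_like_drive_letter(text: str) -> bool:
    """Return True if `text` starts with something shaped like a drive letter."""
    return _DRIVE_RE.match(text) is not None


def backslashes_to_slashes(spec: str) -> str:
    """Item 1 (Windows and OS/2 only): replace every backslash with '/'."""
    return spec.replace("\\", "/")


def normalize_no_authority(spec: str) -> str:
    """Item 2: reproduce the four cases from the quoted IDL comment."""
    scheme, sep, rest = spec.partition(":")
    if not sep:
        raise ValueError("no scheme in %r" % spec)
    if rest.startswith("//"):
        # blah://foo/bar and blah:///foo/bar are left as they are.
        return spec
    # blah:foo/bar and blah:/foo/bar both become blah:///foo/bar.
    return f"{scheme}:///{rest.lstrip('/')}"


def path_after_ignored_authority(after_slashes: str, windows: bool = False) -> str:
    """Items 4 and 5, applied to the text following "file://".

    `windows` is a hypothetical flag standing in for "Windows or OS/2".
    """
    if windows and looks_like_drive_letter(after_slashes):
        # Item 4: the drive letter is treated as part of the path.
        return after_slashes
    # Item 5: skip everything up to the next '/', '?' or '#'.
    match = re.search(r"[/?#]", after_slashes)
    return after_slashes[match.start():] if match else ""
```

For instance, path_after_ignored_authority("c:/", windows=True) keeps 
"c:/" as the path, while the same input with windows=False skips "c:" 
and leaves just "/".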

That's what I can spot offhand.  I won't guarantee there is nothing 
else.  :(
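
For the "Unix" side of item 7, the comparison and charset handling might 
look roughly like the Python sketch below.  Again, this is a guess at the 
shape, not Gecko's code: the function names are invented, keeping a bare 
"/" for the root is my assumption, and the escaping and OS X HFS 
normalization are left out entirely.

```python
import locale


def normalize_unix_path(path: bytes) -> bytes:
    """Strip trailing '/' bytes; keep a lone '/' for the root (assumption)."""
    stripped = path.rstrip(b"/")
    return stripped or b"/"


def same_file_path(a: bytes, b: bytes) -> bool:
    """Byte-by-byte comparison of the normalized paths (case-sensitive)."""
    return normalize_unix_path(a) == normalize_unix_path(b)


def decode_path_bytes(raw: bytes) -> str:
    """Treat bytes as UTF-8 if valid, else fall back to the locale charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Undecodable bytes are replaced here; what Gecko does is not
        # described above.
        return raw.decode(locale.getpreferredencoding(False), errors="replace")
```

With this, same_file_path(b"/home/user/", b"/home/user") is true, while 
paths differing only in case compare as different.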

-Boris