[whatwg] New URL Standard
Boris Zbarsky
bzbarsky at MIT.EDU
Mon Sep 24 07:06:14 PDT 2012
On 9/24/12 4:58 AM, Anne van Kesteren wrote:
> Say you have <a href="data:test"/>; the concern is what e.g.
> a.protocol and a.pathname would return here. For invalid URLs they
> would return ":" and "" respectively. If we treat this as a valid URL
> you would get "data:" and "test". In Gecko I get "http:" and "". If I
> make that <a href="data:text/html,test"/> Gecko will give meaningful
> answers (well pathname is still "", maybe that is okay and pathname
> should only work for hierarchical URLs).
Ah, I see.
So what happens here is that Gecko treats this as an invalid URL (more
precisely, it cannot create an internal "URI" object from this string).
I guess that's what you were getting at: that data: URLs have a concept
of "invalid" in Gecko. This is true in general for all schemes Gecko
supports. For example, "http://something or other" (with the spaces)
will do the same thing.
For an invalid URI, .protocol currently returns "http:" in Gecko. I
have no idea why, offhand. It could just as easily return ":".
As for .pathname, Gecko does exactly what you say: .pathname only works
on hierarchical schemes.
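
For comparison, here is what the "treat it as a valid URL" reading of the two data: examples looks like with a generic splitter. This is only an illustration using Python's urllib.parse, which is not a browser URL parser and says nothing about Gecko's behavior; the helper name protocol_and_pathname is made up for this sketch.

```python
from urllib.parse import urlsplit


def protocol_and_pathname(url):
    """Return (protocol, pathname) in the style of a.protocol / a.pathname.

    Assumption: a generic scheme/path split stands in for the
    "valid URL" case; this does not model Gecko's hierarchical-only
    .pathname or its invalid-URL fallback.
    """
    parts = urlsplit(url)
    return parts.scheme + ":", parts.path


for href in ("data:test", "data:text/html,test"):
    print(href, "->", protocol_and_pathname(href))
```

With this reading, "data:test" splits into "data:" and "test", matching the values quoted above for the valid-URL case.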
> More general, what I want is that for *any* given input in <a
> href="..."/>, xhr.open("GET", ...), new URL(...), etc. I want to be
> able to tell what the various URL components are going to be. The kind
> of predictability we have for the HTML parser, I want to have for the
> URL parser as well.
Yes, absolutely agreed.
> (If that means handling data URLs at the layer of the URL parser
> rather than a separate parser that goes over the path, as Gecko
> appears to be doing, so be it.)
We could change Gecko's handling here, for what it's worth. One reason
for the current handling is that right now we don't even make <a> into a
link unless its href is a valid URI as far as Gecko is concerned. But
I'm considering changing that anyway, since no one else bothers with
such niceties and they complicate implementation a bit...
>> If you want constructive advice, it would be interesting to get a full list
>> of all the weird stuff that UAs do here so we can evaluate which parts of it
>> are needed and why. I can try to produce such a list for Gecko, if there
>> seems to be motion on the general idea.
>
> I think that would be a great start. I'm happy to start out with
> Gecko's behavior and iterate over time as feedback comes in from other
> browsers.
Hmm. So here goes at least a partial list:
1) On Windows and OS/2, Gecko replaces '\\' with '/' in file:// URI
strings before doing anything else with the string when parsing a new
URL. That includes relative URI strings being resolved against a
file:// base.
2) file:// URIs are parsed as a "no authority" URL in Gecko. Quoting
the IDL comment:
35 /**
36 * blah:foo/bar => blah:///foo/bar
37 * blah:/foo/bar => blah:///foo/bar
38 * blah://foo/bar => blah://foo/bar
39 * blah:///foo/bar => blah:///foo/bar
40 */
where the thing on the left is the input string and the thing on the
right is the normalized form that the parser produces from it. Note
that this is different from how HTTP URIs are parsed, for all except the
item on line number 38 there.
3) Gecko does not allow setting a username, password, hostname, or port
on an existing "no authority" URL object, including file://. Attempts to
do that throw internally; I believe for web stuff it just becomes a no-op.
4) For "no authority" URLs, including file://, on Windows and OS/2
only, if what looks like the authority section looks like a drive letter,
it's treated as part of the path. For example, "file://c:/" is treated
as the filename "c:\". "Looks like a drive letter" is defined as "ASCII
letter (any case), followed by a ':' or '|' and then followed by end of
string or '/' or '\\'". I'm not sure why this is checking for '\\'
again, honestly. ;)
5) When parsing a "no authority" URL (including file://), and when item
4 above does not apply, it looks like Gecko skips everything after
"file://" up until the next '/', '?', or '#' char before parsing path stuff.
6) On Windows and OS/2, when dynamically parsing a path for a "no
authority" URL (not sure whether this is actually web-exposed, fwiw...)
Gecko will do something involving looking for a path that's only an
ASCII letter followed by ':' or '|' followed by end of string. I'm not
quite sure what that part is about... It might have to do with the fact
that URI objects in Gecko can have concepts of "directory", "filename",
"extension" or something like that.
7) When doing URI equality comparisons, if two file:// URIs only differ
in their directory/filename/extension (so the actual file path), then an
equality comparison is done on the underlying file path objects. What
this means depends on the OS. On "Unix" this is just a straight-up byte
by byte compare of file paths. I think OS X now follows the "Unix" code
path as do most other supported platforms. But note that "file path" in
this case is normalized in various ways. Specifically: trailing '/'
characters are stripped, and some sort of normalization of HFS paths
(possibly with a volume name) to POSIX paths is done on OS X. One result
of the latter is
that file:///Users%2fbzbarsky ends up seeing my home directory, which is
... slightly surprising. On "Unix", the path bytes are treated as UTF-8
if they're valid UTF-8, else treated as whatever the current locale
charset is, I think. Oh, and there is some sort of escaping going on
for directory names, filenames, extensions. Not sure what that's about,
if anything. The URI-escaping code is black magic, but I'm happy to run
some black-box tests on it if someone wants to provide test strings.
The things that don't go through the "Unix" code for this stuff are
Windows and OS/2. I'm not going to dig through the OS/2 stuff, but on
Windows if the filename contains a nonempty directory name and the
second char is '|' that's converted to a ':'. As on "Unix", there is
escaping for directory names, filenames, and extensions, and things that
look like UTF-8 are treated as UTF-8 while other stuff uses the current
codepage.
After all that, the actual equality comparison is done via _wcsicmp on
the return value of GetShortPathNameW. So whatever things that
combination considers equal are equal.
8) When actually resolving a file:// URL, the underlying file path
object as described above is used to get the data. Plus there's a bit
of weirdness about symlinks, I think... Mostly affects what's shown in
the url bar when pointing the browser to a symlink.
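
To make items 1, 2 and 4 a bit more concrete, here is a minimal Python sketch. It is not Gecko code: the function names are made up for illustration, the normalization only reproduces the four rows of the IDL comment, and the drive-letter check only encodes the definition quoted in item 4. It also assumes that '\\' in the text above is the C character literal for a single backslash.

```python
import string


def backslashes_to_slashes(spec):
    # Item 1 (Windows and OS/2 only): replace each backslash with '/'
    # before any other parsing. Assumes '\\' above means one backslash.
    return spec.replace("\\", "/")


def normalize_no_authority(spec):
    # Item 2: reproduce the four rows of the IDL comment, nothing more.
    scheme, colon, rest = spec.partition(":")
    if not colon:
        raise ValueError("no scheme in %r" % spec)
    if rest.startswith("//"):
        # Lines 38 and 39 of the comment: left unchanged.
        return spec
    # Lines 36 and 37: "blah:foo/bar" and "blah:/foo/bar" both end up
    # with an empty authority and an absolute path.
    return scheme + ":///" + rest.lstrip("/")


def looks_like_drive_letter(after_slashes):
    # Item 4: ASCII letter (any case), then ':' or '|', then end of
    # string or '/' or a backslash. The argument is whatever follows
    # the "//" of the URL string.
    s = after_slashes
    if len(s) < 2 or s[0] not in string.ascii_letters or s[1] not in ":|":
        return False
    return len(s) == 2 or s[2] in "/\\"


for spec in ("blah:foo/bar", "blah:/foo/bar", "blah://foo/bar", "blah:///foo/bar"):
    print(spec, "=>", normalize_no_authority(spec))
```

For example, looks_like_drive_letter("c:/") is true, so under item 4 the "c:" in "file://c:/" would be treated as part of the path, while looks_like_drive_letter("cd:/") is false.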
That's what I can spot offhand. I won't guarantee there is nothing
else. :(
-Boris