[whatwg] URL: file: URLs

Sat Oct 27 12:35:21 PDT 2012

On Mon, Sep 24, 2012 at 4:06 PM, Boris Zbarsky <bzbarsky at mit.edu> wrote:
> Hmm.  So here goes at least a partial list:
>
> 1)  On Windows and OS/2, Gecko replaces '\\' with '/' in file:// URI strings
> before doing anything else with the string when parsing a new URL.  That
> includes relative URI strings being resolved against a file:// base.

This is covered, as we currently do this for all URLs with a "relative
scheme" (http/ws/...). I know you flagged this as potentially
problematic, but note that a) "\" as a raw code point is invalid in a
URL, b) because of a) you can still represent it as %5C, and c) other
user agents have hit issues with not treating \ the same as / outside
of file: URLs.
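
For illustration, here is a minimal Python sketch of the replacement
described in 1), applied to any "relative scheme". The scheme set, the
regular expression, and the function name are assumptions made for this
example; they are not taken from Gecko or from the specification, and
replacing across the whole string is a simplification.

```python
import re

# Hypothetical set of "relative schemes"; the real list may differ.
RELATIVE_SCHEMES = {"file", "ftp", "gopher", "http", "https", "ws", "wss"}

# A leading scheme: ASCII letter, then letters, digits, '+', '-', '.', then ':'.
SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):")


def normalize_backslashes(url_string, base_scheme=None):
    """Replace '\\' with '/' when the effective scheme is a relative scheme."""
    match = SCHEME_RE.match(url_string)
    # A relative reference has no scheme of its own, so fall back to the base.
    scheme = match.group(1) if match else base_scheme
    if scheme is not None and scheme.lower() in RELATIVE_SCHEMES:
        return url_string.replace("\\", "/")
    return url_string
```

A literal backslash can still be carried as %5C, which this sketch
leaves untouched.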

> 2)  file:// URIs are parsed as a "no authority" URL in Gecko.  Quoting the
> IDL comment:
>
> 35     /**
> 36      * blah:foo/bar    => blah:///foo/bar
> 37      * blah:/foo/bar   => blah:///foo/bar
> 38      * blah://foo/bar  => blah://foo/bar
> 39      * blah:///foo/bar => blah:///foo/bar
> 40      */
>
> where the thing on the left is the input string and the thing on the right
> is the normalized form that the parser produces from it.  Note that this is
> different from how HTTP URIs are parsed, for all except the item on line
> number 38 there.

The parser in the specification should handle these in the same way. I
have not introduced a "no authority" concept, however. The parser in
the specification also preserves the host, as other user agents seem
to preserve it.
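
The four cases in the quoted IDL comment can be reproduced with a small
Python sketch. It only mirrors the input/output pairs listed there; the
function name is hypothetical, and neither Gecko nor the specification
is claimed to work this way internally.

```python
def normalize_no_authority(url_string):
    """Map the inputs from the quoted IDL comment to their normalized form."""
    scheme, sep, rest = url_string.partition(":")
    if not sep:
        raise ValueError("input has no scheme")
    if rest.startswith("//"):
        # blah://foo/bar and blah:///foo/bar are returned unchanged,
        # so a host such as "foo" is preserved.
        return scheme + ":" + rest
    # blah:foo/bar and blah:/foo/bar both gain an empty authority.
    return scheme + ":///" + rest.lstrip("/")
```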

> 4)  For "no authority" URLs, including file://, on Windows and OS/2 only, if
> what looks like the authority section looks like a drive letter, it's treated as
> part of the path.  For example, "file://c:/" is treated as the filename
> "c:\".  "Looks like a drive letter" is defined as "ASCII letter (any case),
> followed by a ':' or '|' and then followed by end of string or '/' or '\\'".
> I'm not sure why this is checking for '\\' again, honestly.  ;)

Is this part of URL parsing or part of doing something with the
resulting URL? (I do not plan on defining the latter because there's
no observable difference from the web and it's platform-dependent.)
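
For reference, the quoted "looks like a drive letter" rule can be
written down as a short Python sketch. The function name is
hypothetical, and the input is assumed to be whatever follows "file://".

```python
import string


def looks_like_drive_letter(text):
    """ASCII letter, then ':' or '|', then end of string or '/' or '\\'."""
    if len(text) < 2:
        return False
    if text[0] not in string.ascii_letters:
        return False
    if text[1] not in (":", "|"):
        return False
    # Either the string ends here or a path separator follows.
    return len(text) == 2 or text[2] in ("/", "\\")
```

With this rule, the "c:/" in "file://c:/" matches, so it would be
treated as part of the path rather than as an authority.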

> 5)  When parsing a "no authority" URL (including file://), and when item 4
> above does not apply, it looks like Gecko skips everything after "file://"
> up until the next '/', '?', or '#' char before parsing path stuff.

So the host is dropped? This is not what other user agents do, and
http://www.cs.tut.fi/~jkorpela/fileurl.html suggests it might be
useful in some cases. I do not know much about file: URLs, however, so
I cannot say whether that is still true.
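
A minimal Python sketch of the skipping behaviour described in 5),
assuming the input is the text that follows "file://" and that item 4
did not apply. The function name is hypothetical.

```python
import re


def skip_authority(after_slashes):
    """Drop everything up to the next '/', '?', or '#'; keep the remainder."""
    match = re.search(r"[/?#]", after_slashes)
    # If no delimiter is found, nothing is left to parse as a path.
    return after_slashes[match.start():] if match else ""
```

Under this sketch "host/dir/file" becomes "/dir/file", which is where
the host is lost.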

> 6)  On Windows and OS/2, when dynamically parsing a path for a "no
> authority" URL (not sure whether this is actually web-exposed, fwiw...)
> Gecko will do something involving looking for a path that's only an ASCII
> letter followed by ':' or '|' followed by end of string.  I'm not quite sure
> what that part is about...  It might have to do with the fact that URI
> objects in Gecko can have concepts of "directory", "filename", "extension"
> or something like that.
>
> 7)  When doing URI equality comparisons, if two file:// URIs only differ in
> their directory/filename/extension (so the actual file path), then an
> equality comparison is done on the underlying file path objects.  What this
> means depends on the OS.  On "Unix" this is just a straight-up byte by byte
> compare of file paths.  I think OS X now follows the "Unix" code path as do
> most other supported platforms.  But note that "file path" in this case is
> normalized in various ways.  Specifically: trailing '/' are stripped and
> some sort of normalization of HFS paths (possibly with a volume name) to
> POSIX paths is done on OSX.  One result of the latter is that
> file:///Users%2fbzbarsky ends up seeing my home directory, which is ...
> slightly surprising.  On "Unix", the path bytes are treated as UTF-8 if
> they're valid UTF-8, else treated as whatever the current locale charset is,
> I think.  Oh, and there is some sort of escaping going on for directory
> names, filenames, extensions.  Not sure what that's about, if anything.  The
> URI-escaping code is black magic, but I'm happy to run some black-box tests
> on it if someone wants to provide test strings.
>
> The things that don't go through the "Unix" code for this stuff are Windows
> and OS/2.  I'm not going to dig through the OS/2 stuff, but on Windows if
> the filename contains a nonempty directory name and the second char is '|'
> that's converted to a ':'.  Again, escaping for directory names and file
> names and extensions.  Again, things that look like UTF-8 are treated thus
> and other stuff uses the current codepage. After all that, the actual
> equality comparison is done via _wcsicmp on the return value of
> GetShortPathNameW.  So whatever things that combination considers equal are
> equal.
>
> 8)  When actually resolving a file:// URL, the underlying file path object
> as described above is used to get the data.  Plus there's a bit of weirdness
> about symlinks, I think...  Mostly affects what's shown in the url bar when
> pointing the browser to a symlink.

These points do not seem to be about parsing, correct?

-- 
http://annevankesteren.nl/