[whatwg] URL standard: Query string parsing; host parsing

Wed Mar 13 10:38:36 PDT 2013

(This was originally a bug report, but I was told to e-mail instead.  I have
also added a second issue.)

-- Non-relative URLs in the query string --

Earlier I posted an issue with serializing the query in non-relative URLs. But after
I read more about URIs, I am not sure whether the scheme data and query string
should be kept separate.  There is a distinction between how the URL specification
categorizes URLs and how the URI standards (RFC3986 and RFC3987) classify URIs.

Both standards allow fragments to appear in all URLs/URIs, but they differ on whether
a query string is parsed.  In the URL standard, query strings can occur in all URLs, but
in the URI standards, a query string is not parsed if the URI contains a scheme but
the scheme data doesn't begin with a slash (that is, if the URI is an "opaque" URI).

Take the following as an example:

mailto:me@example.com?subject=Hi

In the URL standard, the URL is parsed as:

scheme - mailto
scheme data - me@example.com
query - subject=Hi

but in the URI standards, the URI is parsed as:

scheme - mailto
scheme-specific part - me@example.com?subject=Hi

Here, in the mailto scheme, separating the scheme data and the query may be a useful distinction.

As another example, the string

jar:http://example.com/jar?x=1!/com/example/Foo.class

is parsed in the URI standards as:

scheme - jar
scheme-specific part - http://example.com/jar?x=1!/com/example/Foo.class

but in the URL standard as:

scheme - jar
scheme data - http://example.com/jar
query - x=1!/com/example/Foo.class

A more useful split for the jar scheme would be "http://example.com/jar?x=1"
and "com/example/Foo.class", but that split is specific to the jar scheme.
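
To make that jar-specific split concrete, here is a minimal sketch.  The
function name split_jar is my own invention, and I am assuming that the first
"!/" separates the inner URL from the entry path; this is not taken from either
specification.

```python
def split_jar(url):
    """Split a jar: URL into (inner URL, entry path).

    Assumption: the first "!/" marks the boundary between the two parts.
    """
    scheme, _, rest = url.partition(":")
    if scheme.lower() != "jar":
        raise ValueError("not a jar URL: %r" % url)
    inner, sep, entry = rest.partition("!/")
    # If there is no "!/", report that no entry path is present.
    return inner, (entry if sep else None)


inner, entry = split_jar("jar:http://example.com/jar?x=1!/com/example/Foo.class")
print(inner)
print(entry)
```

With a split like this, the "?x=1" stays with the inner http URL instead of
being treated as the query of the jar URL itself.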

This shows that while it's useful for some schemes to parse the query string, it's not so useful for others.  That's because not all schemes recognize a query string in opaque URIs, and each scheme has different parsing rules.  In both examples, mailto and jar are not relative schemes in the URL standard.
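
The two ways of dividing up such a URL can be sketched roughly as follows.
This is only an illustration of the two breakdowns shown in the examples above;
the function names are hypothetical, fragments are ignored for brevity, and
neither function is the actual algorithm from either specification.

```python
def split_opaque(url):
    """Keep everything after the first ':' as one scheme-specific part."""
    scheme, _, rest = url.partition(":")
    return {"scheme": scheme, "scheme_specific_part": rest}


def split_with_query(url):
    """Additionally split the scheme data from the query at the first '?'."""
    scheme, _, rest = url.partition(":")
    data, sep, query = rest.partition("?")
    # query is None when the URL contains no '?' at all.
    return {"scheme": scheme, "scheme_data": data, "query": query if sep else None}


for example in ("mailto:me@example.com?subject=Hi",
                "jar:http://example.com/jar?x=1!/com/example/Foo.class"):
    print(split_opaque(example))
    print(split_with_query(example))
```

For the mailto example the second function yields a usable query, but for the
jar example it cuts the string in the middle of the inner URL.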

But what about a scheme that _is_ a relative scheme?

The URL "http:example.com" would be parsed as follows:

in the URL standard:

scheme - http
path - example.com

or in the URI standards:

scheme - http
scheme-specific part - example.com

(Since the URL doesn't contain a slash, "example.com" is not treated as a host;
in fact, this URL would be disallowed under RFC2616 section 3.2.2, and for the
other relative schemes, the relevant RFCs don't seem to allow a syntax like that.)

But when someone enters that URL in Firefox or Google Chrome, it gets treated like
"http://example.com/" and is probably parsed that way too.

So the following questions should be discussed:

- Should the URL standard not parse the query string in the "scheme data" state?  
This will allow jar to work well, but may be inconvenient for mailto and other schemes, 
since it requires an additional step by the application.
- Should the URL standard parse the query string only for certain schemes that allow it, 
such as mailto?  This will require adding another category of schemes in addition to 
"relative schemes".
- As stated above, the scheme data in "relative" schemes must start with "//", so they
are, mostly correctly, handled differently.  But there are other "non-relative" schemes,
such as nntp, that follow the same rules.  Should those schemes be added to the
list of relative schemes?  Or should the URL standard parse all URLs with a scheme and "//" at the start like "hierarchical" URIs? (The list of currently registered schemes is at this page: <http://www.iana.org/assignments/uri-schemes.html>.)
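
The last option in that question (deciding by the presence of "//" rather than
by a list of schemes) could look something like the sketch below.  The function
name and the labels "hierarchical" and "opaque" are my own shorthand, and the
nntp host is a made-up example.

```python
def classify(url):
    """Classify a URL by its syntax alone, without a per-scheme list."""
    scheme, sep, rest = url.partition(":")
    if not sep:
        return "no scheme"
    # Anything whose scheme data starts with "//" is treated as hierarchical.
    return "hierarchical" if rest.startswith("//") else "opaque"


for example in ("nntp://news.example.com/some.group",   # hypothetical host
                "mailto:me@example.com?subject=Hi",
                "http:example.com"):
    print(example, "->", classify(example))
```

Note that under this rule "http:example.com" would count as opaque, which is
not how the browsers mentioned above seem to treat it.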

-- Host parsing and Unicode characters --

Rule 2 of the host parser says "Let host be the result of running utf-8's decoder on the percent decoding of input."  But the percent decoding algorithm only works on ASCII strings, and has undefined behavior on Unicode strings.  This may preclude the use of Unicode characters in host names, especially in IDNA, which probably isn't the intent.  Accordingly, should this rule and/or the percent decoding algorithm be redefined to allow Unicode characters here? (A related question is whether the URL standard should just go ahead and adopt Unicode Technical Standard 46 for IDNA, but that issue need not be answered now.)
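
One possible way to make the rule well defined for Unicode input is sketched
below: encode the input as UTF-8 first, percent-decode the resulting bytes, and
only then run the UTF-8 decoder.  This is just my assumption about how the rule
could be redefined, not what the specification currently says, and the function
names are hypothetical.

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"


def percent_decode(data: bytes) -> bytes:
    """Percent-decode a byte string; a malformed '%' is copied through as-is."""
    out = bytearray()
    i = 0
    while i < len(data):
        if (data[i] == 0x25 and i + 2 < len(data)
                and data[i + 1] in HEX_DIGITS and data[i + 2] in HEX_DIGITS):
            out.append(int(data[i + 1:i + 3].decode("ascii"), 16))
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


def decode_host(host: str) -> str:
    """Assumed rule: UTF-8 encode, percent-decode the bytes, UTF-8 decode."""
    return percent_decode(host.encode("utf-8")).decode("utf-8")


print(decode_host("ex%61mple.com"))
print(decode_host("b\u00fccher.de") == decode_host("b%C3%BCcher.de"))
```

In this sketch a host whose percent-decoded bytes are not valid UTF-8 raises
UnicodeDecodeError; the specification would have to say what happens in that
case.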

------------------------------------
I don't know what the correct answers to these issues should be, so please take your time in answering.

--Peter