[whatwg] [URL] Starting work on a URL spec

Mon Jul 26 21:12:41 PDT 2010

On Jul 25, 2010, at 11:16 PM, Adam Barth wrote:

> 2010/7/26 Maciej Stachowiak <mjs at apple.com>:
>> On Jul 25, 2010, at 5:57 AM, Adam Barth wrote:
>>> 2010/7/24 Maciej Stachowiak <mjs at apple.com>:
>>>> On Jul 24, 2010, at 9:55 AM, Adam Barth wrote:
>>>>> 2010/7/23 Ian Fette (イアンフェッティ) <ifette at google.com>:
>>>>>> http://code.google.com/apis/safebrowsing/developers_guide_v2.html#Canonicalization lists
>>>>>> some interesting cases we've come across on the anti-phishing team in
>>>>>> Google. To the extent you're concerned with / interested in
>>>>>> canonicalization, it may be worth taking a look at (not to suggest you
>>>>>> follow that in determining how to parse/canonicalize URLs, but rather to
>>>>>> make sure that you have some "correct" way of handling the listed URLs).
>>>>> 
>>>>> Thanks.  That's helpful.
>>>>> 
>>>>>> BTW, are you covering canonicalization?
>>>>> 
>>>>> Yes.  The three main things I'm hoping to cover are parsing,
>>>>> canonicalization, and resolving relative URLs.
>>>> 
>>>> Is there any place in the Web platform where "canonicalize" is exposed by itself in a Web-facing way? I think resolve against a base and parse into components are the only algorithms whose effects can be observed directly. I think we only need to spec "canonicalize" if it turns out to be a useful subroutine.
>>> 
>>> As far as I know, you can only see f(x) =
>>> canonicalize(parse(resolve(x))) and also some breakdown components of
>>> f(x) in HTMLAnchorElement and window.location.hash (and friends).
>>> 
>>> Conceptually, it's a bit easier to think about them as three separate
>>> functions.  The main difference between parse and canonicalize is that
>>> parse segments the input and canonicalize takes the segments, mutates
>>> them, and assembles them into a new string.
>>> 
>>> I haven't studied resolve in as much detail yet, so I'm less clear how
>>> that fits into the puzzle.
>> 
>> I would consider canonicalize() to be part of resolve(). Every time you retrieve a "cooked" URL (as opposed to original source text), you both resolve it against a possible base and canonicalize it as a single step. The two are not exposed separately. It's not clear to me that making this operation into three separate steps with a parse in the middle is helpful, or even representative of a good implementation strategy. I would think of parse() as something that happens after canonicalization in the cases where single components of the URL are exposed.
> 
> That's an interesting way to think about what's going on.  Different
> parts of the URL get different canonicalization transformations
> applied to them.  For example, the range of characters that makes sense
> in a host name is different from the range that makes sense in a port or
> query, so, in some sense, the canonicalization algorithm needs to
> understand something about how the URL parses, or at least how to
> distinguish host names from, e.g., ports and queries.

Yes, but the relative resolution algorithm needs to find URL part boundaries as well. I guess part of the issue here is that we have two different senses of "parse":

(1) Find the URL component boundaries in a source string, to be used by other algorithms for reference purposes. In that sense, you may need to do it to both the base URL and the possibly-relative reference before resolve(). However, this step isn't really exposed directly to the Web.

(2) Extract URL components of a resolved canonicalized URL, with the appropriate post-processing to expose them via APIs like Location and HTMLAnchorElement.

I've been thinking of parse() in sense #2, since that is the version actually exposed as an API. You can think of it as taking a resolved, canonicalized URL as input and producing a tuple of strings representing the components as output. The only other public operation is resolve+canonicalize, which conceptually takes a base URL, a possibly relative URL reference, and an optional document encoding as input, and produces the resolved, canonicalized URL as output.
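
To make that factoring concrete, here is a minimal Python sketch of the two public operations. Everything in it is an assumption made for illustration: the names resolve_and_canonicalize, parse, and Components are hypothetical, the canonicalization rules are a toy (lowercase the host, default an empty path to "/", percent-encode the query using the document encoding), and urllib.parse follows the RFC rules rather than what browsers actually do. It is only meant to show the shape of the inputs and outputs, not the algorithm a spec would define.

```python
from typing import NamedTuple, Optional
from urllib.parse import quote, urljoin, urlsplit, urlunsplit


class Components(NamedTuple):
    """Hypothetical tuple of strings, loosely modeled on Location attributes."""
    protocol: str
    host: str
    pathname: str
    search: str
    hash: str


def resolve_and_canonicalize(base: str, reference: str,
                             encoding: Optional[str] = None) -> str:
    """Single public step: resolve against a base, then canonicalize.

    The canonicalization here is a toy stand-in, not browser behavior.
    """
    resolved = urljoin(base, reference.strip())
    parts = urlsplit(resolved)  # sense #1: find boundaries, internal only
    # Encode non-ASCII query characters with the document encoding;
    # existing escapes and common delimiters are left untouched.
    query = quote(parts.query, safe="=&%+/?:;@,", encoding=encoding or "utf-8")
    return urlunsplit((parts.scheme, parts.netloc.lower(),
                       parts.path or "/", query, parts.fragment))


def parse(url: str) -> Components:
    """Sense #2: split an already resolved, canonicalized URL for API exposure."""
    parts = urlsplit(url)
    return Components(
        protocol=parts.scheme + ":",
        host=parts.netloc,
        pathname=parts.path,
        # Post-processing: the delimiter appears only when the part is non-empty.
        search="?" + parts.query if parts.query else "",
        hash="#" + parts.fragment if parts.fragment else "",
    )


if __name__ == "__main__":
    url = resolve_and_canonicalize("HTTP://Example.COM/a/b", "../c?q=é#top",
                                   encoding="iso-8859-1")
    print(url)
    print(parse(url))
```

Note that the boundary-finding in sense #1 shows up only as an internal helper call inside resolve_and_canonicalize; the caller never sees it.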

While there are other ways to factor these operations, a different factoring will make it less obvious how to glue them to the other relevant specs.

Regards,
Maciej