[whatwg] [URL] Starting work on a URL spec

Sun Jul 25 23:16:03 PDT 2010

2010/7/26 Maciej Stachowiak <mjs at apple.com>:
> On Jul 25, 2010, at 5:57 AM, Adam Barth wrote:
>> 2010/7/24 Maciej Stachowiak <mjs at apple.com>:
>>> On Jul 24, 2010, at 9:55 AM, Adam Barth wrote:
>>>> 2010/7/23 Ian Fette (イアンフェッティ) <ifette at google.com>:
>>>>> http://code.google.com/apis/safebrowsing/developers_guide_v2.html#Canonicalization lists
>>>>> some interesting cases we've come across on the anti-phishing team in
>>>>> Google. To the extent you're concerned with / interested in
>>>>> canonicalization, it may be worth taking a look at (not to suggest you
>>>>> follow that in determining how to parse/canonicalize URLs, but rather to
>>>>> make sure that you have some "correct" way of handling the listed URLs).
>>>>
>>>> Thanks.  That's helpful.
>>>>
>>>>> BTW, are you covering canonicalization?
>>>>
>>>> Yes.  The three main things I'm hoping to cover are parsing,
>>>> canonicalization, and resolving relative URLs.
>>>
>>> Is there any place in the Web platform where "canonicalize" is exposed by itself in a Web-facing way? I think resolve against a base and parse into components are the only algorithms whose effects can be observed directly. I think we only need to spec "canonicalize" if it turns out to be a useful subroutine.
>>
>> As far as I know, you can only see f(x) =
>> canonicalize(parse(resolve(x))) and also some breakdown components of
>> f(x) in HTMLAnchorElement and window.location.hash (and friends).
>>
>> Conceptually, it's a bit easier to think about them as three separate
>> functions.  The main difference between parse and canonicalize is that
>> parse segments the input and canonicalize takes the segments, mutates
>> them, and assembles them into a new string.
>>
>> I haven't studied resolve in as much detail yet, so I'm less clear how
>> that fits into the puzzle.
>
> I would consider canonicalize() to be part of resolve(). Every time you retrieve a "cooked" URL (as opposed to original source text), you both resolve it against a possible base and canonicalize it as a single step. The two are not exposed separately. It's not clear to me that making this operation into three separate steps with a parse in the middle is helpful, or even representative of a good implementation strategy. I would think of parse() as something that happens after canonicalization in the cases where single components of the URL are exposed.

That's an interesting way to think about what's going on.  Different
parts of the URL get different canonicalization transformations
applied to them.  For example, the range of characters that makes sense
in a host name differs from the range that makes sense in a port or
query, so, in some sense, the canonicalization algorithm needs to
understand something about how the URL parses, or at least how to
distinguish host names from, e.g., ports and queries.
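
To make that concrete, here is a rough Python sketch of canonicalization
that has to segment the URL before it can pick a rule for each piece.
Everything in it is an assumption for illustration: the function names,
the DEFAULT_PORTS table, and the per-component rules (lowercase the
host, normalize the port as a decimal number, percent-encode the query)
are hypothetical and are not what any browser or the eventual spec does.
It also ignores userinfo, IPv6 literals, and fragments.

```python
from urllib.parse import quote, urlsplit

# Assumption: only these two schemes have a default port in this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize_host(host: str) -> str:
    # Hypothetical host rule: host names are case-insensitive, so lowercase.
    return host.lower()


def canonicalize_port(port: str, scheme: str) -> str:
    # Hypothetical port rule: decimal digits only, no leading zeros,
    # and the scheme's default port is dropped entirely.
    if port == "":
        return ""
    number = int(port, 10)
    if DEFAULT_PORTS.get(scheme) == number:
        return ""
    return str(number)


def canonicalize_query(query: str) -> str:
    # Hypothetical query rule: percent-encode anything outside a small
    # safe set, leaving existing escapes ("%") and delimiters alone.
    return quote(query, safe="=&%+")


def canonicalize(url: str) -> str:
    # Segment first: the rules above cannot be chosen without knowing
    # which characters belong to the host, the port, and the query.
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if ":" in parts.netloc:
        host, _, port = parts.netloc.rpartition(":")
    else:
        host, port = parts.netloc, ""

    host = canonicalize_host(host)
    port = canonicalize_port(port, scheme)
    authority = host + (":" + port if port else "")

    result = f"{scheme}://{authority}{parts.path or '/'}"
    query = canonicalize_query(parts.query)
    if query:
        result += "?" + query
    return result


print(canonicalize("HTTP://Example.COM:0080/Path?q=A b"))
```

Note that lowercasing is applied to the host but not to the path or
query, and the numeric cleanup is applied only to the port, which is
the sense in which the canonicalizer depends on the segmentation.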

Adam