[whatwg] [URL] Starting work on a URL spec

Sun Jul 25 23:00:26 PDT 2010

On Jul 25, 2010, at 5:57 AM, Adam Barth wrote:

> 2010/7/24 Maciej Stachowiak <mjs at apple.com>:
>> On Jul 24, 2010, at 9:55 AM, Adam Barth wrote:
>>> 2010/7/23 Ian Fette (イアンフェッティ) <ifette at google.com>:
>>>> http://code.google.com/apis/safebrowsing/developers_guide_v2.html#Canonicalization lists
>>>> some interesting cases we've come across on the anti-phishing team in
>>>> Google. To the extent you're concerned with / interested in
>>>> canonicalization, it may be worth taking a look at (not to suggest you
>>>> follow that in determining how to parse/canonicalize URLs, but rather to
>>>> make sure that you have some "correct" way of handling the listed URLs).
>>> 
>>> Thanks.  That's helpful.
>>> 
>>>> BTW, are you covering canonicalization?
>>> 
>>> Yes.  The three main things I'm hoping to cover are parsing,
>>> canonicalization, and resolving relative URLs.
>> 
>> Is there any place in the Web platform where "canonicalize" is exposed by itself in a Web-facing way? I think resolve against a base and parse into components are the only algorithms whose effects can be observed directly. I think we only need to spec "canonicalize" if it turns out to be a useful subroutine.
> 
> As far as I know, you can only see f(x) =
> canonicalize(parse(resolve(x))) and also some breakdown components of
> f(x) in HTMLAnchorElement and window.location.hash (and friends).
> 
> Conceptually, it's a bit easier to think about them as three separate
> functions.  The main difference between parse and canonicalize is that
> parse segments the input and canonicalize takes the segments, mutates
> them, and assembles them into a new string.
> 
> I haven't studied resolve in as much detail yet, so I'm less clear how
> that fits into the puzzle.

I would consider canonicalize() to be part of resolve(). Every time you retrieve a "cooked" URL (as opposed to the original source text), you resolve it against a possible base and canonicalize it in a single step. The two are not exposed separately. It's not clear to me that splitting this operation into three separate steps with a parse in the middle is helpful, or even representative of a good implementation strategy. I would think of parse() as something that happens after canonicalization, in the cases where individual components of the URL are exposed.
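
To make that ordering concrete, here is a rough Python sketch, not a proposed algorithm. The names cooked_url() and component() are hypothetical, urllib.parse stands in for whatever a browser actually does, and lowercasing the whole authority is only an assumed placeholder for real canonicalization rules.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def cooked_url(raw: str, base: str) -> str:
    """Resolve `raw` against `base` and canonicalize, exposed as one step.

    Hypothetical helper. The only canonicalization done here is lowercasing
    the authority, an assumed stand-in for the real rules (it would also
    lowercase any userinfo, which a real implementation should not do).
    """
    resolved = urljoin(base, raw)  # resolve relative reference, drop dot segments
    parts = urlsplit(resolved)
    return urlunsplit(parts._replace(netloc=parts.netloc.lower()))


def component(cooked: str, name: str) -> str:
    """Parse an already-cooked URL only when a single component is wanted."""
    return getattr(urlsplit(cooked), name)


url = cooked_url("../d#frag", "http://Example.COM/a/b/c")
print(url)                         # http://example.com/a/d#frag
print(component(url, "path"))      # /a/d
print(component(url, "fragment"))  # frag
```

Callers only ever see the output of cooked_url(); parsing happens afterwards and only for component accessors such as the ones behind HTMLAnchorElement or window.location.hash.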

Regards,
Maciej