[whatwg] New URL Standard

Mon Sep 24 21:18:03 PDT 2012

This is Anne's spec, so I'll let him give more canonical answers, but:

On Mon, 24 Sep 2012, David Sheets wrote:
> 
> Your conforming WHATWG-URL syntax will have production rule alphabets 
> which are supersets of the alphabets in RFC3986.

Not necessarily, but that's certainly possible. Personally I would 
recommend that we not change the definition of what is conforming from the 
current RFC3986/RFC3987 rules, except to the extent that the character 
encoding affects it (as per the HTML standard today).

   http://whatwg.org/html#valid-url

> This is what I propose you define and it does not necessarily have to be 
> in BNF (though a production rule language of some sort probably isn't a 
> bad idea).

We should definitely define what is a conforming URL, yes (either 
directly, or by reference to the RFCs, as HTML does now). Whether prose or 
a structured language is the better way to go depends on what the 
conformance rules are -- HTML is a good example here: it has parts that 
are defined in terms of prose (e.g. the HTML syntax as a whole), and other 
parts that are defined in terms of BNF (e.g. constraints on the contents 
of <script> elements in certain situations). It's up to Anne.

> Error recovery and extended syntax for conforming representations are 
> orthogonal.

Indeed.

> How will WHATWG-URLs which use the syntax extended from RFC3986 map into 
> RFC3986 URI references for systems that only support those?

The same way that those systems handle invalid URLs today, I would assume. 
Do you have any concrete systems in mind here? It would be good to add 
them to the list of systems that we test. (For what it's worth, in 
practice, I've never found software that exactly followed RFC3986 and 
also rejected any non-conforming strings. There are just too many invalid 
URLs out there for that to be a viable implementation strategy.)

When I was testing this years ago, during my first attempt at fixing 
this, I found that some less widely tested software, e.g. wget(1), did 
not handle URLs in the same manner as more widely tested software, e.g. 
IE, with the result that Web pages were not handled interoperably 
between these two classes of software. This is the kind of thing we want 
to stop, by providing a single way to parse all input strings, valid or 
invalid, as URLs.
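
To make the conformance-versus-recovery distinction concrete, here is a 
minimal Python sketch. It is an illustration only, not the parsing 
algorithm of any specification or browser. The function names are 
hypothetical, and the check covers only the RFC3986 character alphabet, 
not the full grammar.

```python
import re

# RFC 3986 alphabet: unreserved, gen-delims, sub-delims, and "%" only
# when it begins a percent-encoded triplet. This checks characters only;
# it does not validate the URI grammar (scheme, authority, and so on).
_OK_CHAR = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]")
_ALLOWED = re.compile(
    r"(?:[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]|%[0-9A-Fa-f]{2})*"
)
# A token is either a well-formed percent-encoded triplet or one character.
_TOKEN = re.compile(r"%[0-9A-Fa-f]{2}|.", re.DOTALL)


def uses_only_rfc3986_characters(s: str) -> bool:
    """Conformance-style check (hypothetical name): accept or reject."""
    return _ALLOWED.fullmatch(s) is not None


def _keep_or_encode(match: "re.Match[str]") -> str:
    token = match.group(0)
    # Three-character tokens are valid triplets; keep them untouched.
    if len(token) == 3 or _OK_CHAR.fullmatch(token):
        return token
    # Anything else is percent-encoded byte by byte as UTF-8 (an
    # assumption made for this sketch).
    return "".join("%{:02X}".format(b) for b in token.encode("utf-8"))


def recover(s: str) -> str:
    """Recovery-style handling (hypothetical name): map the input into
    the RFC 3986 alphabet instead of rejecting it."""
    return _TOKEN.sub(_keep_or_encode, s)


if __name__ == "__main__":
    for raw in ("http://example.com/a b", "http://example.com/100%"):
        print(uses_only_rfc3986_characters(raw), recover(raw))
```

A consumer that only implements the first function rejects the input; 
one that implements the second accepts it and produces a string a strict 
RFC3986 system can at least tokenize. When different software makes 
different choices here, the same page is handled differently, which is 
the interoperability problem described above.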

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'