[whatwg] New URL Standard

Tue Sep 25 11:01:48 PDT 2012

On Mon, Sep 24, 2012 at 9:18 PM, Ian Hickson <ian at hixie.ch> wrote:
>
> This is Anne's spec, so I'll let him give more canonical answers, but:
>
> On Mon, 24 Sep 2012, David Sheets wrote:
>>
>> Your conforming WHATWG-URL syntax will have production rule alphabets
>> which are supersets of the alphabets in RFC3986.
>
> Not necessarily, but that's certainly possible. Personally I would
> recommend that we not change the definition of what is conforming from the
> current RFC3986/RFC3987 rules, except to the extent that the character
> encoding affects it (as per the HTML standard today).
>
>    http://whatwg.org/html#valid-url

I believe the '#' character in the fragment identifier qualifies as
such an extension of the RFC3986 alphabet.

>> This is what I propose you define and it does not necessarily have to be
>> in BNF (though a production rule language of some sort probably isn't a
>> bad idea).
>
> We should definitely define what is a conforming URL, yes (either
> directly, or by reference to the RFCs, as HTML does now). Whether prose or
> a structured language is the better way to go depends on what the
> conformance rules are -- HTML is a good example here: it has parts that
> are defined in terms of prose (e.g. the HTML syntax as a whole), and other
> parts that are defined in terms of BNF (e.g. constraints on the contents
> of <script> elements in certain situations). It's up to Anne.

HTML is far larger and more compositional than URI. I am confident
that, no matter what is specified in the WHATWG New URL Standard, a
formal language exists which can describe the structure of conforming
identifiers. If no such formal language can be described, the syntax
specification is likely to be incomplete or unsound.

>> How will WHATWG-URLs which use the syntax extended from RFC3986 map into
>> RFC3986 URI references for systems that only support those?
>
> The same way that those systems handle invalid URLs today, I would assume.
> Do you have any concrete systems in mind here? It would be good to add
> them to the list of systems that we test. (For what it's worth, in
> practice, I've never found software that exactly followed RFC3986 and
> also rejected any non-conforming strings. There are just too many invalid
> URLs out there for that to be a viable implementation strategy.)

The issue is not the rejection of incoming nonconforming reference
identifiers but the emission of only strictly conforming identifiers,
as Postel's Law (the Robustness Principle) prescribes. I know of
several URI implementations that, given a nonconforming reference
identifier, will only output conforming identifiers. Indeed, the
standard under discussion will behave in exactly this way.

In chains of URI processors, this leads to a loss of information that
can and will change the meaning of identifiers.
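
To make the concern concrete, here is a minimal sketch in Python. The
function emit_conforming is hypothetical and is not taken from any
implementation mentioned in this thread; it assumes a processor that
percent-encodes whatever RFC3986 disallows in a fragment (which
includes a literal '#') and leaves existing percent-escapes alone.

```python
from urllib.parse import quote

# Characters RFC 3986 allows in a fragment besides ALPHA / DIGIT:
# the rest of unreserved, sub-delims, ":" "@" "/" "?", plus "%" so
# that existing percent-escapes pass through untouched.
FRAGMENT_SAFE = "-._~!$&'()*+,;=:@/?%"


def emit_conforming(reference: str) -> str:
    """Hypothetical processor: accept anything, emit only RFC 3986 fragments."""
    base, sep, fragment = reference.partition("#")
    if not sep:
        # No fragment present, nothing to rewrite in this sketch.
        return reference
    return base + "#" + quote(fragment, safe=FRAGMENT_SAFE)


# Two different inputs: one nonconforming, one already conforming.
a = emit_conforming("http://example.com/doc#a#b")
b = emit_conforming("http://example.com/doc#a%23b")
```

Under these assumptions both inputs come out as the same string, so a
processor further down the chain cannot tell which one was originally
written.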

> I remember when I was testing this years ago, when doing the first pass on
> attempting to fix this, that I found that some less widely tested
> software, e.g. wget(1), did not handle URLs in the same manner as more
> widely tested software, e.g. IE, with the result being that Web pages were
> not handled interoperably between these two software classes. This is the
> kind of thing we want to stop, by providing a single way to parse all
> input strings, valid or invalid, as URLs.

Was wget in violation of the RFC? Was IE more lenient?

If every string, valid or invalid, is parseable as a URI reference, is
there an algorithm to accurately extract URIs from plain text?
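
As a rough illustration of why I ask, the sketch below uses the
regular expression from RFC3986 Appendix B, which splits a reference
into components. It is not the parser proposed for the new standard,
and split_reference is a name made up for this example. The point is
only that a splitter which accepts every string cannot, by itself,
tell a URI apart from the prose around it.

```python
import re

# Component-splitting regular expression from RFC 3986, Appendix B.
# re.DOTALL is added here so that "." also covers newlines.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?",
    re.DOTALL,
)


def split_reference(text: str):
    """Split text into URI components, or return None if it does not match."""
    match = URI_REFERENCE.fullmatch(text)
    if match is None:
        return None
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


# A whole sentence is accepted as a single reference.
sentence = split_reference("See http://example.com/ for details.")
# So is text that contains no URI at all.
prose = split_reference("not a url at all")
```

Here the sentence is accepted as one reference whose "scheme" is
"See http", so parse success alone cannot be the extraction
criterion; some additional delimiting rule would be needed.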

> --
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'