[whatwg] New URL Standard

Tue Sep 25 11:42:55 PDT 2012

On Tue, 25 Sep 2012, David Sheets wrote:
> >
> > Not necessarily, but that's certainly possible. Personally I would 
> > recommend that we not change the definition of what is conforming from 
> > the current RFC3986/RFC3987 rules, except to the extent that the 
> > character encoding affects it (as per the HTML standard today).
> >
> >    http://whatwg.org/html#valid-url
> 
> I believe the '#' character in the fragment identifier qualifies.

Not sure what you mean.

Sounds like Anne is indeed expecting to widen the range of valid URLs 
though, so please disregard my comments on the matter. :-)

> > We should definitely define what is a conforming URL, yes (either 
> > directly, or by reference to the RFCs, as HTML does now). Whether 
> > prose or a structured language is the better way to go depends on what 
> > the conformance rules are -- HTML is a good example here: it has parts 
> > that are defined in terms of prose (e.g. the HTML syntax as a whole), 
> > and other parts that are defined in terms of BNF (e.g. constraints on 
> > the contents of <script> elements in certain situations).
> 
> HTML is far larger and more compositional than URI. I am confident that, 
> no matter what is specified in the WHATWG New URL Standard, a formal 
> language exists which can describe the structure of conforming 
> identifiers. If no such formal language can be described, the syntax 
> specification is likely to be incomplete or unsound.

Just because it's possible to use a formal language doesn't mean it's a 
good idea. It depends how clear it is. In the HTML spec, there are places 
where I've actually used a hybrid, using BNF with some terminals defined 
using prose because defining them in BNF, while possible, is confusing.

> >> How will WHATWG-URLs which use the syntax extended from RFC3986 map 
> >> into RFC3986 URI references for systems that only support those?
> >
> > The same way that those systems handle invalid URLs today, I would 
> > assume. Do you have any concrete systems in mind here? It would be 
> > good to add them to the list of systems that we test. (For what it's 
> > worth, in practice, I've never found software that exactly followed 
> > RFC3986 and also rejected any non-conforming strings. There are just 
> > too many invalid URLs out there for that to be a viable implementation 
> > strategy.)
> 
> It is not the rejection of incoming nonconforming reference identifiers 
> that causes issues but rather the emission of strictly conforming 
> identifiers by Postel's Law (Robustness Principle). I know of several 
> URI implementations that, given a nonconforming reference identifier, 
> will only output conforming identifiers. Indeed, the standard under 
> discussion will behave in exactly this way.
> 
> This leads to loss of information in chains of URI processors that can 
> and will change the meaning of identifiers.

I don't really follow. If you have any concrete examples, that would 
really help.

> > I remember when I was testing this years ago, when doing the first 
> > pass on attempting to fix this, that I found that some less widely 
> > tested software, e.g. wget(1), did not handle URLs in the same manner 
> > as more widely tested software, e.g. IE, with the result being that 
> > Web pages were not handled interoperably between these two software 
> > classes. This is the kind of thing we want to stop, by providing a 
> > single way to parse all input strings, valid or invalid, as URLs.
> 
> Was wget in violation of the RFC? Was IE more lenient?

The RFC is so vague about what to do with non-conforming content that it's 
really hard to say which was "in violation" or "more lenient".

But in any case that's the wrong way to look at it. There's legacy 
content, there's implementations, and there's the spec. The spec is (or 
should be) the most mutable of these; its goal should be to define how 
implementations should behave in order to make the content work 
interoperably amongst all of the implementations, and to define the best 
practice for content creators to avoid known dangers.

> If every string, valid or invalid, is parseable as a URI reference, is 
> there an algorithm to accurately extract URIs from plain text?

That would be an interesting thing to define, but in practice I don't 
think it's something implementors would care to follow. People tend to 
write URL fragments and expect them to be linked. For example, if I write, 
in an e-mail, the string google.com, people expect "google.com" to become 
a link to "http://google.com/" and for the comma to be ignored. Similarly, 
if I have a page on an intranet server and I write intranet/ianh/plan.txt, 
it would be useful if that was turned into a link to the file. But there's 
nothing to distinguish that from me writing freezing/ice/273.15K, which 
isn't intended to be a URL at all.

Given this, I think plain text renderers will be stuck with heuristics for 
some time to come. (Maybe even heuristics that involve actual DNS queries 
and HEAD requests to see if potential URLs are useful.)
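
To make the ambiguity concrete, here is a rough Python sketch of the kind 
of heuristic I mean. The names guess_links, CANDIDATE and TRAILING_PUNCT 
are made up for illustration, and the rule itself (anything shaped like 
host[/path], with trailing punctuation dropped and "http://" assumed) is 
an assumption on my part, not something any spec or product defines.

```python
import re

# Hypothetical heuristic for illustration only; not from any spec.
# Punctuation that commonly trails a URL in running text.
TRAILING_PUNCT = ".,;:!?)\"'"

# A candidate looks like host[.host...][/path]; no scheme is required.
CANDIDATE = re.compile(r"^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*(/\S*)?$")


def guess_links(text):
    """Return (token, guessed_url) pairs for tokens that look like URLs."""
    links = []
    for raw in text.split():
        # Drop trailing punctuation so "google.com," becomes "google.com".
        token = raw.rstrip(TRAILING_PUNCT)
        if "://" in token:
            # Already has a scheme; take it as written.
            links.append((token, token))
            continue
        # Require a dot or a slash so ordinary words are skipped.
        if ("." in token or "/" in token) and CANDIDATE.match(token):
            url = "http://" + token
            if "/" not in token:
                url += "/"  # bare host: add the root path
            links.append((token, url))
    return links
```

With this sketch, "google.com," yields a link to "http://google.com/" and 
intranet/ianh/plan.txt yields "http://intranet/ianh/plan.txt", as a reader 
would probably want. But freezing/ice/273.15K is linked too, as 
"http://freezing/ice/273.15K", because nothing in the text itself tells 
the two apart.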

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'