[whatwg] URL parsing and same-document references [was: Re: Citing multiple <blockquote> elements in HTML5]

Wed Feb 18 17:41:38 PST 2009

On Wed, 3 Dec 2008, Calogero Alex Baldacchino wrote:
> 
> My concern is, a character-by-character comparison between an id value 
> and a fragment identifier may fail several ways. What for href="#foo bar "
> and id="foo bar "? Actual rules would strip the trailing space only 
> for the href, so the matching would fail (but we might survive broken 
> links). Escaping both, then comparing would succeed, as well as first 
> escaping then unescaping the href value before comparing (should it be 
> pointed out, somewhere, that a fragment identifier must be unescaped 
> before comparing to an id or a name? is it and I've missed it? - having 
> space characters in the unreserved production means they don't need to be 
> escaped, but does it mean also they must be decoded from their 
> pct-production, after parsing and for resolving?).

The behavior specced now may change, but as it stands now unescaping is 
defined for fragment-identifier-to-id="" matching.
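
As a rough illustration only (this is not the spec's algorithm, and the 
helper names fragment_of and matches_id are made up for this sketch), the 
difference between comparing after unescaping and comparing the raw 
fragment looks something like this in Python:

```python
from urllib.parse import unquote


def fragment_of(href):
    """Return the text after the first '#', or None if there is no fragment."""
    _, sep, fragment = href.partition("#")
    return fragment if sep else None


def matches_id(href, element_id, unescape=True):
    """Compare a reference's fragment with an id value.

    With unescape=True the fragment is percent-decoded before comparing;
    with unescape=False the raw characters are compared exactly.
    Hypothetical helper for illustration, not any browser's real code.
    """
    fragment = fragment_of(href)
    if fragment is None:
        return False
    if unescape:
        fragment = unquote(fragment)
    return fragment == element_id
```

With unescaping, both href="#foo%20bar" and 
href="./example.html#foo%20bar" match id="foo bar"; with an exact 
comparison only the literal href="#foo bar" does.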

In general, though, the behaviour is constrained by what IE does and more 
to the point by what is needed by content that depends on what IE does.

(You sent another couple of e-mails on the topic; I understand -- mostly 
-- the points you make therein, and would like to refer you to the recent 
thread on the topic:

   http://lists.w3.org/Archives/Public/public-html/2009Feb/thread.html#msg407

...where the same issues were discussed with more concrete reference to 
actual implementations and constraints placed on us by legacy content.)

> > What terminology would you prefer rather than "subtree"? (We can't say 
> > document, since we are also trying to define conformance rules for 
> > disconnected subtrees handled from scripts.)
> 
> Uhm, it may depend on what kinds of manipulations you have in mind, whether
> the disconnected subtree must be anyway a whole document to fulfil the
> uniqueness rule, and perhaps also on what the subtree concept might be turned
> into by future DOM Core versions, so maybe just a clarification on what a
> subtree is with respect to both the document (as a tree) and the scripts
> handling possibilities might be enough, instead of searching a new
> terminology, just to 'scope' the id visibility. I mean, if the ID matching is
> relevant for scripts accessing the matching element through the
> getElementById() method, actually a document tree is always overlapping the
> concept of subtree, and a disconnected subtree must be a document without a
> browsing context; otherwise, if other dom manipulations are involved the
> concept of subtree may change, for instance a script might implement its own
> scanning routine, treating an id attribute as any other attribute and leading
> to the concept that any non-leaf node may be the root of a subtree (that is
> identifying a subtree with any possible document fragment); furthermore, a
> possible future version of DOM Core interfaces might move the getElementById
> method to the Node interface, leading to the same result. Thus, a generic
> definition of 'subtree' (or no definition, or a definition relying upon a
> specific DOM feature or on script handling) might result in a variable concept
> with a variable scope for the ID uniqueness, but might make sense in a working
> draft until at least a first definition of the Web DOM Core specification, or
> waiting for any reason arising to restrict or enlarge the concept; otherwise,
> if that's been stated with a large consensus that a subtree is always a
> document tree, the term might be changed into the expression "a document, with
> or without a browsing context", or (equivalently) be defined as "a document
> subtree having a node of type document as its root" (to cover the case of
> dynamically created documents). Otherwise, if a subtree can be either a whole
> document, or a document subtree detached from its owner document (i.e. a node
> removed from a document with its descendants, or a tree of nodes whose
> ownerDocument property is not defined or null), it might be defined just as
> such, leaving the term 'subtree' wherever it is now (but would such a
> manipulation be consistent with the - authoring - uniqueness rule when the
> subtree is inserted into an actual document?).

My brain got lost partway through reading the above, so I apologise if I 
missed a key point you were making.

Anyway, the spec now has the term "home subtree", which is defined in more 
detail than "subtree" was before. I hope this helps.

On Sat, 13 Dec 2008, Nils Dagsson Moskopp wrote:
> Am Freitag, den 12.12.2008, 20:36 +0100 schrieb Calogero Alex
> Baldacchino:
> >
> > The above (but the 'double check' I was suggesting) is about the way 
> > Firefox (2.x and 3.0.4) behaves (both href="#foo%20bar" and, in a 
> > different page, href="./example.html#foo%20bar" match id="foo bar"), 
> > while IE7 and Opera 9.x perform an exact comparison, and show, in the 
> address bar, a URL with any blank spaces intact, thus applying the 
> > relaxation allowed by URL parsing rules, but not conforming to RFC 
> > 3986, as a complete URI string.
>
> Whenever I copypaste an URI from the address bar to any other program, I 
> am severely annoyed by this, especially when spaces (delimiters !) are 
> part of the fake-URI. A chat or office program, for example, is unable 
> to highlight the fake-URI anymore, (how could it ?), also pasting it 
> into source code can create all kind of validation errors. And whenever 
> I get a bastardized URI via chat or mail, only a part of it is 
> clickable.
> 
> Can someone from the web browser faction please state if there is any 
> data to support breaking RFC-compatibility ? Because as I see it, it's 
> something that makes it appear nicer, but breaks whenever URIs are to be 
> transferred / communicated.

Note that pages that rely on this behaviour (either in the linking or the 
targeting) are non-conforming.

There are pages that depend on weird behavior here, as noted in the thread 
I mentioned near the top of this e-mail, but it may be that we can change 
the actual rules a bit to handle this better.

> Getting to the problem mentioned here, the robustness principle says 
> that id="foo bar" should be accepted, but nevertheless invalid - because 
> a fragment with a space can never be part of an URI. So IMHO, any 
> program should strive to accept broken URIs if they are unambiguous 
> (which they are here, because the address can hold only one URI at a 
> time), but never output them.

Agreed.
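
To make the "accept but never output" idea concrete, here is a minimal 
sketch in Python. It assumes that only the fragment needs fixing and that 
existing percent-escapes should be left alone; the name to_output_form and 
the safe-character list are choices made for this example, not anything 
the spec defines.

```python
from urllib.parse import quote

# Characters RFC 3986 allows unescaped in a fragment, beyond the letters,
# digits and "_.-~" that quote() never escapes. "%" is included so that
# existing escapes such as "%20" are not escaped a second time (a stray
# "%" that is not part of an escape passes through unchanged).
_FRAGMENT_SAFE = "!$&'()*+,;=:@/?%"


def to_output_form(reference):
    """Escape characters such as spaces in the fragment before output.

    Hypothetical helper: the part before '#' is returned untouched.
    """
    base, sep, fragment = reference.partition("#")
    if not sep:
        return reference
    return base + "#" + quote(fragment, safe=_FRAGMENT_SAFE)
```

A reference typed or pasted as "example.html#foo bar" would then be 
accepted as input but written back out as "example.html#foo%20bar".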

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'