[whatwg] URL parsing and same-document references [was: Re: Citing multiple <blockquote> elements in HTML5]
Ian Hickson
ian at hixie.ch
Wed Feb 18 17:41:38 PST 2009
On Wed, 3 Dec 2008, Calogero Alex Baldacchino wrote:
>
> My concern is that a character-by-character comparison between an id value
> and a fragment identifier may fail in several ways. What about href="#foo bar "
> and id="foo bar "? The current rules would strip the trailing space only
> for the href, so the matching would fail (but we might survive broken
> links). Escaping both, then comparing would succeed, as would first
> escaping then unescaping the href value before comparing (should it be
> pointed out, somewhere, that a fragment identifier must be unescaped
> before comparing to an id or a name? is it and I've missed it? - having
> space characters in the unreserved production means they don't need to be
> escaped, but does it also mean they must be decoded from their
> pct-production, after parsing and for resolving?).
The behaviour currently specified may change, but as it stands, unescaping
is defined for fragment-identifier-to-id="" matching.
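To make the failure case and the unescaping step concrete, here is a minimal sketch in Python. It is an illustration only, not the spec's algorithm: the function names are hypothetical, urllib.parse.unquote stands in for whatever decoding the spec defines, and str.strip() only approximates the stripping of leading and trailing space characters from an href value.

```python
from urllib.parse import unquote


def fragment_from_href(href: str) -> str:
    """Return the raw fragment of an href value (hypothetical helper).

    Assumption: surrounding whitespace is stripped from the attribute
    value before parsing, as described for href in the quoted message.
    """
    stripped = href.strip()
    # partition() yields ('before', '#', 'after'); 'after' is '' if no '#'.
    return stripped.partition("#")[2]


def fragment_matches_id(fragment: str, element_id: str) -> bool:
    """Percent-decode the fragment, then compare it exactly with the id."""
    return unquote(fragment) == element_id


# An escaped space in the fragment matches a literal space in the id.
print(fragment_matches_id(fragment_from_href("#foo%20bar"), "foo bar"))

# The trailing space is stripped from the href but not from the id,
# so this pair does not match.
print(fragment_matches_id(fragment_from_href("#foo bar "), "foo bar "))
```

The id value is deliberately left untouched here; only the fragment is decoded, which is the direction of unescaping the quoted question asks about.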
In general, though, the behaviour is constrained by what IE does and more
to the point by what is needed by content that depends on what IE does.
(You sent another couple of e-mails on the topic; I understand -- mostly
-- the points you make therein, and would like to refer you to the recent
thread on the topic:
http://lists.w3.org/Archives/Public/public-html/2009Feb/thread.html#msg407
...where the same issues were discussed with more concrete reference to
actual implementations and constraints placed on us by legacy content.)
> > What terminology would you prefer rather than "subtree"? (We can't say
> > document, since we are also trying to define conformance rules for
> > disconnected subtrees handled from scripts.)
>
> Uhm, it may depend on what kinds of manipulations you have in mind, whether
> the disconnected subtree must be anyway a whole document to fulfil the
> uniqueness rule, and perhaps also on what the subtree concept might be turned
> into by future DOM Core versions, so maybe just a clarification on what a
> subtree is with respect to both the document (as a tree) and the scripts
> handling possibilities might be enough, instead of searching a new
> terminology, just to 'scope' the id visibility. I mean, if the ID matching is
> relevant for scripts accessing the matching element through the
> getElementById() method, actually a document tree is always overlapping the
> concept of subtree, and a disconnected subtree must be a document without a
> browsing context; otherwise, if other dom manipulations are involved the
> concept of subtree may change, for instance a script might implement its own
> scanning routine, treating an id attribute as any other attribute and leading
> to the concept that any non-leaf node may be the root of a subtree (that is
> identifying a subtree with any possible document fragment); furthermore, a
> possible future version of DOM Core interfaces might move the getElementById
> method to the Node interface, leading to the same result. Thus, a generic
> definition of 'subtree' (or no definition, or a definition relying upon a
> specific DOM feature or on script handling) might result in a variable concept
> with a variable scope for the ID uniqueness, but might make sense in a working
> draft until at least a first definition of the Web DOM Core specification, or
> waiting for any reason arising to restrict or enlarge the concept; otherwise,
> if that's been stated with a large consensus that a subtree is always a
> document tree, the term might be changed into the expression "a document, with
> or without a browsing context", or (equivalently) be defined as "a document
> subtree having a node of type document as its root" (to cover the case of
> dynamically created documents). Otherwise, if a subtree can be either a whole
> document, or a document subtree detached from its owner document (i.e. a node
> removed from a document with its descendants, or a tree of nodes whose
> ownerDocument property is not defined or null), it might be defined just as
> such, leaving the term 'subtree' wherever it is now (but would such a
> manipulation be consistent with the - authoring - uniqueness rule when the
> subtree is inserted into an actual document?).
My brain got lost partway through reading the above, so I apologise if I
missed a key point you were making.
Anyway, the spec now has the term "home subtree", which is defined in more
detail than "subtree" was before. I hope this helps.
On Sat, 13 Dec 2008, Nils Dagsson Moskopp wrote:
> On Friday, 12.12.2008, at 20:36 +0100, Calogero Alex Baldacchino wrote:
> >
> > The above (except for the 'double check' I was suggesting) is about the way
> > Firefox (2.x and 3.0.4) behaves (both href="#foo%20bar" and, in a
> > different page, href="./example.html#foo%20bar" match id="foo bar"),
> > while IE7 and Opera 9.x perform an exact comparison, and show, in the
> > address bar, a URL with any blank spaces left in, thus applying the
> > relaxation allowed by URL parsing rules, but not conforming to RFC
> > 3986, as a complete URI string.
>
> Whenever I copy and paste a URI from the address bar into any other program, I
> am severely annoyed by this, especially when spaces (delimiters!) are
> part of the fake URI. A chat or office program, for example, is unable
> to highlight the fake URI anymore (how could it?), and pasting it
> into source code can create all kinds of validation errors. And whenever
> I get a bastardized URI via chat or mail, only a part of it is
> clickable.
>
> Can someone from the web browser faction please state whether there is any
> data to support breaking RFC compatibility? Because as I see it, it's
> something that makes it appear nicer, but breaks whenever URIs are to be
> transferred or communicated.
Note that pages that rely on this behaviour (either in the linking or the
targeting) are non-conforming.
There are pages that depend on weird behaviour here, as noted in the thread
I mentioned near the top of this e-mail, but it may be that we can change
the actual rules a bit to handle this better.
> Getting to the problem mentioned here, the robustness principle says
> that id="foo bar" should be accepted, but is nevertheless invalid, because
> a fragment with a space can never be part of a URI. So IMHO, any
> program should strive to accept broken URIs if they are unambiguous
> (which they are here, because the address can hold only one URI at a
> time), but never output them.
Agreed.
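The "accept broken input, never output it" principle could be sketched roughly as follows. This is an assumption-laden illustration, not anything defined in the spec: the function names are hypothetical, the set of characters left unescaped is taken from the fragment production in RFC 3986, and only the fragment is handled (spaces elsewhere in the URL are left alone).

```python
from urllib.parse import quote, unquote

# Characters RFC 3986 allows unescaped in a fragment, beyond the
# unreserved set that quote() never escapes: sub-delims, ':', '@', '/', '?'.
FRAGMENT_SAFE = "!$&'()*+,;=:@/?"


def serialize_fragment(raw_fragment: str) -> str:
    """Accept a possibly broken fragment and return an escaped form.

    Decoding first means an already-escaped fragment is not escaped twice.
    """
    return quote(unquote(raw_fragment), safe=FRAGMENT_SAFE)


def serialize_url(raw_url: str) -> str:
    """Re-serialize only the fragment part of a URL string (sketch)."""
    base, sep, fragment = raw_url.partition("#")
    return base + sep + serialize_fragment(fragment)


# A literal space is accepted on input but percent-encoded on output.
print(serialize_url("example.html#foo bar"))
# An already-escaped fragment comes out unchanged.
print(serialize_url("example.html#foo%20bar"))
```

With something along these lines, a string copied out of the address bar would contain no literal spaces in its fragment, while id="foo bar" would still be reachable.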
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'