[whatwg] URL parsing and same-document references [was: Re: Citing multiple <blockquote> elements in HTML5]

Thu Dec 4 16:57:55 PST 2008

Calogero Alex Baldacchino wrote:
>
> Maybe the first is wrong, and I'm still unsure of the second. My 
> concern is, a character-by-character comparison between an id value 
> and a fragment identifier may fail in several ways. What about
> href="#foo bar " and id="foo bar "? The current rules would strip the
> trailing space only from the href, so the match would fail (but we
> might survive broken links). Escaping both and then comparing would
> succeed, as would first escaping and then unescaping the href value
> before comparing (should
> it be pointed out, somewhere, that a fragment identifier must be 
> unescaped before comparing to an id or a name? is it and I've missed 
> it? Having space characters in the unreserved production means they
> don't need to be escaped, but does it also mean they must be decoded
> from their pct-encoded form after parsing, for resolving?). Likewise,
> stripping the trailing spaces in both cases would succeed, but would
> fail when comparing id="foo bar " with href="#foo bar%20" (which is a
> valid URL according to the current parsing rules), even with escaping
> rules (in this case the id value's trailing space must stay there). And
> what about id="foo%20bar" in http://foo.example.org/foo.html  and  
> href="#foo bar" on the same page, or on a page having the same base 
> URL, or a base element with href="http://foo.example.org/foo.html" ? 
> My point is that, since comparisons for matching purposes happen after
> URL parsing and resolution, and the id value is not involved in those
> steps, character-by-character comparisons may fail without a prior
> normalization of both the fragment identifier and the id value (or one
> of them). However, if the above is already solved by the parsing and
> resolving rules and I've misunderstood the spec, I withdraw all of it
> and apologize. Or, perhaps, must a valid URL with a valid fragment,
> which is equivalent to but does not exactly match an id value, be
> considered a broken link?
>
Maybe the above needs further clarification. Let me start from the URL
parsing (and resolving) rules: after the URL is validated, it is divided
into its components, but nothing is stated about normalization and/or
%-encoded characters. I think that applying some normalization would be
useful for parsing equivalent URLs in a consistent manner, helpful when
dealing with the interfaces for URL manipulation described in
section 2.5.5, and, last but not least, an improvement for matching
relative references (especially same-document references). A minimum
requirement, for the sake of standardization, could be to decode any
%-encoded characters in the <fragment> production that belong to the
<unreserved> production, as defined in RFC 3986 with the changes made by
the HTML 5 specification for URL parsing, restricted to the Unicode
ranges representing valid characters for an attribute value (those which
are prohibited neither as 'text' nor as 'character references').
This way, a character-for-character comparison between a fragment
identifier and an id attribute value, which would have been equivalent
but not matching without the normalization, should succeed most of the
time. The reason is that, as a consequence of the changes the current
HTML 5 specification makes to the <unreserved> production, such
characters may or may not be %-encoded in a valid URL, while an id value
is likely to contain them unencoded.
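
As a rough illustration of the normalization proposed above, here is a
minimal Python sketch. The function names normalize_fragment and
matches_id are made up for this example and are not from any spec. Note
one simplification: urllib.parse.unquote decodes every %-encoded
sequence, not only those in the extended <unreserved> production, so
this is broader than the minimum requirement described above.

```python
from urllib.parse import unquote


def normalize_fragment(url: str) -> str:
    """Return the %-decoded fragment of url, or '' if it has none.

    Hypothetical helper: decodes *all* %-encoded sequences, which is a
    simplification of the "unreserved only" rule discussed in the post.
    """
    _, sep, fragment = url.partition("#")
    return unquote(fragment) if sep else ""


def matches_id(url: str, element_id: str) -> bool:
    """Character-for-character comparison after decoding the fragment only."""
    return normalize_fragment(url) == element_id
```

With this sketch, href="#foo%20bar" and href="#foo bar" both match
id="foo bar", and href="#foo bar%20" matches id="foo bar " with its
trailing space preserved. The id value itself is never decoded here, so
id="foo%20bar" does not match href="#foo bar".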

After the above <fragment> normalization, a character-for-character
comparison would still fail if the id value contained any %-encoded
triplet representing a decoded character, such as "foo%20bar". Such a
value is awkward to deal with, since it can be the %-encoded form of
"foo bar", but also the decoded form of "foo%2520bar". In other words,
if we apply the same normalization to two complete URLs and then compare
them, the result is quite reliable, but if we start from a component
(such as a fragment identifier stored in an id attribute value) it is
not easy to tell whether any normalization has been applied, and which
one, so there is always a chance of false positives or false negatives.
According to RFC 3986, section "4.4. Same-Document Reference", the
correct interpretation of a URI as a same-document reference cannot be
taken as guaranteed. So a mismatch between, for instance, the decoded
fragment identifier "foo bar" and the id attribute value "foo%20bar"
can be acceptable, given (as I think) a wide majority of good matches.
Still, a kind of double check might be considered, such as:

- comparing the %-unescaped fragment identifier with the ID of each
element in the DOM;
- upon a perfect (exact) match, without unescaping the evaluated
element's ID, choosing that element as the referenced document part
(currently called "the indicated part of the document" in the spec) and
stopping;
- upon failure, applying a %-unescape algorithm to the ID, then
comparing again with the fragment identifier and, if they match, marking
the element as a 'possible choice';
- if there is no perfect match in the whole document, choosing the first
'possible choice', if any;
- if there is no match at all, the search for the referenced document
part fails.
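
The double check listed above could be sketched in Python roughly as
follows. This is only an illustration under assumptions: the function
name find_indicated_part is invented, the document is modelled as a
plain sequence of id strings in document order rather than a real DOM,
and urllib.parse.unquote stands in for the "%-unescape algorithm".

```python
from typing import Iterable, Optional
from urllib.parse import unquote


def find_indicated_part(fragment: str, ids: Iterable[str]) -> Optional[str]:
    """Return the id chosen by the two-level check, or None if nothing matches.

    `ids` is a hypothetical stand-in for the element IDs in document order.
    """
    target = unquote(fragment)      # %-unescaped fragment identifier
    possible = None                 # first 'possible choice', if any
    for element_id in ids:
        if element_id == target:
            # Perfect match on the raw ID: choose it and stop.
            return element_id
        if possible is None and unquote(element_id) == target:
            # Matches only after unescaping the ID: remember the first one.
            possible = element_id
    # No perfect match in the whole document: fall back to the first
    # 'possible choice', or None when there was no match at all.
    return possible
```

For example, with ids ["foo%20bar", "foo bar"] and the fragment
"foo%20bar", the exact match "foo bar" wins even though "foo%20bar"
comes first; with only ["foo%20bar"] in the document, that element is
returned as the fallback. Each ID is visited once, with at most one
extra unescape per element.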

Compared with a "single check" for an exact match, the overall
computational time should still grow only linearly with the number of
elements, so this should not be a performance issue.

Best regards, Alex.
