[whatwg] URL parsing and same-document references [was: Re: Citing multiple <blockquote> elements in HTML5]
Calogero Alex Baldacchino
alex.baldacchino at email.it
Thu Dec 4 16:57:55 PST 2008
Calogero Alex Baldacchino wrote:
>
> Maybe the first is wrong, and I'm still unsure of the second. My
> concern is that a character-by-character comparison between an id value
> and a fragment identifier may fail in several ways. What about
> href="#foo bar " and id="foo bar "? The current rules would strip the
> trailing space only from the href, so the match would fail (but we
> might survive broken links). Escaping both and then comparing would
> succeed, as would first escaping and then unescaping the href value
> before comparing. (Should it be pointed out somewhere that a fragment
> identifier must be unescaped before being compared to an id or a name?
> Is it already, and I've missed it? Having space characters in the
> unreserved production means they don't need to be escaped, but does it
> also mean they must be decoded from their pct-encoded form after
> parsing and for resolving?) Likewise, stripping the trailing spaces in
> both cases would succeed, but would fail when comparing id="foo bar "
> with href="#foo bar%20" (which is a valid URL according to the current
> parsing rules), even with escaping rules (in this case the trailing
> space of the id value must stay there). And what about id="foo%20bar"
> in http://foo.example.org/foo.html and href="#foo bar" on the same
> page, or on a page having the same base URL, or a base element with
> href="http://foo.example.org/foo.html"? My point is that, since
> comparisons for matching purposes happen after URL parsing and
> resolution, and the id value is not involved in those steps,
> character-by-character comparisons may fail without a prior
> normalization of both the fragment identifier and the id value (or of
> one of them). However, if the above is already solved by the parsing
> and resolving rules and I've misunderstood the spec, I withdraw all of
> this and apologize. Or, perhaps, must a valid URL with a valid
> fragment, which is equivalent to but does not exactly match an id
> value, be considered a broken link?
>
Maybe the above needs further clarification. Let me start from the URL
parsing (and resolving) rules: after the URL is validated, it is divided
into its components, but nothing is stated about normalization or about
%-encoded characters. I think that applying some normalization would
make it possible to parse equivalent URLs consistently, would help when
dealing with the interfaces for URL manipulation described in section
2.5.5 and, last but not least, would improve the matching of relative
references (especially same-document references). A minimum
requirement, for the sake of standardization, could consist of decoding
any %-encoded characters in the <fragment> production that belong to the
<unreserved> production, as defined in RFC 3986 with the changes made by
the HTML 5 specification for URL parsing, restricted to the Unicode
ranges representing valid characters for an attribute value (those that
are prohibited neither as 'text' nor as 'character references').
This way, a character-for-character comparison between a fragment
identifier and an id attribute value that are equivalent, but would not
match without the normalization, should succeed most of the time. The
reason is that, as a consequence of the changes the current HTML 5
specification applies to the <unreserved> production, such characters
may or may not be %-encoded in a valid URL, while an id value is likely
to contain them unencoded.
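To make the proposed <fragment> normalization concrete, here is a
minimal sketch in Python. It is only an illustration under stated
assumptions: the name normalize_fragment is hypothetical, the UNRESERVED
set is the RFC 3986 one plus the space character (standing in for the
wider <unreserved> production of the HTML 5 draft), and only single-byte
ASCII triplets are handled.

```python
import re
import string

# RFC 3986 unreserved characters. The space is added as an assumption,
# standing in for the HTML 5 draft's widened <unreserved> production.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~" + " ")

_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_fragment(fragment):
    """Decode only the %-encoded triplets that stand for unreserved characters."""
    def decode(match):
        char = chr(int(match.group(1), 16))
        # Leave the triplet untouched unless it encodes an unreserved character.
        return char if char in UNRESERVED else match.group(0)

    # A single pass: text produced by decoding is never decoded again.
    return _TRIPLET.sub(decode, fragment)


print(normalize_fragment("foo%20bar") == "foo bar")  # True
print(normalize_fragment("foo%2520bar"))             # foo%2520bar
```

With this, href="#foo%20bar" and href="#foo bar" both compare equal to
id="foo bar", while "%25" is left encoded because '%' is not in the
unreserved set.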
After the above <fragment> normalization, a character-for-character
comparison would still fail if the id value itself contained a %-encoded
triplet, as in "foo%20bar". Such a value is awkward to deal with anyway,
since it can be the %-encoded form of "foo bar" but also the decoded
form of "foo%2520bar". In other words, if we apply the same
normalization to two complete URLs and then compare them, the result is
quite reliable; but if we start from a single component (such as a
fragment identifier stored in an id attribute value), it is not easy to
tell whether any normalization has been applied, or which one, so there
is always a chance of false positives or false negatives. According to
RFC 3986, section "4.4. Same-Document Reference", the correct
interpretation of a URI as a same-document reference cannot be taken as
guaranteed, so a mismatch between, for instance, the decoded fragment
identifier "foo bar" and the id attribute value "foo%20bar" can be
acceptable, given what I think would be a wide majority of good
matches. Still, a kind of double check might be considered, such as:
- comparing the %-unescaped fragment identifier with the ID of each
element in the DOM;
- upon failure, applying a %-unescape algorithm to the ID, then
comparing it again with the fragment identifier and, if they match,
marking the element as a 'possible choice';
- upon a perfect (exact) match, found without unescaping the ID of the
element being evaluated, choosing that element as the referenced
document part (currently defined as "the indicated part of the
document" in the spec) and stopping;
- if there is no perfect match in the whole document, choosing the first
'possible choice', if any;
- if there is no match at all, the search for the referenced document
part fails.
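The steps above could be sketched as follows, again in Python. This is
just an illustration, not spec text: find_indicated_part is a
hypothetical name, the document is reduced to a list of ID values in
tree order, and urllib.parse.unquote is used as the %-unescape
algorithm, which decodes every triplet and is therefore broader than a
normalization limited to <unreserved> characters.

```python
from urllib.parse import unquote


def find_indicated_part(fragment, element_ids):
    """Return the index of the element the fragment refers to, or None.

    `element_ids` holds the ID values of the elements in document order.
    """
    target = unquote(fragment)  # the %-unescaped fragment identifier
    possible = None             # first 'possible choice' seen so far

    for index, element_id in enumerate(element_ids):
        if element_id == target:
            return index        # perfect match: choose it and stop
        if possible is None and unquote(element_id) == target:
            possible = index    # matches only after unescaping the ID

    # No perfect match: fall back to the first 'possible choice', if any.
    return possible


ids = ["intro", "foo%20bar", "foo bar"]
print(find_indicated_part("foo%20bar", ids))          # 2
print(find_indicated_part("foo bar", ["foo%20bar"]))  # 0
print(find_indicated_part("missing", ids))            # None
```

Note that the fragment "foo%2520bar" unescapes to "foo%20bar" and so
matches the second ID exactly, which is the ambiguity described earlier.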
Compared with a "single check" for an exact match, the overall
computation time should still grow only linearly with the number of
elements, so it should not be a performance issue.
Best regards, Alex.