[whatwg] URL parsing and same-document references [was: Re: Citing multiple <blockquote> elements in HTML5]
Calogero Alex Baldacchino
alex.baldacchino at email.it
Fri Dec 12 11:36:32 PST 2008
Calogero Alex Baldacchino wrote:
> Maybe the above needs further clarification. Let me start from the URL
> parsing (and resolving) rules: after the URL is validated, it is
> divided into its components, but nothing is stated about normalization
> and/or %-encoded characters. I think that applying some normalization
> would help to parse equivalent URLs in a consistent manner, would help
> when dealing with the interfaces for URL manipulation described in
> section 2.5.5, and, last but not least, would improve the matching of
> relative references (especially same-document references).
> A minimum requirement, for the sake of standardization, might consist
> of decoding any %-encoded characters in the <fragment> production that
> are part of the <unreserved> production as defined in RFC 3986, with
> the changes defined in the HTML 5 specification for URL parsing, and
> restricted to the Unicode ranges representing valid characters for an
> attribute value (those which are prohibited neither as 'text' nor
> as 'character references'). This way, a character-for-character
> comparison between a fragment identifier and an id attribute value
> that are equivalent, but would not match without the normalization,
> should succeed most of the time, because, as a consequence of the
> changes the current HTML 5 specification applies to the
> <unreserved> production, such characters might or might not be
> %-encoded in a valid URL, while an id value is likely to contain them
> non-encoded.
>
> After the above <fragment> normalization, a character-for-character
> comparison would still fail if the id value contained a %-encoded
> triplet standing for a decoded character, as in "foo%20bar". Such a
> value is awkward to deal with, though, since it can be the %-encoded
> form of "foo bar" but also the decoded form of "foo%2520bar". In other
> words, if we apply the same normalization to two complete URLs and
> then compare them, the result is quite reliable, but if we start from
> a single component (such as a fragment identifier stored in an id
> attribute value) it is not easy to tell whether any normalization has
> been applied, or which one, so false positives and false negatives are
> always possible. According to RFC 3986, section "4.4. Same-Document
> Reference", the correct interpretation of a URI as a same-document
> reference cannot be guaranteed, so a mismatch between, for instance,
> the decoded fragment identifier "foo bar" and the id attribute value
> "foo%20bar" may be acceptable, set against what I think would be a
> wide majority of good matches. Still, a kind of double check might be
> considered, such as:
>
> - comparing the %-unescaped fragment identifier with the ID of each
> element in the DOM;
> - upon failure, applying a %-unescape algorithm to the ID, then
> comparing again with the fragment identifier and, if matching, marking
> the element as a 'possible choice';
> - upon a perfect (exact) match, without unescaping the evaluated
> element ID, choosing that element as the referenced document part
> (currently defined as "the indicated part of the document" in the
> spec) and stopping;
> - without any perfect match in the whole document, choosing the first
> 'possible choice', if any;
> - without any match at all, the search for the referenced document
> part fails.
>
> Compared with a "single check" for an exact match, the overall
> computation time should still grow only linearly, so this should not
> be a performance issue.
>
> Best regards, Alex.
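
The 'double check' steps quoted above could be sketched roughly as
follows, in Python. This is only an illustration under some assumptions:
the function name find_indicated_part is hypothetical, the document is
reduced to a plain list of id values in document order, and
urllib.parse.unquote stands in for the "%-unescape algorithm", which the
proposal does not define precisely.

```python
from urllib.parse import unquote


def find_indicated_part(fragment, element_ids):
    """Return the id chosen by the 'double check', or None.

    `fragment` is the raw fragment identifier (without the leading '#');
    `element_ids` lists the id attribute values in document order.
    Both the name and the list-of-ids model are assumptions made for
    illustration only.
    """
    target = unquote(fragment)   # the %-unescaped fragment identifier
    possible_choice = None       # first id that matches only once unescaped

    for element_id in element_ids:
        if element_id == target:
            # Perfect match without unescaping the id: choose it and stop.
            return element_id
        if possible_choice is None and unquote(element_id) == target:
            # Matches only after unescaping the id: remember the first one.
            possible_choice = element_id

    # No perfect match anywhere: fall back to the first 'possible choice',
    # which is None when nothing matched at all.
    return possible_choice
```

With ids ["foo%20bar", "foo bar"], the fragment "foo%20bar" selects
"foo bar" (the exact match wins even though it comes later), while the
fragment "foo%2520bar" selects "foo%20bar". A single pass over the ids
is enough, which is consistent with the linear cost estimated above.
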
The above (except for the 'double check' I was suggesting) is roughly
how Firefox (2.x and 3.0.4) behaves: both href="#foo%20bar" and, in a
different page, href="./example.html#foo%20bar" match id="foo bar".
IE7 and Opera 9.x instead perform an exact comparison and show, in the
address bar, a URL with any blank spaces left unescaped, thus applying
the relaxation allowed by the URL parsing rules but not conforming to
RFC 3986 as a complete URI string. It seems that different browsers
implement (more or less) different normalization/resolution algorithms,
leading to different matches, so specifying a uniform behaviour
(whichever one) might be reasonable and useful.

The current resolving algorithm, while explicitly asking for %-encoding
in the path component and for conformance with RFC 3986 in general,
says nothing about fragment identifiers. The algorithm it refers to for
relative resolution (section 5.2 of RFC 3986), AIUI, might not require
the creation of a complete URI string; it could instead be accomplished
by returning an object holding a separate string for each URI part,
thus not necessarily requiring %-encoding and potentially leaving UAs a
certain degree of freedom.

Furthermore, about the URL decomposition attributes it is said, 'On
setting, the new value must first be mutated as described by the
"setter preprocessor" column, then mutated by %-escaping any characters
in the new value that are not valid in the relevant component as given
by the "component" column.' This seems to refer to the stricter RFC 3986
requirements (which in turn might be relaxed, since any part of a
decomposed URL may contain unescaped characters); however, the
"component" column points, for each component, to the corresponding
definition given for a parsed-URL component, which is not strictly
required to have escaped characters by the current parsing rules.

I'd suggest reconsidering the whole mechanism to avoid any free
interpretation and to make each phase/operation (parsing, resolving,
attribute setting) more consistent, both with the others and across
browsers, if possible. I'd also consider one or more DOM methods to
make it easy to compare URL strings and/or component attributes.
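
As an illustration of what such a comparison helper might do, here is a
minimal Python sketch that decodes only those %-encoded triplets that
stand for RFC 3986 <unreserved> characters, so that equivalent
components compare equal. The names decode_unreserved and
same_component are hypothetical, and the sketch covers only the ASCII
<unreserved> set of RFC 3986; the HTML 5 changes to that production are
left out.

```python
import re

# RFC 3986 section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(component):
    """Decode %-encoded triplets that represent unreserved characters.

    Every other triplet is kept, with its hex digits upper-cased so that
    "%2f" and "%2F" compare equal (hypothetical helper, ASCII only).
    """
    def replace(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0).upper()

    return _TRIPLET.sub(replace, component)


def same_component(a, b):
    """Compare two URL components after the normalization above."""
    return decode_unreserved(a) == decode_unreserved(b)
```

Under these assumptions, same_component("%66oo", "foo") holds, while
"foo%20bar" and "foo bar" still differ, because a space is not part of
the RFC 3986 <unreserved> set; treating them as equal would need the
wider decoding discussed in the quoted message.
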
Best regards,
Alex.