[whatwg] Web Addresses vs Legacy Extended IRI (again)

Sun Mar 29 05:37:19 PDT 2009

(In this email I will use URL5 as a short for Web Addresses, as that
previously was the URL part of HTML5)
As subject says, this is the continuation of the thread about LEIRI vs
URL5 archived at
<http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-March/018929.html>,
where discussion diverged to "good" vs "bad" standard and the adoption
of URL5 in other Internet-related technologies.
In this email I want to talk only about technical differences in the
processing requirements of URL5 and LEIRI.
Ian Hickson as repeatdly said that URL5 and (LE)IRI are different in
the processing model, last time in
<http://lists.w3.org/Archives/Public/public-html/2009Mar/0693.html>,
adding that the URL5 model is the one used by current applications.
I'm not sure about the last part of his sentence, but this is outside
the scope of this thread.

The current status is:
- RFC3986 to define URIs, their validity and their processing
- RFC3987 to define IRIs, their validity, their processing and their
conversion to URIs
- the IRI-bis draft at
<http://www.w3.org/International/iri-edit/draft-duerst-iri-bis.html>
to define LEIRIs and their conversion to IRIs
- the URL5 document, to define Web Addresses and their conversion to URIs

Let's see if we can find some differences in those documents, that
really need a different technology.
Well, hypertext locations, either URIs, IRIs or URL5s, are sequences
of characters, so the difference must be in the handling of those
characters.
Reasons are taken from the IRI-bis draft and from the URI RFC. Note
that invalidity for URL5 does not mean parse error.

= U+0000 - U+001F: Unicode control C0:
- in a URI: invalid, must be percent-encoded. Processing: stop
- in a IRI: invalid, must be percent-encoded. Processing: stop
Reason: "There is no way to transmit these characters reliably except
potentially in electronic form. Even when in electronic form, some
software components might silently filter out some of these
characters, or may stop processing alltogether when encountering some
of them. These characters may affect text display in subtle,
unnoticable ways or in drastic, global, and irreversible ways
depending on the hardware and software involved."
- in a LEIRI: valid. Processing to IRI: percent-encode
- in a URL5: invalid. Processing to URI: percent-encode

= " " U+0020: Space
- in a URI: invalid, must be percent-encoded. Processing: stop
- in a IRI: invalid, must be percent-encoded. Processing: stop
Reason: "Some formats and applications use space as a delimiter, e.g.
for items in a list"
- in a LEIRI: valid. Processing to IRI: percent-encode
- in a URL5: invalid. Processing to URI: percent-encode

= "<" U+003C, ">" U+003E, '"' U+0022, "\" U+005C, "^" (U+005E), "`"
(U+0060), "{" (U+007B), "|" (U+007C), and "}" (U+007D): Delimiters and
Unwise characters
- in a URI: invalid, must be percent-encoded. Processing: stop
- in a IRI: invalid, must be percent-encoded. Processing: stop
Reason: "Appendix C of [RFC3986] suggests the use of double-quotes
("http://example.com/") and angle brackets (<http://example.com/>) as
delimiters for URIs in plain text." and "Also, "the fact that these
characters are not used in URIs or IRIs has encouraged their use
outside URIs or IRIs in contexts that may include URIs or IRIs."
- in a LEIRI: valid. Processing to IRI: percent-encode
- in a URL5: invalid. Processing to URI: percent-encode
Please note also that all references in this email are delimited by "<" and ">"

= "%" U+0025: Percent sign:
- in a URI: valid if followed by two characters in range [A-Fa-f0-9]
(hexadecimal digit). Processing: emit a percent-encoding token
- in a IRI: valid if followed by two characters in range [A-Fa-f0-9]
(hexadecimal digit). Processing: percent-decode if the char is allowed
without percent-encoding, else emit a percent-econding token.
Processing to URI: none
- in a LEIRI: valid if followed by two characters in range [A-Fa-f0-9]
(hexadecimal digit). Processing to IRI: none.
- in a URL5: valid if followed by two characters in range [A-Fa-f0-9]
(hexadecimal digit). Processing to URI: percent-encode

= ":" , "/" , "?" , "#" , "[" , "]" , "@", "!" , "$" , "&" , "'" , "("
, ")" , "*" , "+" , "," , ";" , "=": Delimiters allowed in URIs
- in a URI: valid but have special meaning, else must be
percent-encoded. Processing: depends on scheme-specific syntax.
- in a IRI: valid but have special meaning, else must be
percent-encoded. Processing: depends on scheme-specific syntax.
- in a LEIRI: valid but have special meaning, else must be
percent-encoded. Processing to IRI: none
- in a URL5: valid but have special meaning, *cannot be
percent-encoded*. Processing to URI: "]" , "[" are automatically
percent-encoded after the host part, the rest is leaved as-is. "#" is
automatically percent-encoded in the fragment identifier.

= U+00A0-U+D7FF , U+F900-FDCF , U+FDF0-FFEF : Non-ASCII Unicode
- in a URI: invalid, must be percent-encoded. Processing: stop
- in a IRI: valid. Processing: none
- in a LEIRI: valid. Processing to IRI: none
- in a URL5: valid. Processing to URI: percent-encode

= U+200E, U+200F, U+202A-202E, U+FFF0-FFFD, U+E000-F8FF,
U+F0000-FFFFD, U+100000-10FFFD, U+E0000-E0FFF: Special, Bidi, non
chars, etc.
- in a URI: invalid, must be percent-encoded. Processing: stop
- in a IRI: invalid, must be percent-encoded. Processing: stop
Reason: "These code points provide functionality beyond that useful in
a (Legacy Extended) IRI"
- in a LEIRI: valid. Processing to IRI: percent-encode
- in a URL5: valid. Processing to URI: percent-encode

=  U+D800-U+DFFF: Surrogate code units
- in a URI: invalid. Processing: stop
- in a IRI: invalid. Processing: stop
- in a LEIRI: invalid. Processing to IRI: stop
Reason: "These do not represent Unicode codepoints"
- in a URL5: invalid. Processing to URI: percent-encode

Summing up, the differences between URL5 and LEIRI are only about the
percent sign and its uses for delimiters.

The fact that "%" is automatically converted to "%25" means that
authors can no more use percent-encoding to allow transmission of
those chars as plain data. Please note that, even if sub-delims are
allowed non encoded, they may have special meaning in a scheme
specific syntax.
One example is "&", which is allowed in URIs, but has a special
meaning in the query-part of HTTP URIs. How can UAs send forms with
"&" in value without causing security problems on the receiving
server? The same for "=", "/" , "?": how can I transmit those chars?
It is forbidden to ask questions in GET forms?

Or on the other side: do I need to percent-decode twice on the
receiving server? What about backward-compatibility with existing
server-side applications that expect to percent-encode just once?

Giovanni