[whatwg] IRIs vs. URIs
L. David Baron
dbaron at dbaron.org
Tue Mar 13 16:29:52 PDT 2007
# For readability, the term URI is used to refer to both ASCII
# URIs and Unicode IRIs, as those terms are defined by [RFC3986]
# and [RFC3987] respectively. On the rare occasions where IRIs
# are not allowed but ASCII URIs are, this is called out
This is rather misleading, since backwards-compatible use of URIs is
not ASCII-only. While IRIs are a superset of conformant URIs, IRIs
are a subset of real-world URIs, since they have the encoding fixed
to UTF-8. Backwards-compatible URI handling tries to send the same
sequence of bytes that was in the document back to the server,
percent-encoded byte-by-byte, by encoding the URI based on the
encoding of the document.
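As a rough illustration of the difference (a sketch in Python, not taken
from any browser's code; the helper name percent_encode, the sample path,
and the choice of ISO-8859-1 as the document encoding are all made up for
the example), the same non-ASCII character ends up as different bytes on
the wire depending on which encoding is applied before percent-encoding:

```python
from urllib.parse import quote


def percent_encode(path: str, encoding: str) -> str:
    """Encode `path` to bytes using `encoding`, then percent-encode byte-by-byte.

    Hypothetical helper for illustration only.
    """
    return quote(path, safe="/", encoding=encoding)


path = "/caf\u00e9"  # "/café", as it might appear in a document's markup

# Backwards-compatible behavior: use the encoding of the document
# (assumed here to be ISO-8859-1), so the server sees the document's bytes.
legacy = percent_encode(path, "iso-8859-1")  # "/caf%E9"

# IRI behavior per RFC 3987: the encoding is fixed to UTF-8.
iri = percent_encode(path, "utf-8")  # "/caf%C3%A9"
```

A server that stored the resource under the ISO-8859-1 bytes would match
only the first form, which is why the two behaviors are not interchangeable.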
I tend to think it would be good for new uses of URIs/IRIs to document
that they are really IRIs, and that this reverse-encoding behavior
should therefore not be used; instead, encoding should be done as
The repeated language in the spec that something is a URI or IRI
doesn't make sense -- it really does need to be one or the other.
(In Mozilla's codebase such distinctions are easy to implement since
we have to pass along the encoding of the document every time we
create a URI in order to get this backwards-compatible behavior.
Failing to do so makes the code use UTF-8, which means, I think,
that it's an IRI. At least, it's easy to implement if the things
that are URIs and the things that are IRIs go through the same
It would probably be good if the spec documented how the encoding
issues in URIs are actually handled.
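To make the "pass along the document encoding, or fall back to UTF-8" idea
above concrete, here is a minimal sketch of a single code path serving both
cases. It is not Mozilla's actual API: the function name encode_reference,
its parameter, and the example.org URL are assumptions for illustration,
and error handling for characters the document encoding cannot represent
is deliberately left out.

```python
from typing import Optional
from urllib.parse import quote

# Characters left untouched: URI delimiters plus "%" so that existing
# percent-escapes are not encoded a second time.
_SAFE = "/:?#[]@!$&'()*+,;=%"


def encode_reference(ref: str, document_encoding: Optional[str] = None) -> str:
    """Percent-encode non-ASCII characters in `ref`.

    Hypothetical helper. If the caller supplies the document's encoding,
    the legacy (backwards-compatible) behavior is used; if it is omitted,
    UTF-8 is used, which amounts to treating `ref` as an IRI.
    """
    encoding = document_encoding or "utf-8"
    return quote(ref, safe=_SAFE, encoding=encoding)


# Treated as an IRI: no document encoding passed.
as_iri = encode_reference("http://example.org/caf\u00e9")

# Treated as a legacy URI found in an ISO-8859-1 document.
as_legacy = encode_reference("http://example.org/caf\u00e9", "iso-8859-1")
```

With this shape, whether something is a URI or an IRI is decided entirely
by whether the caller passes the document encoding.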
(My understanding of this stuff may be a bit off, although this also
isn't the clearest explanation I could make of what I do know about
L. David Baron <URL: http://dbaron.org/ >
Technical Lead, Layout & CSS, Mozilla Corporation