[whatwg] IRIs vs. URIs

Wed Mar 14 11:57:47 PDT 2007

On Wednesday 2007-03-14 15:20 +0100, Peter Karlsson wrote:
> L. David Baron on 2007-03-13:
> 
> >I tend to think it would be good that new uses of URIs/IRIs document that 
> >they are really IRIs and therefore this reverse-encoding behavior should 
> >not be used, but instead encoding should be done as UTF-8.
> 
> You cannot have UTF-8 encoding just for the URIs/IRIs, and another encoding 
> for the rest of the source text. To properly parse a URI/IRI in the source 
> document, you must first convert the bytes in the resource into a stream of 
> Unicode characters.

No, it's the *encoding* (conversion from characters to bytes) that
should be done as UTF-8, not the *decoding* (conversion from bytes
to characters).  The URIs within the document are decoded along with
the rest of the document.  But to send them back to the server they
need to be encoded (converted from characters back to bytes) and
then percent-escaped.
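
To make the two directions concrete, here is a minimal Python sketch. The
sample markup, the ISO-8859-1 document encoding, and the variable names are
all made up for illustration; it is not how any particular browser does it.

```python
from urllib.parse import quote

# Hypothetical document bytes, served as ISO-8859-1 (0xE9 is "é").
document_bytes = b'<a href="/caf\xe9">menu</a>'

# Decoding: the whole document, URIs included, is converted from bytes
# to characters using the document's own encoding.
text = document_bytes.decode("iso-8859-1")
path = text.split('"')[1]  # crude attribute extraction, for illustration only

# Encoding: to send the reference back to the server, convert the
# characters to bytes as UTF-8 and then percent-escape those bytes.
escaped = quote(path.encode("utf-8"), safe="/")
print(escaped)  # /caf%C3%A9
```

The decode step depends on the document; the encode step here does not.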

If we say they're IRIs, then the encoding step always encodes to
UTF-8.  But the traditional behavior for URIs has been to encode
using the encoding of the containing resource, which requires
tracking, for every URI, the encoding of the document, style sheet,
or script that contained it.  (I don't think Mozilla does this for
scripts, but we do for style sheets and documents.)
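
The bookkeeping difference might be sketched like this in Python. The
function names `escape_as_iri` and `escape_legacy` and the sample references
are hypothetical, chosen only to show that the traditional behavior needs an
extra piece of state per URI.

```python
from urllib.parse import quote

SAFE = "/:?=&#"  # assumption: delimiters left unescaped in this sketch


def escape_as_iri(ref: str) -> str:
    """IRI behavior: always encode to UTF-8 before percent-escaping."""
    return quote(ref, safe=SAFE, encoding="utf-8")


def escape_legacy(ref: str, source_encoding: str) -> str:
    """Traditional behavior: encode with the containing resource's encoding,
    so the caller must remember that encoding for every reference."""
    return quote(ref, safe=SAFE, encoding=source_encoding)


# The same characters, found in resources with different encodings.
refs = [("/caf\u00e9", "iso-8859-1"), ("/caf\u00e9", "utf-8")]
for ref, source_encoding in refs:
    print(source_encoding, escape_legacy(ref, source_encoding), escape_as_iri(ref))
```

With the legacy rule the same reference escapes to `/caf%E9` or `/caf%C3%A9`
depending on where it was found; with the IRI rule it is always the latter.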

-David

-- 
L. David Baron                                <URL: http://dbaron.org/ >
           Technical Lead, Layout & CSS, Mozilla Corporation