[whatwg] IRIs vs. URIs

Peter Karlsson peter at opera.com
Thu Mar 15 01:17:26 PDT 2007


L. David Baron on 2007-03-14:

> No, it's the *encoding* (conversion from characters to bytes) that should 
> be done as UTF-8, not the *decoding* (conversion from bytes to 
> characters).

Right. Sorry for the confusion. Perhaps it would be good to specify it, but 
it needs to be specified in a way that is compatible with the current web, 
although I am not sure if there is *one* specification that can cover all 
the current cases.


Bjoern Hoehrmann on 2007-03-14:

> The traditional behavior of Internet Explorer on Western versions of 
> Windows for western web sites using western encodings has been, since the 
> release of IE5b2 I think, to encode the path using UTF-8 and the query 
> string using the document encoding, depending on the send-urls-as-utf-8
> setting. For example,
>
>  <a href='Björn.html'>...</a>
>
> in an ISO-8859-1 encoded document would result in a request for
>
>  ... Bj%C3%B6rn.html ...

Yes, I remember the havoc that created on our university computer club's
server when that version was released, as we had used quite a lot of
Swedish characters in our URLs (which had worked fine from most Windows
and Unix browsers, until IE5). As you note, the default setting differs
between locales; I believe the East Asian locales have different defaults.
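
To make the difference concrete, here is a minimal Python sketch of the two
requests a server could see for Bjoern's example. The helper name
`encode_path` is made up for illustration, and percent-escaping the legacy
bytes is an assumption made only so the result is printable; it does not
reproduce any particular browser's code.

```python
from urllib.parse import quote


def encode_path(path: str, encoding: str) -> str:
    """Percent-encode the non-ASCII characters of a path in the given charset."""
    # quote() first encodes the string to bytes with `encoding`, then escapes
    # every byte that is not an unreserved ASCII character or listed in `safe`.
    return quote(path, safe="/", encoding=encoding)


# The UTF-8 form described in the quoted message.
utf8_request = encode_path("Björn.html", "utf-8")
# The same name encoded with the document encoding (ISO-8859-1) instead.
legacy_request = encode_path("Björn.html", "iso-8859-1")

print(utf8_request)    # Bj%C3%B6rn.html
print(legacy_request)  # Bj%F6rn.html
```

A server whose file names are stored in ISO-8859-1 matches only the second
form, which is why the switch to UTF-8 broke existing links.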

> Opera, at least for a considerable amount of time, used UTF-8 for the 
> whole reference, I think independently of encodings, domains, and other 
> environment variables.

There is a setting in Opera to control the behaviour. Either we use UTF-8 
for all URLs (the default), or we use the document encoding. The latter 
setting is popular in some locales, especially Russia (IIRC).


L. David Baron on 2007-03-14:

> I believe Mozilla's behavior is the way it is in order to be compatible 
> enough with IE's behavior to be usable when browsing East Asian Web sites.

I think that there is a spec somewhere (IRI spec?) that states that one
should first try the URL encoded in UTF-8 and, if that fails, try it again
in the locale encoding. To my knowledge, there are no browsers that do that.
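
A rough Python sketch of what such a "try UTF-8, then fall back" strategy
could look like. Everything here is hypothetical: the function name
`fetch_with_fallback`, the `fetch` callback (which stands in for an HTTP
request and returns None on failure), and the choice of what counts as
"fails" are all assumptions, not taken from any spec or browser.

```python
from typing import Callable, Optional
from urllib.parse import quote


def fetch_with_fallback(
    path: str,
    fetch: Callable[[str], Optional[str]],
    fallback_encoding: str,
) -> Optional[str]:
    """Request `path` encoded as UTF-8 first, then in the fallback encoding."""
    for encoding in ("utf-8", fallback_encoding):
        try:
            candidate = quote(path, safe="/", encoding=encoding)
        except UnicodeEncodeError:
            # The path has characters the fallback charset cannot represent.
            continue
        body = fetch(candidate)
        if body is not None:
            return body
    return None


# A pretend server that only knows the ISO-8859-1 form of the file name.
server = {"/Bj%F6rn.html": "hello"}
print(fetch_with_fallback("/Björn.html", server.get, "iso-8859-1"))  # hello
```

The obvious cost is a second round trip for every miss, which may be one
reason nobody does it.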


Also, one needs to consider that there are different parts of the URL that
need to be encoded differently. Even if you UTF-8 encode the path
component, the query part needs to be given some special care and attention. 
In a reference such as

   <a href="http://www.räksmörgås.se/räksmörgås?q=räksmörgås">

there are three components that need to be encoded differently. I can't
remember how many test cases I have for such URLs; a new one is added
whenever a variant that isn't working is discovered, to make sure that as
many of the combinations we can think of as possible will work.
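
As an illustration, here is a Python sketch of one way to treat the three
components of the reference above. It assumes the host name is converted
with IDNA, the path is percent-encoded as UTF-8, and the query is
percent-encoded in the document encoding; that split is only one reading of
the behaviours discussed in this thread, and the function name
`encode_reference` is made up. Ports and user info are ignored to keep it
short.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_reference(url: str, document_encoding: str) -> str:
    """Encode host, path and query of a URL, each with its own rule."""
    parts = urlsplit(url)
    # Host name: IDNA (ACE / "xn--" form), using Python's built-in codec.
    host = parts.hostname.encode("idna").decode("ascii")
    # Path: percent-encoded UTF-8 bytes.
    path = quote(parts.path, safe="/", encoding="utf-8")
    # Query: percent-encoded bytes in the document's own encoding.
    query = quote(parts.query, safe="=&", encoding=document_encoding)
    return urlunsplit((parts.scheme, host, path, query, parts.fragment))


print(encode_reference(
    "http://www.räksmörgås.se/räksmörgås?q=räksmörgås", "iso-8859-1"))
```

The same word ends up in three different byte forms in one request, which
is what makes these URLs such good test cases.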

Having a real specification of this would help quite a lot... :-)

-- 
\\//
Peter, software engineer, Opera Software

  The opinions expressed are my own, and not those of my employer.
  Please reply only by follow-ups on the mailing list.