[whatwg] IRIs vs. URIs
Peter Karlsson
peter at opera.com
Thu Mar 15 01:17:26 PDT 2007
L. David Baron on 2007-03-14:
> No, it's the *encoding* (conversion from characters to bytes) that should
> be done as UTF-8, not the *decoding* (conversion from bytes to
> characters).
Right. Sorry for the confusion. Perhaps it would be good to specify it, but
it needs to be specified in a way that is compatible with the current web,
although I am not sure if there is *one* specification that can cover all
the current cases.
Bjoern Hoehrmann on 2007-03-14:
> The traditional behavior of Internet Explorer on Western versions of
> Windows for western web sites using western encodings has been, since the
> release of IE5b2 I think, to encode the path using UTF-8 and the query
> string using the document encoding, depending on the send-urls-as-utf-8
> setting. For example,
>
> <a href='Björn.html'>...</a>
>
> in an ISO-8859-1 encoded document would result in a request for
>
> ... Bj%C3%B6rn.html ...
Yes, I remember the havoc that created on our university computer club's
server when that version was released, as we had used quite a lot of
Swedish characters in our URLs (which had worked fine from most Windows
and Unix browsers, until IE5). As you note, the default setting differs
between locales; I believe the East Asian locales have different defaults.
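
To make the difference concrete, here is a small Python sketch (an
illustration only, not browser code) of the two ways the path from Bjoern's
example can end up on the wire. It uses only urllib.parse.quote from the
standard library; the variable names are made up for this example.

```python
from urllib.parse import quote

# The file name from the example link in an ISO-8859-1 encoded document.
name = "Björn.html"

# Path percent-encoded from its UTF-8 bytes (what IE5 and later sends).
utf8_path = quote(name, encoding="utf-8")

# Path percent-encoded from the document encoding's bytes instead.
latin1_path = quote(name, encoding="iso-8859-1")

print(utf8_path)    # Bj%C3%B6rn.html
print(latin1_path)  # Bj%F6rn.html
```

A server that stores the file under its ISO-8859-1 name only matches the
second form, which is the kind of breakage described above.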
> Opera, at least for a considerable amount of time, used UTF-8 for the
> whole reference, I think independently of encodings, domains, and other
> environment variables.
There is a setting in Opera to control the behaviour. Either we use UTF-8
for all URLs (the default), or we use the document encoding. The latter
setting is popular in some locales, especially Russia (IIRC).
L. David Baron on 2007-03-14:
> I believe Mozilla's behavior is the way it is in order to be compatible
> enough with IE's behavior to be usable when browsing East Asian Web sites.
I think that there is a spec somewhere (IRI spec?) that states that one
should try the URL both encoded in UTF-8 and, if that fails, in the locale
encoding. To my knowledge, there are no browsers that do that.
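
A minimal sketch of that "UTF-8 first, then the locale encoding" idea, in
Python. This is only my reading of the approach, not text from any spec and
not what any browser does. The names candidate_paths and first_working are
hypothetical, and exists stands in for whatever actually performs the
request and reports success.

```python
from urllib.parse import quote


def candidate_paths(path, fallback_encoding):
    """Return the percent-encoded forms to try, UTF-8 first."""
    candidates = [quote(path, encoding="utf-8")]
    try:
        legacy = quote(path, encoding=fallback_encoding, errors="strict")
    except UnicodeEncodeError:
        # The path cannot be represented in the fallback encoding at all.
        return candidates
    if legacy not in candidates:
        # Pure-ASCII paths encode identically, so avoid a duplicate request.
        candidates.append(legacy)
    return candidates


def first_working(path, fallback_encoding, exists):
    """Return the first candidate that `exists` accepts, or None.

    `exists` is a hypothetical callable taking an encoded path and
    returning True if the server has that resource.
    """
    for candidate in candidate_paths(path, fallback_encoding):
        if exists(candidate):
            return candidate
    return None
```

One obvious cost is a second request for every miss, which may be part of
why nobody seems to do it.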
Also, one needs to consider that there are different parts of the URL that
need to be encoded differently. Even if you UTF-8 encode the path
component, the query part needs to be given some special care and attention.
In a reference such as
<a href="http://www.räksmörgås.se/räksmörgås?q=räksmörgås">
there are three components that need to be encoded differently: the host
name, the path, and the query. I can't remember how many test cases I have
for such URLs, to make sure that as many combinations as we can think of
still work whenever a new non-working variant is discovered.
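
As a rough sketch of one such combination, in Python: the host name goes
through IDNA, the path is percent-encoded as UTF-8, and the query is
percent-encoded in the document encoding. That particular split is an
assumption modelled on the IE behaviour Bjoern described, not a specified
algorithm. The function name encode_reference is hypothetical, and the host
step uses Python's built-in "idna" codec (IDNA 2003).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_reference(url, document_encoding):
    """Encode host, path and query of `url`, each in its own way.

    Simplified on purpose: port and userinfo are dropped, the fragment is
    passed through untouched, and existing %XX escapes are not preserved.
    """
    parts = urlsplit(url)
    # Host name: IDNA (Punycode) per label.
    host = parts.hostname.encode("idna").decode("ascii")
    # Path: percent-encode the UTF-8 bytes, keeping "/" as a separator.
    path = quote(parts.path, safe="/", encoding="utf-8")
    # Query: percent-encode the document encoding's bytes, keeping "=" and "&".
    query = quote(parts.query, safe="=&", encoding=document_encoding)
    return urlunsplit((parts.scheme, host, path, query, parts.fragment))


encoded = encode_reference(
    "http://www.räksmörgås.se/räksmörgås?q=räksmörgås", "iso-8859-1"
)
print(encoded)
```

With "iso-8859-1" as the document encoding, the same word comes out in
three different forms: an xn-- label in the host, %C3%A4-style escapes in
the path, and %E4-style escapes in the query.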
Having a real specification of this would help quite a lot... :-)
--
\\//
Peter, software engineer, Opera Software
The opinions expressed are my own, and not those of my employer.
Please reply only by follow-ups on the mailing list.