<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body bgcolor="#ffffff" text="#000000">
On 7/24/2010 2:02 PM, Boris Zbarsky wrote:
<blockquote id="mid_4C4A81FF_4060204_mit_edu"
cite="mid:4C4A81FF.4060204@mit.edu" type="cite">On 7/24/10 1:50
AM, Brett Zamir wrote:
<br>
<blockquote id="StationeryCiteGenerated_2" type="cite">
<blockquote id="StationeryCiteGenerated_3" type="cite">I would
be particularly interested in data on this last, across
<br>
different browsers, operating systems, and locales... There
seem to be
<br>
servers out there expecting their URIs in UTF-8 and others
expecting
<br>
them in ISO-8859-1, and it's not clear to me how to make
things work
<br>
with them all.
<br>
</blockquote>
<br>
Seems to me that if they are not in UTF-8, they should be
treated as
<br>
bugs, even if that is not a de jure standard.
<br>
</blockquote>
<br>
Treated as bugs by whom?
<br>
<br>
</blockquote>
By the servers and server-side scripting languages. While it is great
that the browsers are involved in the process, I think it would be
reasonable to invite the other stakeholders to join the discussions.<br>
<blockquote id="mid_4C4A81FF_4060204_mit_edu"
cite="mid:4C4A81FF.4060204@mit.edu" type="cite">The scenario is
that a user types some non-ASCII text in the url bar. This needs
to be url-encoded to actually go on the wire, which raises the
question of what encoding. If the user is using IRIs, the answer
is UTF-8. A number of servers barf if you do this, especially
because some server-side scripting languages (PHP, e.g., last I
checked) default to URI-unescaping via something other than UTF-8.
<br>
<br>
</blockquote>
Hopefully to be fixed in PHP6 with its promise of full Unicode
support... <br>
<br>
Though per <a class="moz-txt-link-freetext" href="http://www.slideshare.net/kfish/unicode-php6-presentation">http://www.slideshare.net/kfish/unicode-php6-presentation</a>:<br>
<br>
<b>Slide 34: </b>Conversions &amp; Encoding
“HTTP Input Encoding”
<br>
With Unicode semantics switch enabled, we need to convert HTTP input
to Unicode <br>
GET requests have no encoding at all and POST ones rarely come
marked with the encoding<br>
Encoding detection is not reliable<br>
<b>Correctly decoding HTTP input is somewhat of an unsolved problem</b><br>
<br>
<b>Slide 35: </b>Conversions &amp; Encoding
“HTTP Input Encoding”
<br>
PHP will perform lazy decoding <br>
Delays decoding data in $_GET, $_POST, and $_REQUEST until the first
time you access them <br>
Allows user to set expected encoding or just rely on a default one <br>
Allows decoding errors to be handled by the same mechanism <br>
Applications should also use filter extension to filter incoming data<br>
<br>
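To make the problem on those slides concrete, here is a minimal Python
sketch (not PHP's actual mechanism; the helper name
<code>decode_query_value</code> and its <code>expected</code> parameter
are hypothetical). It shows the same character going on the wire in two
different forms, and a server-side decoder that tries strict UTF-8 first
and then falls back to an expected legacy encoding.<br>
<br>
<pre>
```python
from urllib.parse import quote, unquote, unquote_to_bytes

E_ACUTE = "\u00e9"  # the character e-acute, used as sample non-ASCII input


def decode_query_value(raw, expected="iso-8859-1"):
    """Percent-decode raw; try strict UTF-8, then the expected fallback.

    Hypothetical helper: 'expected' stands in for the encoding the
    application says it expects when the input is not valid UTF-8.
    """
    data = unquote_to_bytes(raw)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(expected)


# One character, two possible wire forms.
utf8_form = quote(E_ACUTE, encoding="utf-8")         # "%C3%A9"
latin1_form = quote(E_ACUTE, encoding="iso-8859-1")  # "%E9"

# Both forms decode to the same character with the fallback strategy.
from_utf8 = decode_query_value(utf8_form)
from_latin1 = decode_query_value(latin1_form)

# The UTF-8 bytes are also valid ISO-8859-1, so a server that assumes
# ISO-8859-1 silently produces two wrong characters instead of an error.
misread = unquote(utf8_form, encoding="iso-8859-1")
```
</pre>
The fallback is only a heuristic: any byte sequence decodes without
error as ISO-8859-1, and some ISO-8859-1 input happens to be valid
UTF-8, so neither guess can be confirmed from the bytes alone. That is
the slides' point about detection not being reliable.<br>
<br>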
<blockquote id="mid_4C4A81FF_4060204_mit_edu"
cite="mid:4C4A81FF.4060204@mit.edu" type="cite">So some browsers
encode the non-query part of the URI as UTF-8 and the query part
as ... something (user's default filesystem encoding, say, for
lack of a better guess). Others always use UTF-8 (and end up with
some servers not usable). Others... I have no idea. That's why I
want data. ;) In particular, while the "just use UTF-8, and if
the user can't access the site, sucks to be the user" approach has
a certain theoretical-purity appeal, it doesn't seem like
something I want to do to my friends and family (always a good
criterion for things you'd like to do to users).
<br>
<br>
</blockquote>
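For reference, the split behaviour described above (path as UTF-8,
query as something else) can be sketched roughly as follows in Python.
This is an illustration under assumptions, not any browser's actual
code: the function name <code>build_request_target</code> and its
<code>query_encoding</code> parameter are hypothetical.<br>
<br>
<pre>
```python
from urllib.parse import quote


def build_request_target(path, query, query_encoding="utf-8"):
    """Percent-encode a path as UTF-8 and a query in query_encoding.

    Hypothetical model of a browser that always uses UTF-8 for the
    path but picks another encoding (e.g. a locale default) for the
    query part.
    """
    encoded_path = quote(path, safe="/", encoding="utf-8")
    # errors="strict" raises UnicodeEncodeError if the query contains a
    # character that query_encoding cannot represent.
    encoded_query = quote(query, safe="=", encoding=query_encoding,
                          errors="strict")
    return encoded_path + "?" + encoded_query


# Same user input, two different requests on the wire.
mixed = build_request_target("/caf\u00e9", "q=caf\u00e9", "iso-8859-1")
all_utf8 = build_request_target("/caf\u00e9", "q=caf\u00e9")
```
</pre>
Here <code>mixed</code> is <code>/caf%C3%A9?q=caf%E9</code> and
<code>all_utf8</code> is <code>/caf%C3%A9?q=caf%C3%A9</code>, so a
server has to know (or guess) which convention the client followed.<br>
<br>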
What I meant is that we should try to get the server-side systems on
board to fix the issue, including over the long term. I appreciate you
all being admirably practical champions of present-day compatibility,
though I'd hope there is also a vision for making things work better
in the future, even if that means some inevitable growing pains for a
subset of users (the current lack of standardization no doubt already
creates pains for another subset).<br>
<br>
Brett<br>
<br>
</body>
</html>