<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=UTF-8">

  </head>

  <body bgcolor="#ffffff" text="#000000">

    On 7/24/2010 2:02 PM, Boris Zbarsky wrote:

    <blockquote id="mid_4C4A81FF_4060204_mit_edu"

      cite="mid:4C4A81FF.4060204@mit.edu" type="cite">On 7/24/10 1:50

      AM, Brett Zamir wrote:

      <br>

      <blockquote id="StationeryCiteGenerated_2" type="cite">

        <blockquote id="StationeryCiteGenerated_3" type="cite">I would

          be particularly interested in data on this last, across

          <br>

          different browsers, operating systems, and locales... There

          seem to be

          <br>

          servers out there expecting their URIs in UTF-8 and others

          expecting

          <br>

          them in ISO-8859-1, and it's not clear to me how to make

          things work

          <br>

          with them all.

          <br>

        </blockquote>

        <br>

        Seems to me that if they are not in UTF-8, they should be

        treated as

        <br>

        bugs, even if that is not a de jure standard.

        <br>

      </blockquote>

      <br>

      Treated as bugs by whom?

      <br>

      <br>

    </blockquote>

    By the servers and server-side scripting languages. While it is great that the

    browsers are involved in the process, I think it would be reasonable

    to invite the other stakeholders to join the discussion.<br>

    <blockquote id="mid_4C4A81FF_4060204_mit_edu"

      cite="mid:4C4A81FF.4060204@mit.edu" type="cite">The scenario is

      that a user types some non-ASCII text in the url bar. This needs

      to be url-encoded to actually go on the wire, which raises the

      question of what encoding.  If the user is using IRIs, the answer

      is UTF-8.  A number of servers barf if you do this, especially

      because some server-side scripting languages (PHP, e.g., last I

      checked) default to URI-unescaping via something other than UTF-8.

      <br>

      <br>

    </blockquote>
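    To make the mismatch concrete, here is a minimal Python sketch (Python only for brevity; the sample word is arbitrary and nothing here comes from an actual browser or server). It percent-encodes the same text once as UTF-8 and once as ISO-8859-1, then unescapes each result with the other encoding, roughly what a server does when it guesses wrong.<br>

```python
from urllib.parse import quote, unquote

# Arbitrary sample text: "cafe" with an acute accent on the final letter.
word = "caf\u00e9"

# The same character goes on the wire as different bytes per encoding.
as_utf8 = quote(word, encoding="utf-8")
as_latin1 = quote(word, encoding="iso-8859-1")
print(as_utf8)    # caf%C3%A9
print(as_latin1)  # caf%E9

# A server that unescapes with the "other" encoding recovers the wrong text.
utf8_read_as_latin1 = unquote(as_utf8, encoding="iso-8859-1")
latin1_read_as_utf8 = unquote(as_latin1, encoding="utf-8")  # errors="replace" by default
print(repr(utf8_read_as_latin1))  # 'caf\xc3\xa9' (two characters instead of one)
print(repr(latin1_read_as_utf8))  # 'caf\ufffd' (replacement character)
```

    Neither wrong guess fails outright in this sketch; both quietly produce different text, which is part of why the problem is hard to notice on the server side.<br>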

    Hopefully to be fixed in PHP6 with its promise of full Unicode

    support... <br>

    <br>

    Though per <a class="moz-txt-link-freetext" href="http://www.slideshare.net/kfish/unicode-php6-presentation">http://www.slideshare.net/kfish/unicode-php6-presentation</a>

    :<br>

    <br>

    <b>Slide 34: </b>Conversions &amp; Encoding

    “HTTP Input Encoding”

    <br>

    With Unicode semantics switch enabled, we need to convert HTTP input

    to Unicode <br>

    GET requests have no encoding at all and POST ones rarely come

    marked with the encoding<br>

    Encoding detection is not reliable<br>

    <b>Correctly decoding HTTP input is somewhat of an unsolved problem</b><br>

    <br>

    <b>Slide 35: </b>Conversions &amp; Encoding

    “HTTP Input Encoding”

    <br>

    PHP will perform lazy decoding <br>

    Delays decoding data in $_GET, $_POST, and $_REQUEST until the first

    time you access them <br>

    Allows user to set expected encoding or just rely on a default one <br>

    Allows decoding errors to be handled by the same mechanism <br>

    Applications should also use filter extension to filter incoming data<br>
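    <br>
    As a rough illustration of the lazy-decoding idea on slide 35, here is a sketch in Python rather than PHP. The class name LazyQuery and its attributes are hypothetical and are not the PHP6 API; the sketch only assumes that raw input is kept undecoded until first access, that the expected encoding can be set beforehand, and that a decoding failure surfaces as an ordinary error.<br>

```python
from urllib.parse import parse_qs


class LazyQuery:
    """Hypothetical stand-in for a lazily decoded $_GET (not the PHP6 API)."""

    def __init__(self, raw_query, encoding="utf-8"):
        self.raw_query = raw_query  # kept exactly as it arrived on the wire
        self.encoding = encoding    # default, may be changed before first access
        self._decoded = None

    def get(self, name):
        # Decode only on first access, using whatever encoding is set by then.
        if self._decoded is None:
            self._decoded = parse_qs(
                self.raw_query, encoding=self.encoding, errors="strict"
            )
        values = self._decoded.get(name)
        return values[0] if values else None


# The application knows this form is submitted as ISO-8859-1 and says so
# before touching the data.
params = LazyQuery("q=caf%E9")
params.encoding = "iso-8859-1"
print(params.get("q"))

# With the UTF-8 default left in place, the same input fails to decode, and
# the failure can be handled like any other error.
strict = LazyQuery("q=caf%E9")
try:
    strict.get("q")
except UnicodeDecodeError:
    print("could not decode input as", strict.encoding)
```

    The point of deferring the work is that the application gets a chance to state the encoding it expects before any guess is made on its behalf.<br>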

    <br>

    <blockquote id="mid_4C4A81FF_4060204_mit_edu"

      cite="mid:4C4A81FF.4060204@mit.edu" type="cite">So some browsers

      encode the non-query part of the URI as UTF-8 and the query part

      as ... something (user's default filesystem encoding, say, for

      lack of a better guess).  Others always use UTF-8 (and end up with

      some servers not usable).  Others... I have no idea.  That's why I

      want data.  ;)  In particular, while the "just use UTF-8, and if

      the user can't access the site sucks to be the user" approach has

      a certain theoretical-purity appeal, it doesn't seem like

      something I want to do to my friends and family (always a good

      criterion for things you'd like to do to users).

      <br>

      <br>

    </blockquote>
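    The split behaviour described in the quote can be sketched as follows. This is only an illustration in Python: the function name encode_typed_url and the example.org address are made up, and ISO-8859-1 merely stands in for whatever non-UTF-8 guess a given browser might make for the query part.<br>

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_typed_url(url, query_encoding="utf-8"):
    """Percent-encode a typed URL: path always as UTF-8, query as requested."""
    parts = urlsplit(url)
    path = quote(parts.path, safe="/", encoding="utf-8")
    # Single-parameter queries only, to keep the sketch short.
    query = quote(parts.query, safe="=", encoding=query_encoding)
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


typed = "http://example.org/caf\u00e9?q=caf\u00e9"

# "Always UTF-8" behaviour.
print(encode_typed_url(typed))
# http://example.org/caf%C3%A9?q=caf%C3%A9

# UTF-8 path, but a different encoding for the query part.
print(encode_typed_url(typed, query_encoding="iso-8859-1"))
# http://example.org/caf%C3%A9?q=caf%E9
```

    The same typed address therefore reaches the server as two different byte sequences depending on the browser, and the request itself does not say which encoding was used.<br>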

    What I meant is to try to get the server systems on board to fix the

    issue, including in the long term. I appreciate you all being

    admirably practical champions of present-day compatibility, though

    I'd hope there is a vision to make things work better for the

    future, even if there will be some inevitable growing pains for a

    subset of users (as the lack of standardization no doubt creates

    pains for another subset as it is).<br>

    <br>

    Brett<br>

    <br>

  </body>

</html>