[html5] r6592 - [giow] (2) Try to make the application/x-www-form-urlencoded algorithm work even [...]

Tue Sep 27 12:07:36 PDT 2011

Author: ianh
Date: 2011-09-27 12:07:34 -0700 (Tue, 27 Sep 2011)
New Revision: 6592

Modified:
   complete.html
   index
   source
Log:
[giow] (2) Try to make the application/x-www-form-urlencoded algorithm work even for ISO-2022-JP's crazy escape schemes.
Fixing http://www.w3.org/Bugs/Public/show_bug.cgi?id=12199
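
For readers skimming the diff, the revised per-entry steps (encode to bytes first, escape byte by byte, then reinterpret as US-ASCII) can be sketched roughly as below. This is an illustrative sketch, not the spec's or any browser's implementation; the helper name urlencode_component and the SAFE_BYTES constant are made up for this note, and Python's codec names stand in for "the selected character encoding". Handling of characters that the selected encoding cannot represent is not part of this diff and is not covered.

```python
# Bytes left untouched by the revised step: * - . 0-9 A-Z _ a-z
SAFE_BYTES = frozenset(
    [0x2A, 0x2D, 0x2E, 0x5F]
    + list(range(0x30, 0x3A))
    + list(range(0x41, 0x5B))
    + list(range(0x61, 0x7B))
)


def urlencode_component(text: str, encoding: str) -> str:
    """Hypothetical helper: escape one name or value per the revised steps."""
    # Encode first, so that escape sequences such as ISO-2022-JP's
    # are just ordinary bytes by the time escaping happens.
    raw = text.encode(encoding)
    out = bytearray()
    for byte in raw:
        if byte == 0x20:
            out.append(0x2B)  # space becomes "+"
        elif byte in SAFE_BYTES:
            out.append(byte)  # left as is
        else:
            # "%" followed by two uppercase hex digits, zero-padded
            out += ("%%%02X" % byte).encode("ascii")
    # Every byte is now below 0x80, so reading it back as US-ASCII cannot fail.
    return out.decode("ascii")


print(urlencode_component("a b*c", "utf-8"))  # a+b*c
print(urlencode_component("é", "utf-8"))      # %C3%A9
```

With a stateful encoding such as ISO-2022-JP (Python codec name "iso2022_jp"), the ESC bytes that switch character sets fall into the "Otherwise" branch and come out as %1B, which is the case the old per-character wording handled poorly.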

Modified: complete.html
===================================================================

--- complete.html	2011-09-26 22:29:39 UTC (rev 6591)
+++ complete.html	2011-09-27 19:07:34 UTC (rev 6592)
@@ -239,7 +239,7 @@
 
   <header class=head id=head><p><a class=logo href=http://www.whatwg.org/><img alt=WHATWG height=101 src=/images/logo width=101></a></p>
    <hgroup><h1>Web Applications 1.0</h1>
-    <h2 class="no-num no-toc">Living Standard — Last Updated 26 September 2011</h2>
+    <h2 class="no-num no-toc">Living Standard — Last Updated 27 September 2011</h2>
    </hgroup><dl><dt>Multiple-page version:</dt>
     <dd><a href=http://www.whatwg.org/specs/web-apps/current-work/complete/>http://www.whatwg.org/specs/web-apps/current-work/complete/</a></dd>
     <dt>One-page version:</dt>
@@ -52591,6 +52591,15 @@
 
   <h5 id=url-encoded-form-data><span class=secno>4.10.22.5 </span>URL-encoded form data</h5>
 
+  <p class=note>This form data set encoding is in many ways an
+  aberrant monstrosity, the result of many years of implementation
+  accidents and compromises leading to a set of requirements necessary
+  for interoperability, but in no way representing good design
+  practices. In particular, readers are cautioned to pay close
+  attention to the twisted details involving repeated (and in some
+  cases nested) conversions between character encodings and byte
+  sequences.</p>
+
   <div class=impl>
 
   <p>The <dfn id=application/x-www-form-urlencoded-encoding-algorithm><code title="">application/x-www-form-urlencoded</code> encoding
@@ -52647,66 +52656,66 @@
 
      <li>
 
-      <p>For each character in the entry's name and value, apply the
+      <p>Encode the entry's name and value using the selected
+      character encoding. The entry's name and value are now byte
+      strings.</p>
+
+     </li>
+
+     <li>
+
+      <p>For each byte in the entry's name and value, apply the
       appropriate subsubsteps from the following list:</p>
 
-      <dl class=switch><dt>The character is a U+0020 SPACE character</dt>
+      <dl class=switch><dt>The byte is 0x20 (U+0020 SPACE if interpreted as ASCII)</dt>
 
-       <dd>Replace the character with a single U+002B PLUS SIGN
-       character (+).</dd>
+       <dd>Replace the byte with a single 0x2B byte (U+002B PLUS SIGN
+       character (+) if interpreted as ASCII).</dd>
 
 
        <!-- * - . 0-9 a-z _ A-Z -->
 
-       <dt>If the character is in the range U+002A, U+002D, U+002E,
-       U+0030 to U+0039, U+0041 to U+005A, U+005F, U+0061 to
-       U+007A</dt>
+       <dt>If the byte is in the range 0x2A, 0x2D, 0x2E, 0x30 to 0x39,
+       0x41 to 0x5A, 0x5F, 0x61 to 0x7A</dt>
 
-       <dd><p>Leave the character as is.</dd>
+       <dd><p>Leave the byte as is.</dd>
 
 
        <dt>Otherwise</dt>
 
        <dd>
 
-        <p>Replace the character with a string formed as follows:</p>
+        <ol><li><p>Let <var title="">s</var> be a string consisting of a
+         U+0025 PERCENT SIGN character (%) followed by two characters
+         in the ranges U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9)
+         and U+0041 LATIN CAPITAL LETTER A to U+0046 LATIN CAPITAL
+         LETTER F representing the hexadecimal value of the byte in
+         question (zero-padded if necessary).</li>
 
-        <ol><li><p>Let <var title="">s</var> be an empty string.</li>
+         <li><p>Encode the string <var title="">s</var> as US-ASCII,
+         so that it is now a byte string.</p>
 
-         <li>
+         <li><p>Replace the byte in question in the name or value
+         being processed by the bytes in <var title="">s</var>,
+         preserving their relative order.</li>
 
-          <p>For each byte <var title="">b</var> of the character when
-          expressed in the selected character encoding in turn, run
-          the appropriate subsubsubstep from the list below:</p>
+        </ol></dd>
 
-          <dl class=switch><dt>If the byte is in the range 0x20, 0x2A, 0x2D, 0x2E,
-           0x30 to 0x39, 0x41 to 0x5A, 0x5F, 0x61 to 0x7A</dt>
+      </dl></li>
 
-           <dd><p>Append to <var title="">s</var> the Unicode
-           character with the code point equal to the byte.</dd>
+     <li>
 
-           <dt>Otherwise</dt>
+      <p>Interpret the entry's name and value as Unicode strings
+      encoded in US-ASCII. (All of the bytes in the string will be in
+      the range 0x00 to 0x7F; the high bit will be zero throughout.)
+      The entry's name and value are now Unicode strings again.</p>
 
-           <dd><p>Append to the string a U+0025 PERCENT SIGN character
-           (%) followed by two characters in the ranges U+0030 DIGIT
-           ZERO (0) to U+0039 DIGIT NINE (9) and U+0041 LATIN CAPITAL
-           LETTER A to U+0046 LATIN CAPITAL LETTER F representing the
-           hexadecimal value of the byte (zero-padded if
-           necessary).</dd>
+     </li>
 
-          </dl></li>
+     <li><p>If the entry's name is "<code title=attr-fe-name-isindex><a href=#attr-fe-name-isindex>isindex</a></code>", its type is "<code title="">text</code>", and this is the first entry in the <var title="">form data set</var>, then append the value to <var title="">result</var> and skip the rest of the substeps for this
+     entry, moving on to the next entry, if any, or the next step in
+     the overall algorithm otherwise.</li>
 
-        </ol></dd>
-
-      </dl></li>
-
-     <li><p>If the entry's name is "<code title=attr-fe-name-isindex><a href=#attr-fe-name-isindex>isindex</a></code>",
-     its type is "<code title="">text</code>", and this is the first
-     entry in the <var title="">form data set</var>, then append the
-     value to <var title="">result</var> and skip the rest of the
-     substeps for this entry, moving on to the next entry, if any, or
-     the next step in the overall algorithm otherwise.</li>
-
      <li><p>If this is not the first entry, append a single U+0026
      AMPERSAND character (&) to <var title="">result</var>.</li>
 
@@ -52799,8 +52808,8 @@
      </li>
 
      <li><p>Convert the <var title="">name</var> and <var title="">value</var> strings to their byte representation in
-     US-ASCII (i.e. convert the Unicode string to a byte
-     string).</li>
+     ISO-8859-1 (i.e. convert the Unicode string to a byte string,
+     mapping code points to byte values directly).</li>
 
      <li><p>Add a pair consisting of <var title="">name</var> and <var title="">value</var> to <var title="">pairs</var>.</li>
 
@@ -52808,9 +52817,8 @@
 
    <li><p>If any of the name-value pairs in <var title="">pairs</var>
    have a name component consisting of the string "<code title="">_charset_</code>" encoded in US-ASCII, and the value
-   component of the first such pair is the name of a supported
-   character encoding, then let <var title="">encoding</var> be that
-   character encoding.</li>
+   component of the first such pair, when decoded as US-ASCII, is the
+   name of a supported character encoding, then let <var title="">encoding</var> be that character encoding.</li>
 
    <li><p>Convert the name and value components of each name-value
    pair in <var title="">pairs</var> to Unicode by interpreting the

Modified: index
===================================================================
--- index	2011-09-26 22:29:39 UTC (rev 6591)
+++ index	2011-09-27 19:07:34 UTC (rev 6592)
@@ -243,7 +243,7 @@
 
   <header class=head id=head><p><a class=logo href=http://www.whatwg.org/><img alt=WHATWG height=101 src=/images/logo width=101></a></p>
    <hgroup><h1 class=allcaps>HTML</h1>
-    <h2 class="no-num no-toc">Living Standard — Last Updated 26 September 2011</h2>
+    <h2 class="no-num no-toc">Living Standard — Last Updated 27 September 2011</h2>
    </hgroup><dl><dt><strong>Web developer edition</strong></dt>
     <dd><strong><a href=http://developers.whatwg.org/>http://developers.whatwg.org/</a></strong></dd>
     <dt>Multiple-page version:</dt>
@@ -52458,6 +52458,15 @@
 
   <h5 id=url-encoded-form-data><span class=secno>4.10.22.5 </span>URL-encoded form data</h5>
 
+  <p class=note>This form data set encoding is in many ways an
+  aberrant monstrosity, the result of many years of implementation
+  accidents and compromises leading to a set of requirements necessary
+  for interoperability, but in no way representing good design
+  practices. In particular, readers are cautioned to pay close
+  attention to the twisted details involving repeated (and in some
+  cases nested) conversions between character encodings and byte
+  sequences.</p>
+
   <div class=impl>
 
   <p>The <dfn id=application/x-www-form-urlencoded-encoding-algorithm><code title="">application/x-www-form-urlencoded</code> encoding
@@ -52514,66 +52523,66 @@
 
      <li>
 
-      <p>For each character in the entry's name and value, apply the
+      <p>Encode the entry's name and value using the selected
+      character encoding. The entry's name and value are now byte
+      strings.</p>
+
+     </li>
+
+     <li>
+
+      <p>For each byte in the entry's name and value, apply the
       appropriate subsubsteps from the following list:</p>
 
-      <dl class=switch><dt>The character is a U+0020 SPACE character</dt>
+      <dl class=switch><dt>The byte is 0x20 (U+0020 SPACE if interpreted as ASCII)</dt>
 
-       <dd>Replace the character with a single U+002B PLUS SIGN
-       character (+).</dd>
+       <dd>Replace the byte with a single 0x2B byte (U+002B PLUS SIGN
+       character (+) if interpreted as ASCII).</dd>
 
 
        <!-- * - . 0-9 a-z _ A-Z -->
 
-       <dt>If the character is in the range U+002A, U+002D, U+002E,
-       U+0030 to U+0039, U+0041 to U+005A, U+005F, U+0061 to
-       U+007A</dt>
+       <dt>If the byte is in the range 0x2A, 0x2D, 0x2E, 0x30 to 0x39,
+       0x41 to 0x5A, 0x5F, 0x61 to 0x7A</dt>
 
-       <dd><p>Leave the character as is.</dd>
+       <dd><p>Leave the byte as is.</dd>
 
 
        <dt>Otherwise</dt>
 
        <dd>
 
-        <p>Replace the character with a string formed as follows:</p>
+        <ol><li><p>Let <var title="">s</var> be a string consisting of a
+         U+0025 PERCENT SIGN character (%) followed by two characters
+         in the ranges U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9)
+         and U+0041 LATIN CAPITAL LETTER A to U+0046 LATIN CAPITAL
+         LETTER F representing the hexadecimal value of the byte in
+         question (zero-padded if necessary).</li>
 
-        <ol><li><p>Let <var title="">s</var> be an empty string.</li>
+         <li><p>Encode the string <var title="">s</var> as US-ASCII,
+         so that it is now a byte string.</p>
 
-         <li>
+         <li><p>Replace the byte in question in the name or value
+         being processed by the bytes in <var title="">s</var>,
+         preserving their relative order.</li>
 
-          <p>For each byte <var title="">b</var> of the character when
-          expressed in the selected character encoding in turn, run
-          the appropriate subsubsubstep from the list below:</p>
+        </ol></dd>
 
-          <dl class=switch><dt>If the byte is in the range 0x20, 0x2A, 0x2D, 0x2E,
-           0x30 to 0x39, 0x41 to 0x5A, 0x5F, 0x61 to 0x7A</dt>
+      </dl></li>
 
-           <dd><p>Append to <var title="">s</var> the Unicode
-           character with the code point equal to the byte.</dd>
+     <li>
 
-           <dt>Otherwise</dt>
+      <p>Interpret the entry's name and value as Unicode strings
+      encoded in US-ASCII. (All of the bytes in the string will be in
+      the range 0x00 to 0x7F; the high bit will be zero throughout.)
+      The entry's name and value are now Unicode strings again.</p>
 
-           <dd><p>Append to the string a U+0025 PERCENT SIGN character
-           (%) followed by two characters in the ranges U+0030 DIGIT
-           ZERO (0) to U+0039 DIGIT NINE (9) and U+0041 LATIN CAPITAL
-           LETTER A to U+0046 LATIN CAPITAL LETTER F representing the
-           hexadecimal value of the byte (zero-padded if
-           necessary).</dd>
+     </li>
 
-          </dl></li>
+     <li><p>If the entry's name is "<code title=attr-fe-name-isindex><a href=#attr-fe-name-isindex>isindex</a></code>", its type is "<code title="">text</code>", and this is the first entry in the <var title="">form data set</var>, then append the value to <var title="">result</var> and skip the rest of the substeps for this
+     entry, moving on to the next entry, if any, or the next step in
+     the overall algorithm otherwise.</li>
 
-        </ol></dd>
-
-      </dl></li>
-
-     <li><p>If the entry's name is "<code title=attr-fe-name-isindex><a href=#attr-fe-name-isindex>isindex</a></code>",
-     its type is "<code title="">text</code>", and this is the first
-     entry in the <var title="">form data set</var>, then append the
-     value to <var title="">result</var> and skip the rest of the
-     substeps for this entry, moving on to the next entry, if any, or
-     the next step in the overall algorithm otherwise.</li>
-
      <li><p>If this is not the first entry, append a single U+0026
      AMPERSAND character (&) to <var title="">result</var>.</li>
 
@@ -52666,8 +52675,8 @@
      </li>
 
      <li><p>Convert the <var title="">name</var> and <var title="">value</var> strings to their byte representation in
-     US-ASCII (i.e. convert the Unicode string to a byte
-     string).</li>
+     ISO-8859-1 (i.e. convert the Unicode string to a byte string,
+     mapping code points to byte values directly).</li>
 
      <li><p>Add a pair consisting of <var title="">name</var> and <var title="">value</var> to <var title="">pairs</var>.</li>
 
@@ -52675,9 +52684,8 @@
 
    <li><p>If any of the name-value pairs in <var title="">pairs</var>
    have a name component consisting of the string "<code title="">_charset_</code>" encoded in US-ASCII, and the value
-   component of the first such pair is the name of a supported
-   character encoding, then let <var title="">encoding</var> be that
-   character encoding.</li>
+   component of the first such pair, when decoded as US-ASCII, is the
+   name of a supported character encoding, then let <var title="">encoding</var> be that character encoding.</li>
 
    <li><p>Convert the name and value components of each name-value
    pair in <var title="">pairs</var> to Unicode by interpreting the

Modified: source
===================================================================
--- source	2011-09-26 22:29:39 UTC (rev 6591)
+++ source	2011-09-27 19:07:34 UTC (rev 6592)
@@ -59183,6 +59183,15 @@
 
   <h5>URL-encoded form data</h5>
 
+  <p class="note">This form data set encoding is in many ways an
+  aberrant monstrosity, the result of many years of implementation
+  accidents and compromises leading to a set of requirements necessary
+  for interoperability, but in no way representing good design
+  practices. In particular, readers are cautioned to pay close
+  attention to the twisted details involving repeated (and in some
+  cases nested) conversions between character encodings and byte
+  sequences.</p>
+
   <div class="impl">
 
   <p>The <dfn><code
@@ -59249,63 +59258,53 @@
 
      <li>
 
-      <p>For each character in the entry's name and value, apply the
+      <p>Encode the entry's name and value using the selected
+      character encoding. The entry's name and value are now byte
+      strings.</p>
+
+     </li>
+
+     <li>
+
+      <p>For each byte in the entry's name and value, apply the
       appropriate subsubsteps from the following list:</p>
 
       <dl class="switch">
 
-       <dt>The character is a U+0020 SPACE character</dt>
+       <dt>The byte is 0x20 (U+0020 SPACE if interpreted as ASCII)</dt>
 
-       <dd>Replace the character with a single U+002B PLUS SIGN
-       character (+).</dd>
+       <dd>Replace the byte with a single 0x2B byte (U+002B PLUS SIGN
+       character (+) if interpreted as ASCII).</dd>
 
 
        <!-- * - . 0-9 a-z _ A-Z -->
 
-       <dt>If the character is in the range U+002A, U+002D, U+002E,
-       U+0030 to U+0039, U+0041 to U+005A, U+005F, U+0061 to
-       U+007A</dt>
+       <dt>If the byte is in the range 0x2A, 0x2D, 0x2E, 0x30 to 0x39,
+       0x41 to 0x5A, 0x5F, 0x61 to 0x7A</dt>
 
-       <dd><p>Leave the character as is.</p></dd>
+       <dd><p>Leave the byte as is.</p></dd>
 
 
        <dt>Otherwise</dt>
 
        <dd>
 
-        <p>Replace the character with a string formed as follows:</p>
-
         <ol>
 
-         <li><p>Let <var title="">s</var> be an empty string.</p></li>
+         <li><p>Let <var title="">s</var> be a string consisting of a
+         U+0025 PERCENT SIGN character (%) followed by two characters
+         in the ranges U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9)
+         and U+0041 LATIN CAPITAL LETTER A to U+0046 LATIN CAPITAL
+         LETTER F representing the hexadecimal value of the byte in
+         question (zero-padded if necessary).</p></li>
 
-         <li>
+         <li><p>Encode the string <var title="">s</var> as US-ASCII,
+         so that it is now a byte string.</p></li>
 
-          <p>For each byte <var title="">b</var> of the character when
-          expressed in the selected character encoding in turn, run
-          the appropriate subsubsubstep from the list below:</p>
+         <li><p>Replace the byte in question in the name or value
+         being processed by the bytes in <var title="">s</var>,
+         preserving their relative order.</p></li>
 
-          <dl class="switch">
-
-           <dt>If the byte is in the range 0x20, 0x2A, 0x2D, 0x2E,
-           0x30 to 0x39, 0x41 to 0x5A, 0x5F, 0x61 to 0x7A</dt>
-
-           <dd><p>Append to <var title="">s</var> the Unicode
-           character with the code point equal to the byte.</p></dd>
-
-           <dt>Otherwise</dt>
-
-           <dd><p>Append to the string a U+0025 PERCENT SIGN character
-           (%) followed by two characters in the ranges U+0030 DIGIT
-           ZERO (0) to U+0039 DIGIT NINE (9) and U+0041 LATIN CAPITAL
-           LETTER A to U+0046 LATIN CAPITAL LETTER F representing the
-           hexadecimal value of the byte (zero-padded if
-           necessary).</p></dd>
-
-          </dl>
-
-         </li>
-
         </ol>
 
        </dd>
@@ -59314,13 +59313,23 @@
 
      </li>
 
-     <li><p>If the entry's name is "<code title="attr-fe-name-isindex">isindex</code>",
-     its type is "<code title="">text</code>", and this is the first
-     entry in the <var title="">form data set</var>, then append the
-     value to <var title="">result</var> and skip the rest of the
-     substeps for this entry, moving on to the next entry, if any, or
-     the next step in the overall algorithm otherwise.</p></li>
+     <li>
 
+      <p>Interpret the entry's name and value as Unicode strings
+      encoded in US-ASCII. (All of the bytes in the string will be in
+      the range 0x00 to 0x7F; the high bit will be zero throughout.)
+      The entry's name and value are now Unicode strings again.</p>
+
+     </li>
+
+     <li><p>If the entry's name is "<code
+     title="attr-fe-name-isindex">isindex</code>", its type is "<code
+     title="">text</code>", and this is the first entry in the <var
+     title="">form data set</var>, then append the value to <var
+     title="">result</var> and skip the rest of the substeps for this
+     entry, moving on to the next entry, if any, or the next step in
+     the overall algorithm otherwise.</p></li>
+
      <li><p>If this is not the first entry, append a single U+0026
      AMPERSAND character (&) to <var
      title="">result</var>.</p></li>
@@ -59438,8 +59447,8 @@
 
      <li><p>Convert the <var title="">name</var> and <var
      title="">value</var> strings to their byte representation in
-     US-ASCII (i.e. convert the Unicode string to a byte
-     string).</p></li>
+     ISO-8859-1 (i.e. convert the Unicode string to a byte string,
+     mapping code points to byte values directly).</p></li>
 
      <li><p>Add a pair consisting of <var title="">name</var> and <var
      title="">value</var> to <var title="">pairs</var>.</p></li>
@@ -59451,9 +59460,9 @@
    <li><p>If any of the name-value pairs in <var title="">pairs</var>
    have a name component consisting of the string "<code
    title="">_charset_</code>" encoded in US-ASCII, and the value
-   component of the first such pair is the name of a supported
-   character encoding, then let <var title="">encoding</var> be that
-   character encoding.</p></li>
+   component of the first such pair, when decoded as US-ASCII, is the
+   name of a supported character encoding, then let <var
+   title="">encoding</var> be that character encoding.</p></li>
 
    <li><p>Convert the name and value components of each name-value
    pair in <var title="">pairs</var> to Unicode by interpreting the
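
A rough sketch of the two decoding-side changes in the last hunks (ISO-8859-1 instead of US-ASCII for the string-to-bytes step, and decoding the _charset_ value as US-ASCII before looking it up). Everything named here is an assumption made for illustration: the helper decode_pairs, the "utf-8" default, the use of Python's codecs.lookup as the test for "a supported character encoding", and the "replace" error handling for the final conversion, whose wording is cut off in this diff.

```python
import codecs


def decode_pairs(pairs, default="utf-8"):
    """Hypothetical helper.

    pairs: list of (name, value) strings whose code points are all
    U+0000 to U+00FF, i.e. strings that stand for raw bytes.
    """
    # ISO-8859-1 maps each code point directly to the byte of the same value.
    byte_pairs = [(n.encode("latin-1"), v.encode("latin-1")) for n, v in pairs]

    encoding = default
    for name, value in byte_pairs:
        if name == b"_charset_":
            try:
                label = value.decode("ascii")  # value is decoded as US-ASCII
                codecs.lookup(label)           # stand-in for "supported encoding"
                encoding = label
            except (UnicodeDecodeError, LookupError, ValueError):
                pass
            break  # only the first such pair is considered

    # Interpret the bytes in the chosen encoding (error handling is assumed).
    return [
        (n.decode(encoding, "replace"), v.decode(encoding, "replace"))
        for n, v in byte_pairs
    ]


print(decode_pairs([("q", "\xc3\xa9")]))
print(decode_pairs([("_charset_", "iso-8859-1"), ("q", "\xe9")]))
```

The ISO-8859-1 step matters because a US-ASCII conversion has no defined result for code points above U+007F, whereas the direct mapping keeps every byte value intact for the later reinterpretation.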