[html5] r1927 - [] (0) Make content-sniffing 'better': make the text/binary case actually work o [...]

Wed Jul 23 20:06:26 PDT 2008

Author: ianh
Date: 2008-07-23 20:06:25 -0700 (Wed, 23 Jul 2008)
New Revision: 1927

Modified:
   index
   source
Log:
[] (0) Make content-sniffing 'better': make the text/binary case actually work out what the binary data might be; make the unknown type case determine the text/plain cases as a first-class citizen instead of falling back on the text/binary algorithm; fix minor grammatical things.
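
For readers skimming the diff, the revised "text or binary" steps can be sketched roughly as below. This is only an illustrative sketch, not text from the spec: the function and constant names are hypothetical, `data` is assumed to hold the first n bytes already available, the BOM rows are the ones this revision copies into the "unknown type" table, and only the rows marked "Safe" in that table are consulted. The UTF-16LE pattern is written as FF FE, the actual little-endian BOM.

```python
# Illustrative sketch of the revised "text or binary" sniffing steps.
# All names here are hypothetical; they do not appear in the spec.

# "Binary data bytes": 0x00-0x08, 0x0B, 0x0E-0x1A, 0x1C-0x1F.
BINARY_DATA_BYTES = frozenset(
    list(range(0x00, 0x09)) + [0x0B]
    + list(range(0x0E, 0x1B)) + list(range(0x1C, 0x20))
)

# (mask, pattern) pairs for the BOM rows; the 00 mask bytes mean
# "followed by at least one character, whatever it is".
BOM_ROWS = [
    (bytes.fromhex("FFFF0000"), bytes.fromhex("FEFF0000")),  # UTF-16BE
    (bytes.fromhex("FFFF0000"), bytes.fromhex("FFFE0000")),  # UTF-16LE
    (bytes.fromhex("FFFFFF00"), bytes.fromhex("EFBBBF00")),  # UTF-8
]

# (pattern, sniffed type) for rows marked "Safe"; every mask is all FF,
# so a plain prefix comparison is enough for these rows.
SAFE_ROWS = [
    (b"%!PS-Adobe-", "application/postscript"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (bytes.fromhex("89504E470D0A1A0A"), "image/png"),
    (bytes.fromhex("FFD8FF"), "image/jpeg"),
    (b"BM", "image/bmp"),
    (bytes.fromhex("00000100"), "image/vnd.microsoft.icon"),
]


def matches(data: bytes, mask: bytes, pattern: bytes) -> bool:
    """True if the leading bytes of data, ANDed with mask, equal pattern."""
    if len(data) < len(pattern):
        return False
    return all(data[i] & mask[i] == pattern[i] for i in range(len(pattern)))


def sniff_text_or_binary(data: bytes) -> str:
    # Step 1: a BOM followed by at least one byte means text/plain.
    if len(data) >= 4 and any(matches(data, m, p) for m, p in BOM_ROWS):
        return "text/plain"
    # Step 2: no binary data bytes at all means text/plain.
    if not any(b in BINARY_DATA_BYTES for b in data):
        return "text/plain"
    # Step 3: only non-scriptable rows may be returned here.
    for pattern, sniffed in SAFE_ROWS:
        if data.startswith(pattern):
            return sniffed
    # Step 4: everything else is opaque binary.
    return "application/octet-stream"
```

Because the scriptable rows (text/html, application/pdf) are never consulted, a resource that looks like HTML but contains binary data bytes falls through to "application/octet-stream", which is the point of the new "security" column.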

Modified: index
===================================================================

--- index	2008-07-24 02:28:52 UTC (rev 1926)
+++ index	2008-07-24 03:06:25 UTC (rev 1927)
@@ -6104,8 +6104,8 @@
      of bytes already available.
 
    <li>
-    <p>If <var title="">n</var> is 4 or more, and the first bytes of the file
-     match one of the following byte sets:</p>
+    <p>If <var title="">n</var> is 4 or more, and the first bytes of the
+     resource match one of the following byte sets:</p>
 
     <table>
      <thead>
@@ -6151,37 +6151,50 @@
         
     </table>
 
-    <p>...then the sniffed type of the resource is "text/plain".</p>
+    <p>...then the sniffed type of the resource is "text/plain". Abort these
+     steps.</p>
 
    <li>
-    <p>Otherwise, if any of the first <var title="">n</var> bytes of the
-     resource are in one of the following byte ranges:</p>
-    <!-- This byte list is based on RFC 2046 Section 4.1.2. Characters
-    in the range 0x00-0x1F, with the exception of 0x09, 0x0A, 0x0C,
-    0x0D (ASCII for TAB, LF, FF, and CR), and character 0x1B
-    (reportedly used by some encodings as a shift escape), are
-    invalid. Thus, if we see them, we assume it's not text. -->
-    
-    <ul class=brief>
-     <li> 0x00 - 0x08
+    <p>If none of the first <var title="">n</var> bytes of the resource are
+     <a href="#binary">binary data bytes</a> then the sniffed type of the
+     resource is "text/plain". Abort these steps.
 
-     <li> 0x0B
+   <li>
+    <p>If the first bytes of the resource match one of the byte sequences in
+     the "pattern" column of the table in the <i title="content-type
+     sniffing: unknown type"><a href="#content-type7">unknown type</a></i>
+     section below, ignoring any rows whose cell in the "security" column
+     says "scriptable" (or "n/a"), then the sniffed type of the resource is
+     the type given in the corresponding cell in the "sniffed type" column on
+     that row; abort these steps.</p>
 
-     <li> 0x0E - 0x1A
+    <p class=warning>It is critical that this step not ever return a
+     scriptable type (e.g. text/html), as otherwise that would allow a
+     privilege escalation attack.</p>
 
-     <li> 0x1C - 0x1F
-    </ul>
+   <li>
+    <p>Otherwise, the sniffed type of the resource is
+     "application/octet-stream".
+  </ol>
 
-    <p>...then the sniffed type of the resource is
-     "application/octet-stream".</p>
+  <p>Bytes covered by the following ranges are <dfn id=binary>binary data
+   bytes</dfn>:</p>
+  <!-- This byte list is based on RFC 2046 Section 4.1.2. Characters
+  in the range 0x00-0x1F, with the exception of 0x09, 0x0A, 0x0C, 0x0D
+  (ASCII for TAB, LF, FF, and CR), and character 0x1B (reportedly used
+  by some encodings as a shift escape), are invalid. Thus, if we see
+  them, we assume it's not text. -->
 
-    <p class=big-issue>maybe we should invoke the "Content-Type sniffing:
-     image" section now, falling back on "application/octet-stream".</p>
+  <ul class=brief>
+   <li> 0x00 - 0x08
 
-   <li>
-    <p>Otherwise, the sniffed type of the resource is "text/plain".
-  </ol>
+   <li> 0x0B
 
+   <li> 0x0E - 0x1A
+
+   <li> 0x1C - 0x1F
+  </ul>
+
   <h4 id=content-type2><span class=secno>2.7.4 </span><dfn
    id=content-type7>Content-Type sniffing: unknown type</dfn></h4>
 
@@ -6288,11 +6301,17 @@
     </dl>
 
    <li>
-    <p>As a last-ditch effort, jump to the <a href="#content-type6"
-     title="content-type sniffing: text or binary">text or binary</a>
-     section.
+    <p>If none of the first <var title="">n</var> bytes of the resource are
+     <a href="#binary">binary data bytes</a> then the sniffed type of the
+     resource is "text/plain". Abort these steps.
+
+   <li>
+    <p>Otherwise, the sniffed type of the resource is
+     "application/octet-stream".
   </ol>
 
+  <p>The table used by the above algorithm is:
+
   <table>
    <thead>
     <tr>
@@ -6300,6 +6319,8 @@
 
      <th rowspan=2>Sniffed type
 
+     <th rowspan=2>Security
+
      <th rowspan=2>Comment
 
     <tr>
@@ -6316,6 +6337,8 @@
 
      <td>text/html
 
+     <td>Scriptable
+
      <td>The string "<code title=""><!DOCTYPE HTML</code>" in US-ASCII or
       compatible encodings, case-insensitively.
 
@@ -6327,6 +6350,8 @@
 
      <td>text/html
 
+     <td>Scriptable
+
      <td>The string "<code title=""><HTML</code>" in US-ASCII or
       compatible encodings, case-insensitively, possibly with leading spaces.
       
@@ -6339,6 +6364,8 @@
 
      <td>text/html
 
+     <td>Scriptable
+
      <td>The string "<code title=""><HEAD</code>" in US-ASCII or
       compatible encodings, case-insensitively, possibly with leading spaces.
       
@@ -6351,6 +6378,8 @@
 
      <td>text/html
 
+     <td>Scriptable
+
      <td>The string "<code title=""><SCRIPT</code>" in US-ASCII or
       compatible encodings, case-insensitively, possibly with leading spaces.
       
@@ -6364,6 +6393,8 @@
 
      <td>application/pdf
 
+     <td>Scriptable
+
      <td>The string "<code title="">%PDF-</code>", the PDF signature.
 
     <tr>
@@ -6375,17 +6406,56 @@
 
      <td>application/postscript
 
+     <td>Safe
+
      <td>The string "<code title="">%!PS-Adobe-</code>", the PostScript
-      signature. <!-- copied from the section below -->
+      signature. <!-- copied from the text or binary section above -->
 
    <tbody>
     <tr>
+     <td>FF FF 00 00
+
+     <td>FE FF 00 00
+
+     <td>text/plain
+
+     <td>n/a
+
+     <td>UTF-16BE BOM <!-- followed by at least one character -->
+
+    <tr>
+     <td>FF FF 00 00
+
+     <td>FF FE 00 00
+
+     <td>text/plain
+
+     <td>n/a
+
+     <td>UTF-16LE BOM <!-- followed by at least one character -->
+
+    <tr>
+     <td>FF FF FF 00
+
+     <td>EF BB BF 00
+
+     <td>text/plain
+
+     <td>n/a
+
+     <td>UTF-8 BOM <!-- followed by at least one character -->
+      <!-- based on the table in the image section below -->
+
+   <tbody>
+    <tr>
      <td>FF FF FF FF FF FF
 
      <td>47 49 46 38 37 61 <!-- GIF87a -->
 
      <td>image/gif
 
+     <td>Safe
+
      <td>The string "<code title="">GIF87a</code>", a GIF signature.
 
     <tr>
@@ -6395,6 +6465,8 @@
 
      <td>image/gif
 
+     <td>Safe
+
      <td>The string "<code title="">GIF89a</code>", a GIF signature.
 
     <tr>
@@ -6405,6 +6477,8 @@
 
      <td>image/png
 
+     <td>Safe
+
      <td>The PNG signature.
 
     <tr>
@@ -6415,6 +6489,8 @@
 
      <td>image/jpeg
 
+     <td>Safe
+
      <td>A JPEG SOI marker followed by the first byte of another marker.
 
     <tr>
@@ -6424,6 +6500,8 @@
 
      <td>image/bmp
 
+     <td>Safe
+
      <td>The string "<code title="">BM</code>", a BMP signature.
 
     <tr>
@@ -6433,10 +6511,15 @@
 
      <td>image/vnd.microsoft.icon
 
+     <td>Safe
+
      <td>A 0 word followed by a 1 word, a Windows Icon file format
       signature.
   </table>
 
+  <p class=big-issue>I'd like to add types like MPEG, AVI, Flash, Java, etc,
+   to the above table.
+
   <p>User agents may support further types if desired, by implicitly adding
    to the above table. However, user agents should not use any other patterns
    for types already mentioned in the table above, as this could then be used
@@ -6444,11 +6527,15 @@
    determine that content is not HTML and thus safe from XSS attacks, but
    then a user agent detects it as HTML anyway and allows script to execute).
 
+  <p>The column marked "security" is used by the algorithm in the "text or
+   binary" section, to avoid sniffing <code title="">text/plain</code>
+   content as a type that can be used for a privilege escalation attack.
+
   <h4 id=content-type3><span class=secno>2.7.5 </span><dfn
    id=content-type8>Content-Type sniffing: image</dfn></h4>
 
-  <p>If the first bytes of the file match one of the byte sequences in the
-   first columns of the following table, then the sniffed type of the
+  <p>If the first bytes of the resource match one of the byte sequences in
+   the first column of the following table, then the sniffed type of the
    resource is the type given in the corresponding cell in the second column
    on the same row:
 

Modified: source
===================================================================
--- source	2008-07-24 02:28:52 UTC (rev 1926)
+++ source	2008-07-24 03:06:25 UTC (rev 1927)
@@ -4234,7 +4234,7 @@
    <li>
 
     <p>If <var title="">n</var> is 4 or more, and the first bytes of
-    the file match one of the following byte sets:</p>
+    the resource match one of the following byte sets:</p>
 
     <table>
      <thead>
@@ -4268,41 +4268,54 @@
 -->
     </table>
 
-    <p>...then the sniffed type of the resource is "text/plain".</p>
+    <p>...then the sniffed type of the resource is "text/plain". Abort
+    these steps.</p>
 
    </li>
 
-   <li><p>Otherwise, if any of the first <var title="">n</var> bytes
-   of the resource are in one of the following byte ranges:</p>
+   <li><p>If none of the first <var title="">n</var> bytes of the
+   resource are <span>binary data bytes</span> then the sniffed type
+   of the resource is "text/plain". Abort these steps.</p></li>
 
-    <!-- This byte list is based on RFC 2046 Section 4.1.2. Characters
-    in the range 0x00-0x1F, with the exception of 0x09, 0x0A, 0x0C,
-    0x0D (ASCII for TAB, LF, FF, and CR), and character 0x1B
-    (reportedly used by some encodings as a shift escape), are
-    invalid. Thus, if we see them, we assume it's not text. -->
+   <li>
 
-    <ul class="brief">
-     <li> 0x00 - 0x08 </li>
-     <li> 0x0B </li>
-     <li> 0x0E - 0x1A </li>
-     <li> 0x1C - 0x1F </li>
-    </ul>
+    <p>If the first bytes of the resource match one of the byte
+    sequences in the "pattern" column of the table in the <i
+    title="content-type sniffing: unknown type">unknown type</i>
+    section below, ignoring any rows whose cell in the "security"
+    column says "scriptable" (or "n/a"), then the sniffed type of the
+    resource is the type given in the corresponding cell in the
+    "sniffed type" column on that row; abort these steps.</p>
 
-   <p>...then the sniffed type of the resource is
-   "application/octet-stream".</p>
+    <p class="warning">It is critical that this step not ever return a
+    scriptable type (e.g. text/html), as otherwise that would allow a
+    privilege escalation attack.</p>
 
-   <p class="big-issue">maybe we should invoke the "Content-Type
-   sniffing: image" section now, falling back on
-   "application/octet-stream".</p>
-
    </li>
 
    <li><p>Otherwise, the sniffed type of the resource is
-   "text/plain".</p></li>
+   "application/octet-stream".</p></li>
 
   </ol>
 
+  <p>Bytes covered by the following ranges are <dfn>binary data
+  bytes</dfn>:</p>
 
+  <!-- This byte list is based on RFC 2046 Section 4.1.2. Characters
+  in the range 0x00-0x1F, with the exception of 0x09, 0x0A, 0x0C, 0x0D
+  (ASCII for TAB, LF, FF, and CR), and character 0x1B (reportedly used
+  by some encodings as a shift escape), are invalid. Thus, if we see
+  them, we assume it's not text. -->
+
+  <ul class="brief">
+   <li> 0x00 - 0x08 </li>
+   <li> 0x0B </li>
+   <li> 0x0E - 0x1A </li>
+   <li> 0x1C - 0x1F </li>
+  </ul>
+
+
+
   <h4><dfn>Content-Type sniffing: unknown type</dfn></h4>
 
   <ol>
@@ -4433,17 +4446,23 @@
 
    </li>
 
-   <li><p>As a last-ditch effort, jump to the <span
-   title="content-type sniffing: text or binary">text or binary</span>
-   section.</p></li>
+   <li><p>If none of the first <var title="">n</var> bytes of the
+   resource are <span>binary data bytes</span> then the sniffed type
+   of the resource is "text/plain". Abort these steps.</p></li>
 
+   <li><p>Otherwise, the sniffed type of the resource is
+   "application/octet-stream".</p></li>
+
   </ol>
 
+  <p>The table used by the above algorithm is:</p>
+
   <table>
    <thead>
     <tr>
      <th colspan="2">Bytes in Hexadecimal
      <th rowspan="2">Sniffed type
+     <th rowspan="2">Security
      <th rowspan="2">Comment
     <tr>
      <th>Mask
@@ -4453,67 +4472,104 @@
      <td>FF FF DF DF DF DF DF DF DF FF DF DF DF DF
      <td>3C 21 44 4F 43 54 59 50 45 20 48 54 4D 4C <!-- "<!DOCTYPE HTML" --> <!-- common in static data -->
      <td>text/html
+     <td>Scriptable
      <td>The string "<code title=""><!DOCTYPE HTML</code>" in US-ASCII or compatible encodings, case-insensitively.
     <tr>
      <td>FF FF DF DF DF DF
      <td><em>WS</em> 3C 48 54 4D 4C <!-- "<HTML" --> <!-- common in static data -->
      <td>text/html
+     <td>Scriptable
      <td>The string "<code title=""><HTML</code>" in US-ASCII or compatible encodings, case-insensitively, possibly with leading spaces.
     <tr>
      <td>FF FF DF DF DF DF
      <td><em>WS</em> 3C 48 45 41 44 <!-- "<HEAD" --> <!-- common in static data -->
      <td>text/html
+     <td>Scriptable
      <td>The string "<code title=""><HEAD</code>" in US-ASCII or compatible encodings, case-insensitively, possibly with leading spaces.
     <tr>
      <td>FF FF DF DF DF DF DF DF
      <td><em>WS</em> 3C 53 43 52 49 50 54 <!-- "<SCRIPT" --> <!-- common in dynamic data -->
      <td>text/html
+     <td>Scriptable
      <td>The string "<code title=""><SCRIPT</code>" in US-ASCII or compatible encodings, case-insensitively, possibly with leading spaces.
     <tr>
      <td>FF FF FF FF FF
      <td>25 50 44 46 2D <!-- "%PDF-" (from http://lxr.mozilla.org/seamonkey/source/netwerk/streamconv/converters/nsUnknownDecoder.cpp#321) -->
      <td>application/pdf
+     <td>Scriptable
      <td>The string "<code title="">%PDF-</code>", the PDF signature.
     <tr>
      <td>FF FF FF FF FF FF FF FF FF FF FF
      <td>25 21 50 53 2D 41 64 6F 62 65 2D <!-- "%!PS-Adobe-" (from http://lxr.mozilla.org/seamonkey/source/netwerk/streamconv/converters/nsUnknownDecoder.cpp#321) -->
      <td>application/postscript
+     <td>Safe
      <td>The string "<code title="">%!PS-Adobe-</code>", the PostScript signature.
 
-   <!-- copied from the section below -->
+   <!-- copied from the text or binary section above -->
    <tbody>
     <tr>
+     <td>FF FF 00 00
+     <td>FE FF 00 00
+     <td>text/plain
+     <td>n/a
+     <td>UTF-16BE BOM <!-- followed by at least one character -->
+    <tr>
+     <td>FF FF 00 00
+     <td>FF FE 00 00
+     <td>text/plain
+     <td>n/a
+     <td>UTF-16LE BOM <!-- followed by at least one character -->
+    <tr>
+     <td>FF FF FF 00
+     <td>EF BB BF 00
+     <td>text/plain
+     <td>n/a
+     <td>UTF-8 BOM <!-- followed by at least one character -->
+
+   <!-- based on the table in the image section below -->
+   <tbody>
+    <tr>
      <td>FF FF FF FF FF FF
      <td>47 49 46 38 37 61 <!-- GIF87a -->
      <td>image/gif
+     <td>Safe
      <td>The string "<code title="">GIF87a</code>", a GIF signature.
     <tr>
      <td>FF FF FF FF FF FF
      <td>47 49 46 38 39 61 <!-- GIF89a -->
      <td>image/gif
+     <td>Safe
      <td>The string "<code title="">GIF89a</code>", a GIF signature.
     <tr>
      <td>FF FF FF FF FF FF FF FF
      <td>89 50 4E 47 0D 0A 1A 0A <!-- [TAB]PNG[CR][LF][EOF][LF]; 137 80 78 71 13 10 26 10 -->
      <td>image/png
+     <td>Safe
      <td>The PNG signature.
     <tr>
      <td>FF FF FF
      <td>FF D8 FF <!-- SOI marker followed by the first byte of another marker -->
      <td>image/jpeg
+     <td>Safe
      <td>A JPEG SOI marker followed by the first byte of another marker.
     <tr>
      <td>FF FF
      <td>42 4D
      <td>image/bmp
+     <td>Safe
      <td>The string "<code title="">BM</code>", a BMP signature.
     <tr>
      <td>FF FF FF FF
      <td>00 00 01 00
      <td>image/vnd.microsoft.icon
+     <td>Safe
      <td>A 0 word followed by a 1 word, a Windows Icon file format signature.
+
   </table>
 
+  <p class="big-issue">I'd like to add types like MPEG, AVI, Flash,
+  Java, etc, to the above table.</p>
+
   <p>User agents may support further types if desired, by implicitly
   adding to the above table. However, user agents should not use any
   other patterns for types already mentioned in the table above, as
@@ -4522,13 +4578,18 @@
   and thus safe from XSS attacks, but then a user agent detects it as
   HTML anyway and allows script to execute).</p>
 
+  <p>The column marked "security" is used by the algorithm in the
+  "text or binary" section, to avoid sniffing <code
+  title="">text/plain</code> content as a type that can be used for a
+  privilege escalation attack.</p>
 
+
   <h4><dfn>Content-Type sniffing: image</dfn></h4>
 
-  <p>If the first bytes of the file match one of the byte sequences in
-  the first columns of the following table, then the sniffed type of
-  the resource is the type given in the corresponding cell in the
-  second column on the same row:</p>
+  <p>If the first bytes of the resource match one of the byte
+  sequences in the first column of the following table, then the
+  sniffed type of the resource is the type given in the corresponding
+  cell in the second column on the same row:</p>
 
   <table>
    <thead>