[html5] r1013 - /
whatwg at whatwg.org
Mon Aug 20 17:06:17 PDT 2007
Author: ianh
Date: 2007-08-20 17:06:16 -0700 (Mon, 20 Aug 2007)
New Revision: 1013
Modified:
index
source
Log:
[] (0) Adjust the sniffing rules to take into account data I've been provided regarding the frequency of occurrence of strings in typical Web content.
Modified: index
===================================================================
--- index 2007-08-16 22:21:02 UTC (rev 1012)
+++ index 2007-08-21 00:06:16 UTC (rev 1013)
@@ -22,7 +22,7 @@
<h1 id=html-5>HTML 5</h1>
- <h2 class="no-num no-toc" id=working>Working Draft — 16 August 2007</h2>
+ <h2 class="no-num no-toc" id=working>Working Draft — 21 August 2007</h2>
<p>You can take part in this work. <a
href="http://www.whatwg.org/mailing-list">Join the working group's
@@ -26058,10 +26058,18 @@
the resource (in lowercase<!-- XXX ASCII case folding -->, ignoring any
parameters). If there is no such type, jump to the <em
title="content-type sniffing: unknown type"><a
- href="#content-type5">unknown type</a></em> step below.</p>
+ href="#content-type5">unknown type</a></em> step below.
- <p class=big-issue>...or if the type has no slash or is */*? Probably we
- should also sniff in that case.
+ <li>
+ <p>If <var title="">official type</var> is "unknown/unknown" or
+ "application/unknown", jump to the <em title="content-type sniffing:
+ unknown type"><a href="#content-type5">unknown type</a></em> step below.</p>
+ <!-- In a study looking at many billions of pages whose first five
+ characters were "<HTML", "unknown/unknown" was used to label
+ documents about once for every 5000 pages labelled "text/html", and
+ "application/unknown" was used about once for every 35000 pages
+ labelled "text/html". -->
+
<li>
<p>If <var title="">official type</var> ends in "+xml", or if it is
@@ -26186,26 +26194,97 @@
<li>
<p>For each row in the table below:</p>
- <ol>
- <li>Let <var title="">pattern length</var> be the length of the pattern
- (number of bytes described by the cell in the second column of the
- row).
+ <dl class=switch>
+ <dt>If the row has no bytes with a trailing asterisk:
- <li>If <var title="">pattern length</var> is smaller than <var
- title="">stream length</var> then skip this row.
+ <dd>
+ <ol>
+ <li>Let <var title="">pattern length</var> be the length of the
+ pattern (number of bytes described by the cell in the second column
+ of the row).
- <li>Apply the "and" operator to the first <var title="">pattern
- length</var> bytes of the resource and the given mask (the bytes in the
- cell of first column of that row), and let the result be the <var
- title="">data</var>.
+ <li>If <var title="">stream length</var> is smaller than <var
+ title="">pattern length</var> then skip this row.
- <li>If the bytes of the <var title="">data</var> matches the given
- pattern bytes exactly, then the sniffed type of the resource is the
- type given in the cell of the third column in that row; abort these
- steps.
- </ol>
+ <li>Apply the "and" operator to the first <var title="">pattern
+ length</var> bytes of the resource and the given mask (the bytes in
+ the cell of first column of that row), and let the result be the <var
+ title="">data</var>.
+ <li>If the bytes of the <var title="">data</var> matches the given
+ pattern bytes exactly, then the sniffed type of the resource is the
+ type given in the cell of the third column in that row; abort these
+ steps.
+ </ol>
+
+ <dt>If the row has an asterisk after one of the bytes:
+
+ <dd>
+ </dl>
+
<li>
+ <p>Let <var title="">index<sub>pattern</sub></var> be an index into the
+ mask and pattern byte strings of the row.
+
+ <li>
+ <p>Let <var title="">index<sub>stream</sub></var> be an index into the
+ byte stream being examined.
+
+ <li>
+ <p><em>Loop</em>: If <var title="">index<sub>stream</sub></var> points
+ beyond the end of the byte stream, then this row doesn't match, skip
+ this row.
+
+ <li>
+ <p><em>Byte</em>: Examine the <var
+ title="">index<sub>stream</sub></var>th byte of the byte stream as
+ follows:</p>
+
+ <dl class=switch>
+ <dt>If the <var title="">index<sub>pattern</sub></var>th byte of the
+ pattern does not have an asterisk after it:
+
+ <dd>
+ <p>If the "and" operator, applied to the <var
+ title="">index<sub>stream</sub></var>th byte of the stream and the
+ <var title="">index<sub>pattern</sub></var>th byte of the mask, yields
+ a value different from the <var
+ title="">index<sub>pattern</sub></var>th byte of the pattern, then
+ skip this row.</p>
+
+ <p>Otherwise, increment <var title="">index<sub>pattern</sub></var> to
+ the next byte in the mask and pattern and <var
+ title="">index<sub>stream</sub></var> to the next byte in the byte
+ stream.</p>
+
+ <dt>Otherwise, if the <var title="">index<sub>pattern</sub></var>th byte
+ of the pattern <em>does</em> have an asterisk after it:
+
+ <dd>
+ <p>If the "and" operator, applied to the <var
+ title="">index<sub>stream</sub></var>th byte of the stream and the
+ <var title="">index<sub>pattern</sub></var>th byte of the mask, yields
+ a value different from the <var
+ title="">index<sub>pattern</sub></var>th byte of the pattern, then
+ increment only the <var title="">index<sub>pattern</sub></var> to the
+ next byte in the mask and pattern and jump to the top of the <em>byte</em>
+ step in this algorithm.</p>
+
+ <p>Otherwise, increment only the <var
+ title="">index<sub>stream</sub></var> to the next byte in the byte
+ stream.</p>
+ </dl>
+
+ <li>
+ <p>If <var title="">index<sub>pattern</sub></var> does not point beyond
+ the end of the mask and pattern byte strings, then return to the
+ <em>loop</em> step in this algorithm.
+
+ <li>
+ <p>Otherwise, the sniffed type of the resource is the type given in the
+ cell of the third column in that row; abort these steps.
+
+ <li>
<p>As a last-ditch effort, jump to the <a href="#content-type4"
title="content-type sniffing: text or binary">text or binary</a>
section.
@@ -26230,6 +26309,7 @@
<td>FF FF DF DF DF DF DF DF DF FF DF DF DF DF
<td>3C 21 44 4F 43 54 59 50 45 20 48 54 4D 4C <!-- "<!DOCTYPE HTML" -->
+ <!-- common in static data -->
<td>text/html
@@ -26237,24 +26317,38 @@
compatible encodings, case-insensitively.
<tr>
- <td>FF DF DF DF DF
+ <td>FF* FF DF DF DF DF
- <td>3C 48 54 4D 4C <!-- "<HTML" -->
+ <td>20* 3C 48 54 4D 4C <!-- "<HTML" --> <!-- common in static data -->
<td>text/html
<td>The string "<code title=""><HTML</code>" in US-ASCII or
- compatible encodings, case-insensitively.
+ compatible encodings, case-insensitively, possibly with leading spaces.
+
<tr>
- <td>FF DF DF DF DF DF DF
+ <td>FF* FF DF DF DF DF
- <td>3C 53 43 52 49 50 54 <!-- "<SCRIPT" -->
+ <td>20* 3C 48 45 41 44 <!-- "<HEAD" --> <!-- common in static data -->
<td>text/html
+ <td>The string "<code title=""><HEAD</code>" in US-ASCII or
+ compatible encodings, case-insensitively, possibly with leading spaces.
+
+
+ <tr>
+ <td>FF* FF DF DF DF DF DF DF
+
+ <td>20* 3C 53 43 52 49 50 54 <!-- "<SCRIPT" -->
+ <!-- common in dynamic data -->
+
+ <td>text/html
+
<td>The string "<code title=""><SCRIPT</code>" in US-ASCII or
- compatible encodings, case-insensitively.
+ compatible encodings, case-insensitively, possibly with leading spaces.
+
<tr>
<td>FF FF FF FF FF
@@ -26575,7 +26669,10 @@
<p>For HTTP resources, only the Content-Type HTTP header contributes any
data; the explicit type of the resource is then the value of that header,
- interpreted as described by the HTTP specifications. <a
+ interpreted as described by the HTTP specifications. If the Content-Type
+ HTTP header is present but it cannot be interpreted as described by the
+ HTTP specifications (e.g. because its value doesn't contain a U+002F
+ SOLIDUS ('/') character), then the resource has no type information. <a
href="#refsHTTP">[HTTP]</a>
<p>For resources fetched from the filesystem, user agents should use
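The Content-Type handling touched by this revision can be sketched roughly as follows. This is only an illustration in Python under stated assumptions: the function names official_type and needs_unknown_type_sniffing are hypothetical, and the header parsing is a deliberate simplification (split on ";", trim, lowercase) rather than the full grammar from the HTTP specifications.

```python
from typing import Optional

# Labels that this revision treats like a missing type.
UNKNOWN_LABELS = {"unknown/unknown", "application/unknown"}


def official_type(content_type: Optional[str]) -> Optional[str]:
    """Return the lowercased type without parameters, or None if unusable.

    Simplified assumption: a value without a U+002F SOLIDUS ('/') is
    treated as carrying no type information at all.
    """
    if content_type is None:
        return None
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if "/" not in media_type:
        return None
    return media_type


def needs_unknown_type_sniffing(content_type: Optional[str]) -> bool:
    """True when the sniffer would jump to the "unknown type" step."""
    media_type = official_type(content_type)
    return media_type is None or media_type in UNKNOWN_LABELS
```

For example, needs_unknown_type_sniffing("Application/Unknown") and needs_unknown_type_sniffing("garbage") are both true under this sketch, while a header such as "text/html; charset=utf-8" is not sent to the unknown type step.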
Modified: source
===================================================================
--- source 2007-08-16 22:21:02 UTC (rev 1012)
+++ source 2007-08-21 00:06:16 UTC (rev 1013)
@@ -23585,9 +23585,17 @@
resource (in lowercase<!-- XXX ASCII case folding -->, ignoring any
parameters). If there is no such type, jump to the <em
title="content-type sniffing: unknown type">unknown type</em> step
- below.</p> <p class="big-issue">...or if the type has no slash or
- is */*? Probably we should also sniff in that case.</p></li>
+ below.</p></li>
+ <li><p>If <var title="">official type</var> is "unknown/unknown" or
+ "application/unknown", jump to the <em title="content-type
+ sniffing: unknown type">unknown type</em> step below.</p>
+ <!-- In a study looking at many billions of pages whose first five
+ characters were "<HTML", "unknown/unknown" was used to label
+ documents about once for every 5000 pages labelled "text/html", and
+ "application/unknown" was used about once for every 35000 pages
+ labelled "text/html". -->
+
<li><p>If <var title="">official type</var> ends in "+xml", or if
it is either "text/xml" or "application/xml", then the sniffed
type of the resource is <var title="">official type</var>; return
@@ -23690,27 +23698,117 @@
<li><p>For each row in the table below:</p>
- <ol>
+ <dl class="switch">
- <li>Let <var title="">pattern length</var> be the length of the
- pattern (number of bytes described by the cell in the second
- column of the row).</li>
+ <dt>If the row has no bytes with a trailing asterisk:</dt>
- <li>If <var title="">pattern length</var> is smaller than <var
- title="">stream length</var> then skip this row.</li>
+ <dd>
- <li>Apply the "and" operator to the first <var title="">pattern
- length</var> bytes of the resource and the given mask (the bytes
- in the cell of first column of that row), and let the result be
- the <var title="">data</var>.</li>
+ <ol>
- <li>If the bytes of the <var title="">data</var> matches the
- given pattern bytes exactly, then the sniffed type of the
- resource is the type given in the cell of the third column in
- that row; abort these steps.</li>
+ <li>Let <var title="">pattern length</var> be the length of the
+ pattern (number of bytes described by the cell in the second
+ column of the row).</li>
- </ol>
+ <li>If <var title="">stream length</var> is smaller than <var
+ title="">pattern length</var> then skip this row.</li>
+ <li>Apply the "and" operator to the first <var title="">pattern
+ length</var> bytes of the resource and the given mask (the
+ bytes in the cell of first column of that row), and let the
+ result be the <var title="">data</var>.</li>
+
+ <li>If the bytes of the <var title="">data</var> matches the
+ given pattern bytes exactly, then the sniffed type of the
+ resource is the type given in the cell of the third column in
+ that row; abort these steps.</li>
+
+ </ol>
+
+ </dd>
+
+ <dt>If the row has an asterisk after one of the bytes:</dt>
+
+ <dd>
+
+ <li><p>Let <var title="">index<sub>pattern</sub></var> be an
+ index into the mask and pattern byte strings of the
+ row.</p></li>
+
+ <li><p>Let <var title="">index<sub>stream</sub></var> be an
+ index into the byte stream being examined.</p></li>
+
+ <li><p><em>Loop</em>: If <var
+ title="">index<sub>stream</sub></var> points beyond the end of
+ the byte stream, then this row doesn't match, skip this
+ row.</p></li>
+
+ <li>
+
+ <p><em>Byte</em>: Examine the <var
+ title="">index<sub>stream</sub></var>th byte of the byte
+ stream as follows:</p>
+
+ <dl class="switch">
+
+ <dt>If the <var title="">index<sub>pattern</sub></var>th byte of
+ the pattern does not have an asterisk after it:</dt>
+
+ <dd>
+
+ <p>If the "and" operator, applied to the <var
+ title="">index<sub>stream</sub></var>th byte of the stream
+ and the <var title="">index<sub>pattern</sub></var>th byte
+ of the mask, yields a value different from the <var
+ title="">index<sub>pattern</sub></var>th byte of the
+ pattern, then skip this row.</p>
+
+ <p>Otherwise, increment <var
+ title="">index<sub>pattern</sub></var> to the next byte in
+ the mask and pattern and <var
+ title="">index<sub>stream</sub></var> to the next byte in
+ the byte stream.</p>
+
+ </dd>
+
+ <dt>Otherwise, if the <var
+ title="">index<sub>pattern</sub></var>th byte of the pattern
+ <em>does</em> have an asterisk after it:</dt>
+
+ <dd>
+
+ <p>If the "and" operator, applied to the <var
+ title="">index<sub>stream</sub></var>th byte of the stream
+ and the <var title="">index<sub>pattern</sub></var>th byte
+ of the mask, yields a value different from the <var
+ title="">index<sub>pattern</sub></var>th byte of the
+ pattern, then increment only the <var
+ title="">index<sub>pattern</sub></var> to the next byte in
+ the mask and pattern and jump to the top of the <em>byte</em>
+ step in this algorithm.</p>
+
+ <p>Otherwise, increment only the <var
+ title="">index<sub>stream</sub></var> to the next byte in
+ the byte stream.</p>
+
+ </dd>
+
+ </dl>
+
+ </li>
+
+ <li><p>If <var title="">index<sub>pattern</sub></var> does not
+ point beyond the end of the mask and pattern byte strings, then
+ return to the <em>loop</em> step in this algorithm.</p></li>
+
+ <li><p>Otherwise, the sniffed type of the resource is the type
+ given in the cell of the third column in that row; abort these
+ steps.</p></li>
+
+ </dd>
+
+ </dl>
+
</li>
<li><p>As a last-ditch effort, jump to the <span
@@ -23731,20 +23829,25 @@
<tbody>
<tr>
<td>FF FF DF DF DF DF DF DF DF FF DF DF DF DF
- <td>3C 21 44 4F 43 54 59 50 45 20 48 54 4D 4C <!-- "<!DOCTYPE HTML" -->
+ <td>3C 21 44 4F 43 54 59 50 45 20 48 54 4D 4C <!-- "<!DOCTYPE HTML" --> <!-- common in static data -->
<td>text/html
<td>The string "<code title=""><!DOCTYPE HTML</code>" in US-ASCII or compatible encodings, case-insensitively.
<tr>
- <td>FF DF DF DF DF
- <td>3C 48 54 4D 4C <!-- "<HTML" -->
+ <td>FF* FF DF DF DF DF
+ <td>20* 3C 48 54 4D 4C <!-- "<HTML" --> <!-- common in static data -->
<td>text/html
- <td>The string "<code title=""><HTML</code>" in US-ASCII or compatible encodings, case-insensitively.
+ <td>The string "<code title=""><HTML</code>" in US-ASCII or compatible encodings, case-insensitively, possibly with leading spaces.
<tr>
- <td>FF DF DF DF DF DF DF
- <td>3C 53 43 52 49 50 54 <!-- "<SCRIPT" -->
+ <td>FF* FF DF DF DF DF
+ <td>20* 3C 48 45 41 44 <!-- "<HEAD" --> <!-- common in static data -->
<td>text/html
- <td>The string "<code title=""><SCRIPT</code>" in US-ASCII or compatible encodings, case-insensitively.
+ <td>The string "<code title=""><HEAD</code>" in US-ASCII or compatible encodings, case-insensitively, possibly with leading spaces.
<tr>
+ <td>FF* FF DF DF DF DF DF DF
+ <td>20* 3C 53 43 52 49 50 54 <!-- "<SCRIPT" --> <!-- common in dynamic data -->
+ <td>text/html
+ <td>The string "<code title=""><SCRIPT</code>" in US-ASCII or compatible encodings, case-insensitively, possibly with leading spaces.
+ <tr>
<td>FF FF FF FF FF
<td>25 50 44 46 2D <!-- "%PDF-" (from http://lxr.mozilla.org/seamonkey/source/netwerk/streamconv/converters/nsUnknownDecoder.cpp#321) -->
<td>application/pdf
@@ -24023,8 +24126,11 @@
<p>For HTTP resources, only the Content-Type HTTP header contributes
any data; the explicit type of the resource is then the value of
- that header, interpreted as described by the HTTP specifications. <a
- href="#refsHTTP">[HTTP]</a></p>
+ that header, interpreted as described by the HTTP specifications. If
+ the Content-Type HTTP header is present but it cannot be interpreted
+ as described by the HTTP specifications (e.g. because its value
+ doesn't contain a U+002F SOLIDUS ('/') character), then the resource
+ has no type information. <a href="#refsHTTP">[HTTP]</a></p>
<p>For resources fetched from the filesystem, user agents should use
platform-specific conventions, e.g. operating system extension/type
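The asterisk-aware row matching added in this revision can be sketched as below. This is a minimal Python illustration, not the spec's normative text: parse_cells and row_matches are hypothetical names, the cell syntax ("20* 3C 48 ...") is taken from the table rows in the diff, and the sketch assumes the asterisk test refers to the byte at index_pattern. A mask byte of DF clears bit 5, which is what makes the ASCII letter comparison case-insensitive.

```python
from typing import List, Tuple

# A table cell entry: (byte value, whether it carries a trailing asterisk).
Cell = Tuple[int, bool]


def parse_cells(text: str) -> List[Cell]:
    """Parse a table cell such as "20* 3C 48" into (byte, starred) pairs."""
    cells = []
    for token in text.split():
        starred = token.endswith("*")
        cells.append((int(token.rstrip("*"), 16), starred))
    return cells


def row_matches(mask_text: str, pattern_text: str, stream: bytes) -> bool:
    """Return True if the stream matches one mask/pattern row."""
    mask = parse_cells(mask_text)
    pattern = parse_cells(pattern_text)
    index_pattern = 0
    index_stream = 0
    while index_pattern < len(pattern):
        # Loop step: running off the end of the stream means no match.
        if index_stream >= len(stream):
            return False
        mask_byte, _ = mask[index_pattern]
        pattern_byte, starred = pattern[index_pattern]
        hit = (stream[index_stream] & mask_byte) == pattern_byte
        if starred:
            if hit:
                index_stream += 1   # consume one more repeated byte
            else:
                index_pattern += 1  # zero or more repeats are done
        elif hit:
            index_pattern += 1
            index_stream += 1
        else:
            return False
    return True
```

With the "<HTML" row from the table, row_matches("FF* FF DF DF DF DF", "20* 3C 48 54 4D 4C", b"   <html>") holds, while a stream consisting only of spaces does not match because the stream ends before the pattern does. Rows without an asterisk, such as the "%PDF-" row, go through the same loop and behave like a plain masked prefix comparison.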