[html5] r8722 - [e] (0) Adjust notes on encoding detection Fixing https://www.w3.org/Bugs/Public [...]

Wed Aug 27 16:12:40 PDT 2014

Author: ianh
Date: 2014-08-27 16:12:36 -0700 (Wed, 27 Aug 2014)
New Revision: 8722

Modified:
   complete.html
   index
   source
Log:
[e] (0) Adjust notes on encoding detection
Fixing https://www.w3.org/Bugs/Public/show_bug.cgi?id=25534
Affected topics: HTML Syntax and Parsing

Modified: complete.html
===================================================================
--- complete.html	2014-08-27 00:03:22 UTC (rev 8721)
+++ complete.html	2014-08-27 23:12:36 UTC (rev 8722)
@@ -291,7 +291,7 @@
   </style><link rel=stylesheet href=status.css><body onload=init()>
   <header id=head class="head with-buttons">
    <p><a href=//www.whatwg.org/ class=logo><img src=/images/logo width=101 alt=WHATWG height=101></a></p>
-   <hgroup><h1 class=allcaps>HTML</h1><h2 id=living-standard-—-last-updated-[date:-01-jan-1901] class="no-num no-toc">Living Standard — Last Updated <span class=pubdate>26 August 2014</span></h2></hgroup>
+   <hgroup><h1 class=allcaps>HTML</h1><h2 id=living-standard-—-last-updated-[date:-01-jan-1901] class="no-num no-toc">Living Standard — Last Updated <span class=pubdate>27 August 2014</span></h2></hgroup>
    
    <nav>
     <div>
@@ -71079,11 +71079,18 @@
     encoding, then return that encoding, with the <a href=#concept-encoding-confidence id=determining-the-character-encoding:concept-encoding-confidence-8>confidence</a> <i>tentative</i>, and abort these steps.
     <a href=#refsUNIVCHARDET>[UNIVCHARDET]</a></p>
 
-    <p class=note>The UTF-8 encoding has a highly detectable bit pattern. Documents that contain
-    bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8,
-    while documents with byte sequences that do not match it are very likely not. User-agents are
-    therefore encouraged to search for this common encoding. <a href=#refsPPUTF8>[PPUTF8]</a> <a href=#refsUTF8DET>[UTF8DET]</a></p>
+    <p class=note>User agents are generally discouraged from attempting to autodetect encodings
+    for resources obtained over the network, since doing so involves inherently non-interoperable
+    heuristics. Attempting to detect encodings based on an HTML document's preamble is especially
+    tricky since HTML markup typically uses only ASCII characters, and HTML documents tend to begin
+    with a lot of markup rather than with text content.</p>
 
+    <p class=note>The UTF-8 encoding has a highly detectable bit pattern. Files from the local
+    file system that contain bytes with values greater than 0x7F which match the UTF-8 pattern are
+    very likely to be UTF-8, while documents with byte sequences that do not match it are very
+    likely not. When a user agent can examine the whole file, rather than just the preamble,
+    detecting for UTF-8 specifically can be especially effective. <a href=#refsPPUTF8>[PPUTF8]</a> <a href=#refsUTF8DET>[UTF8DET]</a></p>
+
    <li>
 
     <p>Otherwise, return an implementation-defined or user-specified default character encoding,

Modified: index
===================================================================
--- index	2014-08-27 00:03:22 UTC (rev 8721)
+++ index	2014-08-27 23:12:36 UTC (rev 8722)
@@ -291,7 +291,7 @@
   </style><link rel=stylesheet href=status.css><body onload=init()>
   <header id=head class="head with-buttons">
    <p><a href=//www.whatwg.org/ class=logo><img src=/images/logo width=101 alt=WHATWG height=101></a></p>
-   <hgroup><h1 class=allcaps>HTML</h1><h2 id=living-standard-—-last-updated-[date:-01-jan-1901] class="no-num no-toc">Living Standard — Last Updated <span class=pubdate>26 August 2014</span></h2></hgroup>
+   <hgroup><h1 class=allcaps>HTML</h1><h2 id=living-standard-—-last-updated-[date:-01-jan-1901] class="no-num no-toc">Living Standard — Last Updated <span class=pubdate>27 August 2014</span></h2></hgroup>
    
    <nav>
     <div>
@@ -71079,11 +71079,18 @@
     encoding, then return that encoding, with the <a href=#concept-encoding-confidence id=determining-the-character-encoding:concept-encoding-confidence-8>confidence</a> <i>tentative</i>, and abort these steps.
     <a href=#refsUNIVCHARDET>[UNIVCHARDET]</a></p>
 
-    <p class=note>The UTF-8 encoding has a highly detectable bit pattern. Documents that contain
-    bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8,
-    while documents with byte sequences that do not match it are very likely not. User-agents are
-    therefore encouraged to search for this common encoding. <a href=#refsPPUTF8>[PPUTF8]</a> <a href=#refsUTF8DET>[UTF8DET]</a></p>
+    <p class=note>User agents are generally discouraged from attempting to autodetect encodings
+    for resources obtained over the network, since doing so involves inherently non-interoperable
+    heuristics. Attempting to detect encodings based on an HTML document's preamble is especially
+    tricky since HTML markup typically uses only ASCII characters, and HTML documents tend to begin
+    with a lot of markup rather than with text content.</p>
 
+    <p class=note>The UTF-8 encoding has a highly detectable bit pattern. Files from the local
+    file system that contain bytes with values greater than 0x7F which match the UTF-8 pattern are
+    very likely to be UTF-8, while documents with byte sequences that do not match it are very
+    likely not. When a user agent can examine the whole file, rather than just the preamble,
+    detecting for UTF-8 specifically can be especially effective. <a href=#refsPPUTF8>[PPUTF8]</a> <a href=#refsUTF8DET>[UTF8DET]</a></p>
+
    <li>
 
     <p>Otherwise, return an implementation-defined or user-specified default character encoding,

Modified: source
===================================================================
--- source	2014-08-27 00:03:22 UTC (rev 8721)
+++ source	2014-08-27 23:12:36 UTC (rev 8722)
@@ -95700,11 +95700,19 @@
     data-x="concept-encoding-confidence">confidence</span> <i>tentative</i>, and abort these steps.
     <ref spec=UNIVCHARDET></p>
 
-    <p class="note">The UTF-8 encoding has a highly detectable bit pattern. Documents that contain
-    bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8,
-    while documents with byte sequences that do not match it are very likely not. User-agents are
-    therefore encouraged to search for this common encoding. <ref spec=PPUTF8> <ref spec=UTF8DET></p>
+    <p class="note">User agents are generally discouraged from attempting to autodetect encodings
+    for resources obtained over the network, since doing so involves inherently non-interoperable
+    heuristics. Attempting to detect encodings based on an HTML document's preamble is especially
+    tricky since HTML markup typically uses only ASCII characters, and HTML documents tend to begin
+    with a lot of markup rather than with text content.</p>
 
+    <p class="note">The UTF-8 encoding has a highly detectable bit pattern. Files from the local
+    file system that contain bytes with values greater than 0x7F which match the UTF-8 pattern are
+    very likely to be UTF-8, while documents with byte sequences that do not match it are very
+    likely not. When a user agent can examine the whole file, rather than just the preamble,
+    detecting for UTF-8 specifically can be especially effective. <ref spec=PPUTF8> <ref
+    spec=UTF8DET></p>
+
    </li>
 
    <li>
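
Illustration (not part of the commit): the revised note says that bytes above 0x7F which match the UTF-8 pattern are strong evidence of UTF-8, and that the check works best when the whole file can be examined. A minimal sketch of that check follows, assuming Python and its built-in strict UTF-8 decoder. The function name classify_utf8 and its three return labels are hypothetical, chosen for this example; they do not come from the spec or from any user agent.

```python
def classify_utf8(data: bytes) -> str:
    """Classify a complete byte sequence as 'ascii', 'utf-8', or 'not-utf-8'.

    Hypothetical helper for illustration; the labels are assumptions.
    """
    # With no bytes above 0x7F, the UTF-8 pattern offers no evidence either
    # way. This is the typical state of an HTML preamble, which is mostly
    # ASCII markup.
    if all(b <= 0x7F for b in data):
        return "ascii"
    try:
        # The strict decoder rejects stray continuation bytes, truncated
        # sequences, overlong forms, and encoded surrogates.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "not-utf-8"
    return "utf-8"


if __name__ == "__main__":
    print(classify_utf8(b"<!DOCTYPE html><title>x</title>"))
    print(classify_utf8("caf\u00e9".encode("utf-8")))
    print(classify_utf8("caf\u00e9".encode("windows-1252")))
```

Passing only a prefix of a file can cut a multi-byte sequence in half, which the strict decoder reports as invalid. That is one reason the sketch assumes the full contents are available, as with a local file.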