[html5] r3304 - [e] (0) Write some explanatory text around the HTML parser.

whatwg at whatwg.org whatwg at whatwg.org
Mon Jun 22 18:33:41 PDT 2009


Author: ianh
Date: 2009-06-22 18:33:39 -0700 (Mon, 22 Jun 2009)
New Revision: 3304

Modified:
   index
   source
Log:
[e] (0) Write some explanatory text around the HTML parser.

Modified: index
===================================================================
--- index	2009-06-17 07:12:07 UTC (rev 3303)
+++ index	2009-06-23 01:33:39 UTC (rev 3304)
@@ -39,7 +39,7 @@
   <div class=head>
    <p><a class=logo href=http://www.whatwg.org/ rel=home><img alt=WHATWG src=/images/logo></a></p>
    <h1>HTML 5</h1>
-   <h2 class="no-num no-toc" id=draft-standard-—-date:-01-jan-1901>Draft Standard — 17 June 2009</h2>
+   <h2 class="no-num no-toc" id=draft-standard-—-date:-01-jan-1901>Draft Standard — 23 June 2009</h2>
    <p>You can take part in this work. <a href=http://www.whatwg.org/mailing-list>Join the working group's discussion list.</a></p>
    <p><strong>Web designers!</strong> We have a <a href=http://blog.whatwg.org/faq/>FAQ</a>, a <a href=http://forums.whatwg.org/>forum</a>, and a <a href=http://www.whatwg.org/mailing-list#help>help mailing list</a> for you!</p>
    <dl><dt>Multiple-page version:</dt>
@@ -992,7 +992,12 @@
        <li><a href=#the-after-after-body-insertion-mode><span class=secno>9.2.5.24 </span>The "after after body" insertion mode</a></li>
        <li><a href=#the-after-after-frameset-insertion-mode><span class=secno>9.2.5.25 </span>The "after after frameset" insertion mode</a></ol></li>
      <li><a href=#the-end><span class=secno>9.2.6 </span>The end</a></li>
-     <li><a href=#coercing-an-html-dom-into-an-infoset><span class=secno>9.2.7 </span>Coercing an HTML DOM into an infoset</a></ol></li>
+     <li><a href=#coercing-an-html-dom-into-an-infoset><span class=secno>9.2.7 </span>Coercing an HTML DOM into an infoset</a></li>
+     <li><a href=#an-introduction-to-error-handling-in-the-parser><span class=secno>9.2.8 </span>An introduction to error handling in the parser</a>
+      <ol>
+       <li><a href=#misnested-tags:-b-i-/b-/i><span class=secno>9.2.8.1 </span>Misnested tags: &lt;b&gt;&lt;i&gt;&lt;/b&gt;&lt;/i&gt;</a></li>
+       <li><a href=#misnested-tags:-b-p-/b-/p><span class=secno>9.2.8.2 </span>Misnested tags: &lt;b&gt;&lt;p&gt;&lt;/b&gt;&lt;/p&gt;</a></li>
+       <li><a href=#unexpected-markup-in-tables><span class=secno>9.2.8.3 </span>Unexpected markup in tables</a></ol></ol></li>
    <li><a href=#namespaces><span class=secno>9.3 </span>Namespaces</a></li>
    <li><a href=#serializing-html-fragments><span class=secno>9.4 </span>Serializing HTML fragments</a></li>
    <li><a href=#parsing-html-fragments><span class=secno>9.5 </span>Parsing HTML fragments</a></li>
@@ -58369,6 +58374,7 @@
   pause flag</dfn>, which must be initially set to false.</p>
 
 
+
   <h4 id=the-input-stream><span class=secno>9.2.2 </span>The <dfn>input stream</dfn></h4>
 
   <p>The stream of Unicode characters that comprises the input to the
@@ -59192,9 +59198,14 @@
   category, and scope markers. The scope markers are inserted when
   entering <code><a href=#the-applet-element>applet</a></code> elements, buttons, <code><a href=#the-object-element>object</a></code>
   elements, marquees, table cells, and table captions, and are used to
-  prevent formatting from "leaking" into <code><a href=#the-applet-element>applet</a></code> elements,
-  buttons, <code><a href=#the-object-element>object</a></code> elements, marquees, and tables.</p>
+  prevent formatting from "leaking" <em>into</em> <code><a href=#the-applet-element>applet</a></code>
+  elements, buttons, <code><a href=#the-object-element>object</a></code> elements, marquees, and
+  tables.</p>
 
+  <p class=note>The scope markers are unrelated to the concept of an
+  element being <a href=#has-an-element-in-scope title="has an element in scope">in
+  scope</a>.</p>
+
   <p>In addition, each element in the <a href=#list-of-active-formatting-elements>list of active formatting
   elements</a> is associated with the token for which it was
   created, so that further elements can be created for that token if
@@ -60970,9 +60981,9 @@
   must be inserted into the <i><a href=#foster-parent-element>foster parent element</a></i>, and the
   <a href=#current-table>current table</a> must be marked as
   <dfn id=tainted>tainted</dfn>. (Once the <a href=#current-table>current table</a> has been
-  <a href=#tainted>tainted</a>, whitespace characters are inserted into the
-  <i><a href=#foster-parent-element>foster parent element</a></i> instead of the <a href=#current-node>current
-  node</a>.)</p>
+  <a href=#tainted>tainted</a>, <a href=#space-character title="space character">space
+  characters</a> are inserted into the <i><a href=#foster-parent-element>foster parent element</a></i>
+  instead of the <a href=#current-node>current node</a>.)</p>
 
   <p>The <dfn id=foster-parent-element>foster parent element</dfn> is the parent element of the
   last <code><a href=#the-table-element>table</a></code> element in the <a href=#stack-of-open-elements>stack of open
@@ -64400,8 +64411,193 @@
 
 
 
-  <h3 id=namespaces><span class=secno>9.3 </span>Namespaces</h3>
+  <h4 id=an-introduction-to-error-handling-in-the-parser><span class=secno>9.2.8 </span>An introduction to error handling in the parser</h4>
 
+  <p><em>This section is non-normative.</em></p>
+
+  <p>This section examines some erroneous markup and discusses how
+  the <a href=#html-parser>HTML parser</a> handles these cases.</p>
+
+
+  <h5 id=misnested-tags:-b-i-/b-/i><span class=secno>9.2.8.1 </span>Misnested tags: &lt;b&gt;&lt;i&gt;&lt;/b&gt;&lt;/i&gt;</h5>
+
+  <p><em>This section is non-normative.</em></p>
+
+  <p>The most often discussed example of erroneous markup is as
+  follows:</p>
+
+  <pre>&lt;p&gt;1&lt;b&gt;2&lt;i&gt;3&lt;/b&gt;4&lt;/i&gt;5&lt;/p&gt;</pre>
+
+  <p>The parsing of this markup is straightforward up to the "3". At
+  this point, the DOM looks like this:</p>
+
+  <ul class=domTree><li class=t1><code><a href=#the-html-element>html</a></code><ul><li class=t1><code><a href=#the-head-element>head</a></code><li class=t1><code><a href=#the-body-element>body</a></code><ul><li class=t1><code><a href=#the-p-element>p</a></code><ul><li class=t3><code>#text</code>: <span title="">1</span><li class=t1><code><a href=#the-b-element>b</a></code><ul><li class=t3><code>#text</code>: <span title="">2</span><li class=t1><code><a href=#the-i-element>i</a></code><ul><li class=t3><code>#text</code>: <span title="">3</span></ul></ul></ul></ul></ul></ul><p>Here, the <a href=#stack-of-open-elements>stack of open elements</a> has five elements
+  on it: <code><a href=#the-html-element>html</a></code>, <code><a href=#the-body-element>body</a></code>, <code><a href=#the-p-element>p</a></code>,
+  <code><a href=#the-b-element>b</a></code>, and <code><a href=#the-i-element>i</a></code>. The <a href=#list-of-active-formatting-elements>list of active
+  formatting elements</a> just has two: <code><a href=#the-b-element>b</a></code> and
+  <code><a href=#the-i-element>i</a></code>. The <a href=#insertion-mode>insertion mode</a> is "<a href=#parsing-main-inbody title="insertion mode: in body">in body</a>".</p>
+
+  <p>Upon receiving the end tag token with the tag name "b", the "<a href=#adoptionAgency>adoption agency algorithm</a>" is
+  invoked. This is a simple case, in that the <var title="">formatting
+  element</var> is the <code><a href=#the-b-element>b</a></code> element, and there is no
+  <var title="">furthest block</var>. Thus, the <a href=#stack-of-open-elements>stack of open
+  elements</a> ends up with just three elements: <code><a href=#the-html-element>html</a></code>,
+  <code><a href=#the-body-element>body</a></code>, and <code><a href=#the-p-element>p</a></code>, while the <a href=#list-of-active-formatting-elements>list of
+  active formatting elements</a> has just one: <code><a href=#the-i-element>i</a></code>. The
+  DOM tree is unmodified at this point.</p>
+
+  <p>The next token, a character ("4"), triggers the <a href=#reconstruct-the-active-formatting-elements title="reconstruct the active formatting elements">reconstruction of
+  the active formatting elements</a>, in this case just the
+  <code><a href=#the-i-element>i</a></code> element. A new <code><a href=#the-i-element>i</a></code> element is thus created
+  for the "4" text node. After the end tag token for the "i" is also
+  received, and the "5" text node is inserted, the DOM looks as
+  follows:</p>
+
+  <ul class=domTree><li class=t1><code><a href=#the-html-element>html</a></code><ul><li class=t1><code><a href=#the-head-element>head</a></code><li class=t1><code><a href=#the-body-element>body</a></code><ul><li class=t1><code><a href=#the-p-element>p</a></code><ul><li class=t3><code>#text</code>: <span title="">1</span><li class=t1><code><a href=#the-b-element>b</a></code><ul><li class=t3><code>#text</code>: <span title="">2</span><li class=t1><code><a href=#the-i-element>i</a></code><ul><li class=t3><code>#text</code>: <span title="">3</span></ul></ul><li class=t1><code><a href=#the-i-element>i</a></code><ul><li class=t3><code>#text</code>: <span title="">4</span></ul><li class=t3><code>#text</code>: <span title="">5</span></ul></ul></ul></ul><h5 id=misnested-tags:-b-p-/b-/p><span class=secno>9.2.8.2 </span>Misnested tags: &lt;b&gt;&lt;p&gt;&lt;/b&gt;&lt;/p&gt;</h5>
+
+  <p><em>This section is non-normative.</em></p>
+
+  <p>A case similar to the previous one is the following:</p>
+
+  <pre>&lt;b&gt;1&lt;p&gt;2&lt;/b&gt;3&lt;/p&gt;</pre>
+
+  <p>Up to the "2" the parsing here is straightforward:</p>
+
+  <ul class=domTree><li class=t1><code><a href=#the-html-element>html</a></code><ul><li class=t1><code><a href=#the-head-element>head</a></code><li class=t1><code><a href=#the-body-element>body</a></code><ul><li class=t1><code><a href=#the-b-element>b</a></code><ul><li class=t3><code>#text</code>: <span title="">1</span><li class=t1><code><a href=#the-p-element>p</a></code><ul><li class=t3><code>#text</code>: <span title="">2</span></ul></ul></ul></ul></ul><p>The interesting part is when the end tag token with the tag name
+  "b" is parsed.</p>
+
+  <p>Before that token is seen, the <a href=#stack-of-open-elements>stack of open
+  elements</a> has four elements on it: <code><a href=#the-html-element>html</a></code>,
+  <code><a href=#the-body-element>body</a></code>, <code><a href=#the-b-element>b</a></code>, and <code><a href=#the-p-element>p</a></code>. The
+  <a href=#list-of-active-formatting-elements>list of active formatting elements</a> just has the one:
+  <code><a href=#the-b-element>b</a></code>. The <a href=#insertion-mode>insertion mode</a> is "<a href=#parsing-main-inbody title="insertion mode: in body">in body</a>".</p>
+
+  <p>Upon receiving the end tag token with the tag name "b", the "<a href=#adoptionAgency>adoption agency algorithm</a>" is invoked, as
+  in the previous example. However, in this case, there <em>is</em> a
+  <var title="">furthest block</var>, namely the <code><a href=#the-p-element>p</a></code> element. Thus,
+  this time the adoption agency algorithm isn't skipped over.</p>
+
+  <p>The <var title="">common ancestor</var> is the <code><a href=#the-body-element>body</a></code>
+  element. A conceptual "bookmark" marks the position of the
+  <code><a href=#the-b-element>b</a></code> in the <a href=#list-of-active-formatting-elements>list of active formatting
+  elements</a>, but since that list has only one element in it,
+  it won't have much effect.</p>
+
+  <p>As the algorithm progresses, <var title="">node</var> ends up set
+  to the formatting element (<code><a href=#the-b-element>b</a></code>), and <var title="">last
+  node</var> ends up set to the <var title="">furthest block</var>
+  (<code><a href=#the-p-element>p</a></code>).</p>
+
+  <p>The <var title="">last node</var> gets appended (moved) to the
+  <var title="">common ancestor</var>, so that the DOM looks like:</p>
+
+  <ul class=domTree><li class=t1><code><a href=#the-html-element>html</a></code><ul><li class=t1><code><a href=#the-head-element>head</a></code><li class=t1><code><a href=#the-body-element>body</a></code><ul><li class=t1><code><a href=#the-b-element>b</a></code><ul><li class=t3><code>#text</code>: <span title="">1</span></ul><li class=t1><code><a href=#the-p-element>p</a></code><ul><li class=t3><code>#text</code>: <span title="">2</span></ul></ul></ul></ul><p>A new <code><a href=#the-b-element>b</a></code> element is created, and the children of the
+  <code><a href=#the-p-element>p</a></code> element are moved to it:</p>
+
+  <ul class=domTree><li class=t1><code><a href=#the-html-element>html</a></code><ul><li class=t1><code><a href=#the-head-element>head</a></code><li class=t1><code><a href=#the-body-element>body</a></code><ul><li class=t1><code><a href=#the-b-element>b</a></code><ul><li class=t3><code>#text</code>: <span title="">1</span></ul><li class=t1><code><a href=#the-p-element>p</a></code></ul></ul></ul><ul class=domTree><li class=t1><code><a href=#the-b-element>b</a></code><ul><li class=t3><code>#text</code>: <span title="">2</span></ul></ul><p>Finally, the new <code><a href=#the-b-element>b</a></code> element is appended to the
+  <code><a href=#the-p-element>p</a></code> element, so that the DOM looks like:</p>
+
+  <ul class=domTree><li class=t1><code><a href=#the-html-element>html</a></code><ul><li class=t1><code><a href=#the-head-element>head</a></code><li class=t1><code><a href=#the-body-element>body</a></code><ul><li class=t1><code><a href=#the-b-element>b</a></code><ul><li class=t3><code>#text</code>: <span title="">1</span></ul><li class=t1><code><a href=#the-p-element>p</a></code><ul><li class=t1><code><a href=#the-b-element>b</a></code><ul><li class=t3><code>#text</code>: <span title="">2</span></ul></ul></ul></ul></ul><p>The <code><a href=#the-b-element>b</a></code> element is removed from the <a href=#list-of-active-formatting-elements>list of
+  active formatting elements</a> and the <a href=#stack-of-open-elements>stack of open
+  elements</a>, so that when the "3" is parsed, it is appended to
+  the <code><a href=#the-p-element>p</a></code> element:</p>
+
+  <ul class=domTree><li class=t1><code><a href=#the-html-element>html</a></code><ul><li class=t1><code><a href=#the-head-element>head</a></code><li class=t1><code><a href=#the-body-element>body</a></code><ul><li class=t1><code><a href=#the-b-element>b</a></code><ul><li class=t3><code>#text</code>: <span title="">1</span></ul><li class=t1><code><a href=#the-p-element>p</a></code><ul><li class=t1><code><a href=#the-b-element>b</a></code><ul><li class=t3><code>#text</code>: <span title="">2</span></ul><li class=t3><code>#text</code>: <span title="">3</span></ul></ul></ul></ul><h5 id=unexpected-markup-in-tables><span class=secno>9.2.8.3 </span>Unexpected markup in tables</h5>
+
+  <p><em>This section is non-normative.</em></p>
+
+  <p>Error handling in tables is, for historical reasons, especially
+  strange. For example, consider the following markup:</p>
+
+  <pre>&lt;table&gt;<strong>&lt;b&gt;</strong>&lt;tr&gt;&lt;td&gt;aaa&lt;/td&gt;&lt;/tr&gt;<strong>bbb</strong>&lt;/table&gt;ccc</pre>
+
+  <p>The highlighted <code><a href=#the-b-element>b</a></code> element start tag is not allowed
+  directly inside a table like that, and the parser handles this case
+  by placing the element <em>before</em> the table. (This is called <i title="foster parent"><a href=#foster-parent>foster parenting</a></i>.) This can be seen by
+  examining the DOM tree as it stands just after the
+  <code><a href=#the-table-element>table</a></code> element's start tag has been seen:</p>
+
+  <ul class=domTree><li class=t1><code><a href=#the-html-element>html</a></code><ul><li class=t1><code><a href=#the-head-element>head</a></code><li class=t1><code><a href=#the-body-element>body</a></code><ul><li class=t1><code><a href=#the-table-element>table</a></code></ul></ul></ul><p>...and then immediately after the <code><a href=#the-b-element>b</a></code> element start
+  tag has been seen:</p>
+
+  <ul class=domTree><li class=t1><code><a href=#the-html-element>html</a></code><ul><li class=t1><code><a href=#the-head-element>head</a></code><li class=t1><code><a href=#the-body-element>body</a></code><ul><li class=t1><code><a href=#the-b-element>b</a></code><li class=t1><code><a href=#the-table-element>table</a></code></ul></ul></ul><p>At this point, the <a href=#stack-of-open-elements>stack of open elements</a> has on it
+  the elements <code><a href=#the-html-element>html</a></code>, <code><a href=#the-body-element>body</a></code>,
+  <code><a href=#the-table-element>table</a></code>, and <code><a href=#the-b-element>b</a></code> (in that order, despite the
+  resulting DOM tree); the <a href=#list-of-active-formatting-elements>list of active formatting
+  elements</a> just has the <code><a href=#the-b-element>b</a></code> element in it; the
+  <a href=#insertion-mode>insertion mode</a> is "<a href=#parsing-main-intable title="insertion mode: in
+  table">in table</a>"; and the <code><a href=#the-table-element>table</a></code> element is
+  <a href=#tainted>tainted</a>.</p>
+
+  <p>The <code><a href=#the-tr-element>tr</a></code> start tag causes the <code><a href=#the-b-element>b</a></code> element
+  to be popped off the stack and a <code><a href=#the-tbody-element>tbody</a></code> start tag to be
+  implied; the <code><a href=#the-tbody-element>tbody</a></code> and <code><a href=#the-tr-element>tr</a></code> elements are
+  then handled in a rather straightforward manner, taking the parser
+  through the "<a href=#parsing-main-intbody title="insertion mode: in table body">in table
+  body</a>" and "<a href=#parsing-main-intr title="insertion mode: in row">in
+  row</a>" insertion modes, after which the DOM looks as
+  follows:</p>
+
+  <ul class=domTree><li class=t1><code><a href=#the-html-element>html</a></code><ul><li class=t1><code><a href=#the-head-element>head</a></code><li class=t1><code><a href=#the-body-element>body</a></code><ul><li class=t1><code><a href=#the-b-element>b</a></code><li class=t1><code><a href=#the-table-element>table</a></code><ul><li class=t1><code><a href=#the-tbody-element>tbody</a></code><ul><li class=t1><code><a href=#the-tr-element>tr</a></code></ul></ul></ul></ul></ul><p>Here, the <a href=#stack-of-open-elements>stack of open elements</a> has on it the
+  elements <code><a href=#the-html-element>html</a></code>, <code><a href=#the-body-element>body</a></code>, <code><a href=#the-table-element>table</a></code>,
+  <code><a href=#the-tbody-element>tbody</a></code>, and <code><a href=#the-tr-element>tr</a></code>; the <a href=#list-of-active-formatting-elements>list of active
+  formatting elements</a> still has the <code><a href=#the-b-element>b</a></code> element in
+  it; the <a href=#insertion-mode>insertion mode</a> is "<a href=#parsing-main-intr title="insertion mode:
+  in row">in row</a>"; and the <code><a href=#the-table-element>table</a></code> element is still
+  <a href=#tainted>tainted</a>.</p>
+
+  <p>The <code><a href=#the-td-element>td</a></code> element start tag token, after putting a
+  <code><a href=#the-td-element>td</a></code> element on the tree, puts a marker on the <a href=#list-of-active-formatting-elements>list
+  of active formatting elements</a> (it also switches to the "<a href=#parsing-main-intd title="insertion mode: in cell">in cell</a>" <a href=#insertion-mode>insertion
+  mode</a>).</p>
+
+  <ul class=domTree><li class=t1><code><a href=#the-html-element>html</a></code><ul><li class=t1><code><a href=#the-head-element>head</a></code><li class=t1><code><a href=#the-body-element>body</a></code><ul><li class=t1><code><a href=#the-b-element>b</a></code><li class=t1><code><a href=#the-table-element>table</a></code><ul><li class=t1><code><a href=#the-tbody-element>tbody</a></code><ul><li class=t1><code><a href=#the-tr-element>tr</a></code><ul><li class=t1><code><a href=#the-td-element>td</a></code></ul></ul></ul></ul></ul></ul><p>The marker means that when the "aaa" character tokens are seen,
+  no <code><a href=#the-b-element>b</a></code> element is created to hold the resulting text
+  node:</p>
+
+  <ul class=domTree><li class=t1><code><a href=#the-html-element>html</a></code><ul><li class=t1><code><a href=#the-head-element>head</a></code><li class=t1><code><a href=#the-body-element>body</a></code><ul><li class=t1><code><a href=#the-b-element>b</a></code><li class=t1><code><a href=#the-table-element>table</a></code><ul><li class=t1><code><a href=#the-tbody-element>tbody</a></code><ul><li class=t1><code><a href=#the-tr-element>tr</a></code><ul><li class=t1><code><a href=#the-td-element>td</a></code><ul><li class=t3><code>#text</code>: <span title="">aaa</span></ul></ul></ul></ul></ul></ul></ul><p>The end tags are handled in a straightforward manner; after
+  handling them, the <a href=#stack-of-open-elements>stack of open elements</a> has on it the
+  elements <code><a href=#the-html-element>html</a></code>, <code><a href=#the-body-element>body</a></code>, <code><a href=#the-table-element>table</a></code>,
+  and <code><a href=#the-tbody-element>tbody</a></code>; the <a href=#list-of-active-formatting-elements>list of active formatting
+  elements</a> still has the <code><a href=#the-b-element>b</a></code> element in it (the
+  marker having been removed by the "td" end tag token); the
+  <a href=#insertion-mode>insertion mode</a> is "<a href=#parsing-main-intbody title="insertion mode: in
+  table body">in table body</a>"; and the <code><a href=#the-table-element>table</a></code>
+  element is still <a href=#tainted>tainted</a>.</p>
+
+  <p>Thus it is that the "bbb" character tokens are found. When <a href=#reconstruct-the-active-formatting-elements title="reconstruct the active formatting elements">the active
+  formatting elements are reconstructed</a>, a <code><a href=#the-b-element>b</a></code>
+  element is created and <a href=#foster-parent title="foster parent">foster
+  parented</a>, and then the "bbb" text node is appended to it:</p>
+
+  <ul class=domTree><li class=t1><code><a href=#the-html-element>html</a></code><ul><li class=t1><code><a href=#the-head-element>head</a></code><li class=t1><code><a href=#the-body-element>body</a></code><ul><li class=t1><code><a href=#the-b-element>b</a></code><li class=t1><code><a href=#the-b-element>b</a></code><ul><li class=t3><code>#text</code>: <span title="">bbb</span></ul><li class=t1><code><a href=#the-table-element>table</a></code><ul><li class=t1><code><a href=#the-tbody-element>tbody</a></code><ul><li class=t1><code><a href=#the-tr-element>tr</a></code><ul><li class=t1><code><a href=#the-td-element>td</a></code><ul><li class=t3><code>#text</code>: <span title="">aaa</span></ul></ul></ul></ul></ul></ul></ul><p>The <a href=#stack-of-open-elements>stack of open elements</a> has on it the elements
+  <code><a href=#the-html-element>html</a></code>, <code><a href=#the-body-element>body</a></code>, <code><a href=#the-table-element>table</a></code>,
+  <code><a href=#the-tbody-element>tbody</a></code>, and the new <code><a href=#the-b-element>b</a></code> (again, note that
+  this doesn't match the resulting tree!); the <a href=#list-of-active-formatting-elements>list of active
+  formatting elements</a> has the new <code><a href=#the-b-element>b</a></code> element in it;
+  the <a href=#insertion-mode>insertion mode</a> is still "<a href=#parsing-main-intbody title="insertion
+  mode: in table body">in table body</a>"; and the
+  <code><a href=#the-table-element>table</a></code> element is still <a href=#tainted>tainted</a>.</p>
+
+  <p>Had the character tokens been <a href=#space-character title="space character">space
+  characters</a> instead of "bbb", the result would have been the
+  same, but only because the table is <a href=#tainted>tainted</a>. Had the
+  <code><a href=#the-b-element>b</a></code> element's start tag been before the
+  <code><a href=#the-table-element>table</a></code> instead of after, then the table wouldn't have
+  been <a href=#tainted>tainted</a> and such <a href=#space-character title="space
+  character">space characters</a> would just be appended to the
+  <code><a href=#the-tbody-element>tbody</a></code> element.</p>
+
+  <p>Finally, the <code><a href=#the-table-element>table</a></code> is closed by a "table" end
+  tag. This pops all the nodes from the <a href=#stack-of-open-elements>stack of open
+  elements</a> up to and including the <code><a href=#the-table-element>table</a></code> element,
+  but it doesn't affect the <a href=#list-of-active-formatting-elements>list of active formatting
+  elements</a>, so the "ccc" character tokens after the table
+  result in yet another <code><a href=#the-b-element>b</a></code> element being created, this
+  time after the table:</p>
+
+  <ul class=domTree><li class=t1><code><a href=#the-html-element>html</a></code><ul><li class=t1><code><a href=#the-head-element>head</a></code><li class=t1><code><a href=#the-body-element>body</a></code><ul><li class=t1><code><a href=#the-b-element>b</a></code><li class=t1><code><a href=#the-b-element>b</a></code><ul><li class=t3><code>#text</code>: <span title="">bbb</span></ul><li class=t1><code><a href=#the-table-element>table</a></code><ul><li class=t1><code><a href=#the-tbody-element>tbody</a></code><ul><li class=t1><code><a href=#the-tr-element>tr</a></code><ul><li class=t1><code><a href=#the-td-element>td</a></code><ul><li class=t3><code>#text</code>: <span title="">aaa</span></ul></ul></ul></ul><li class=t1><code><a href=#the-b-element>b</a></code><ul><li class=t3><code>#text</code>: <span title="">ccc</span></ul></ul></ul></ul><h3 id=namespaces><span class=secno>9.3 </span>Namespaces</h3>
+
   <p>The <dfn id=html-namespace-0>HTML namespace</dfn> is: <code>http://www.w3.org/1999/xhtml</code></p>
 
   <p>The <dfn id=mathml-namespace>MathML namespace</dfn> is: <code>http://www.w3.org/1998/Math/MathML</code></p>
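
[Editorial illustration, not part of r3304.] The walkthrough added in 9.2.8.1 can be approximated with a small toy model. The sketch below is only an assumption-laden illustration, not the parser algorithm from the specification: the names Node, parse and FORMATTING and the (kind, value) token tuples are hypothetical, the tree is rooted at body (html and head are omitted), and only the simple adoption agency case with no furthest block is handled. It shows how the stack of open elements and the list of active formatting elements interact when the "b" end tag arrives before the "i" end tag.

```python
FORMATTING = {"b", "i"}  # the only formatting elements this toy knows about


class Node:
    def __init__(self, name, text=None):
        self.name = name
        self.text = text
        self.children = []

    def dump(self):
        # Serialize the subtree so the resulting DOM shape is easy to compare.
        if self.name == "#text":
            return self.text
        inner = "".join(child.dump() for child in self.children)
        return "<%s>%s</%s>" % (self.name, inner, self.name)


def parse(tokens):
    root = Node("body")
    stack = [root]   # toy "stack of open elements"
    active = []      # toy "list of active formatting elements"

    def insert_element(name):
        node = Node(name)
        stack[-1].children.append(node)
        stack.append(node)
        return node

    def reconstruct():
        # Re-open trailing formatting entries that are no longer on the stack.
        i = len(active)
        while i > 0 and active[i - 1] not in stack:
            i -= 1
        for j in range(i, len(active)):
            active[j] = insert_element(active[j].name)

    for kind, value in tokens:
        if kind == "text":
            reconstruct()
            stack[-1].children.append(Node("#text", value))
        elif kind == "start":
            if value in FORMATTING:
                reconstruct()
                active.append(insert_element(value))
            else:
                insert_element(value)
        elif kind == "end" and value in FORMATTING:
            # Simple case only: assumes there is no furthest block.
            entry = next((n for n in reversed(active) if n.name == value), None)
            if entry is None:
                continue  # stray end tag: ignored in this toy
            if entry in stack:
                while stack.pop() is not entry:
                    pass
            active.remove(entry)
        elif kind == "end":
            # Pop up to and including the matching open element, if any.
            if any(n.name == value for n in stack[1:]):
                while stack.pop().name != value:
                    pass
    return root


# Token stream for the markup: <p>1<b>2<i>3</b>4</i>5</p>
tokens = [("start", "p"), ("text", "1"), ("start", "b"), ("text", "2"),
          ("start", "i"), ("text", "3"), ("end", "b"), ("text", "4"),
          ("end", "i"), ("text", "5"), ("end", "p")]
print(parse(tokens).dump())
# prints <body><p>1<b>2<i>3</i></b><i>4</i>5</p></body>
```

The printed tree matches the final DOM shown in 9.2.8.1: the "4" ends up in a second i element created by reconstruction. The second example (with a p as the furthest block) and the table example would need the full adoption agency algorithm and foster parenting, which this sketch deliberately leaves out.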

Modified: source
===================================================================
--- source	2009-06-17 07:12:07 UTC (rev 3303)
+++ source	2009-06-23 01:33:39 UTC (rev 3304)
@@ -71813,6 +71813,7 @@
   pause flag</dfn>, which must be initially set to false.</p>
 
 
+
   <h4>The <dfn>input stream</dfn></h4>
 
   <p>The stream of Unicode characters that comprises the input to the
@@ -72783,9 +72784,14 @@
   category, and scope markers. The scope markers are inserted when
   entering <code>applet</code> elements, buttons, <code>object</code>
   elements, marquees, table cells, and table captions, and are used to
-  prevent formatting from "leaking" into <code>applet</code> elements,
-  buttons, <code>object</code> elements, marquees, and tables.</p>
+  prevent formatting from "leaking" <em>into</em> <code>applet</code>
+  elements, buttons, <code>object</code> elements, marquees, and
+  tables.</p>
 
+  <p class="note">The scope markers are unrelated to the concept of an
+  element being <span title="has an element in scope">in
+  scope</span>.</p>
+
   <p>In addition, each element in the <span>list of active formatting
   elements</span> is associated with the token for which it was
   created, so that further elements can be created for that token if
@@ -74806,9 +74812,9 @@
   must be inserted into the <i>foster parent element</i>, and the
   <span>current table</span> must be marked as
   <dfn>tainted</dfn>. (Once the <span>current table</span> has been
-  <span>tainted</span>, whitespace characters are inserted into the
-  <i>foster parent element</i> instead of the <span>current
-  node</span>.)</p>
+  <span>tainted</span>, <span title="space character">space
+  characters</span> are inserted into the <i>foster parent element</i>
+  instead of the <span>current node</span>.)</p>
 
   <p>The <dfn>foster parent element</dfn> is the parent element of the
   last <code>table</code> element in the <span>stack of open
@@ -78546,6 +78552,233 @@
 
 
 
+  <h4>An introduction to error handling in the parser</h4>
+
+  <p><em>This section is non-normative.</em></p>
+
+  <p>This section examines some erroneous markup and discusses how
+  the <span>HTML parser</span> handles these cases.</p>
+
+
+  <h5>Misnested tags: &lt;b&gt;&lt;i&gt;&lt;/b&gt;&lt;/i&gt;</h5>
+
+  <p><em>This section is non-normative.</em></p>
+
+  <p>The most often discussed example of erroneous markup is as
+  follows:</p>
+
+  <pre>&lt;p&gt;1&lt;b&gt;2&lt;i&gt;3&lt;/b&gt;4&lt;/i&gt;5&lt;/p&gt;</pre>
+
+  <p>The parsing of this markup is straightforward up to the "3". At
+  this point, the DOM looks like this:</p>
+
+  <ul class="domTree"><li class="t1"><code>html</code><ul><li class="t1"><code>head</code></li><li class="t1"><code>body</code><ul><li class="t1"><code>p</code><ul><li class="t3"><code>#text</code>: <span title="">1</span></li><li class="t1"><code>b</code><ul><li class="t3"><code>#text</code>: <span title="">2</span></li><li class="t1"><code>i</code><ul><li class="t3"><code>#text</code>: <span title="">3</span></li></ul></li></ul></li></ul></li></ul></li></ul></li></ul>
+
+  <p>Here, the <span>stack of open elements</span> has five elements
+  on it: <code>html</code>, <code>body</code>, <code>p</code>,
+  <code>b</code>, and <code>i</code>. The <span>list of active
+  formatting elements</span> just has two: <code>b</code> and
+  <code>i</code>. The <span>insertion mode</span> is "<span
+  title="insertion mode: in body">in body</span>".</p>
+
+  <p>Upon receiving the end tag token with the tag name "b", the "<a
+  href="#adoptionAgency">adoption agency algorithm</a>" is
+  invoked. This is a simple case, in that the <var title="">formatting
+  element</var> is the <code>b</code> element, and there is no
+  <var title="">furthest block</var>. Thus, the <span>stack of open
+  elements</span> ends up with just three elements: <code>html</code>,
+  <code>body</code>, and <code>p</code>, while the <span>list of
+  active formatting elements</span> has just one: <code>i</code>. The
+  DOM tree is unmodified at this point.</p>
+
+  <p>The next token, a character ("4"), triggers the <span
+  title="reconstruct the active formatting elements">reconstruction of
+  the active formatting elements</span>, in this case just the
+  <code>i</code> element. A new <code>i</code> element is thus created
+  for the "4" text node. After the end tag token for the "i" is also
+  received, and the "5" text node is inserted, the DOM looks as
+  follows:</p>
+
+  <ul class="domTree"><li class="t1"><code>html</code><ul><li class="t1"><code>head</code></li><li class="t1"><code>body</code><ul><li class="t1"><code>p</code><ul><li class="t3"><code>#text</code>: <span title="">1</span></li><li class="t1"><code>b</code><ul><li class="t3"><code>#text</code>: <span title="">2</span></li><li class="t1"><code>i</code><ul><li class="t3"><code>#text</code>: <span title="">3</span></li></ul></li></ul></li><li class="t1"><code>i</code><ul><li class="t3"><code>#text</code>: <span title="">4</span></li></ul></li><li class="t3"><code>#text</code>: <span title="">5</span></li></ul></li></ul></li></ul></li></ul>
+
+
+  <h5>Misnested tags: &lt;b&gt;&lt;p&gt;&lt;/b&gt;&lt;/p&gt;</h5>
+
+  <p><em>This section is non-normative.</em></p>
+
+  <p>A case similar to the previous one is the following:</p>
+
+  <pre>&lt;b&gt;1&lt;p&gt;2&lt;/b&gt;3&lt;/p&gt;</pre>
+
+  <p>Up to the "2" the parsing here is straightforward:</p>
+
+  <ul class="domTree"><li class="t1"><code>html</code><ul><li class="t1"><code>head</code></li><li class="t1"><code>body</code><ul><li class="t1"><code>b</code><ul><li class="t3"><code>#text</code>: <span title="">1</span></li><li class="t1"><code>p</code><ul><li class="t3"><code>#text</code>: <span title="">2</span></li></ul></li></ul></li></ul></li></ul></li></ul>
+
+  <p>The interesting part is when the end tag token with the tag name
+  "b" is parsed.</p>
+
+  <p>Before that token is seen, the <span>stack of open
+  elements</span> has four elements on it: <code>html</code>,
+  <code>body</code>, <code>b</code>, and <code>p</code>. The
+  <span>list of active formatting elements</span> just has the one:
+  <code>b</code>. The <span>insertion mode</span> is "<span
+  title="insertion mode: in body">in body</span>".</p>
+
+  <p>Upon receiving the end tag token with the tag name "b", the "<a
+  href="#adoptionAgency">adoption agency algorithm</a>" is invoked, as
+  in the previous example. However, in this case, there <em>is</em> a
+  <var title="">furthest block</var>, namely the <code>p</code> element. Thus,
+  this time the adoption agency algorithm isn't skipped over.</p>
+
+  <p>The <var title="">common ancestor</var> is the <code>body</code>
+  element. A conceptual "bookmark" marks the position of the
+  <code>b</code> in the <span>list of active formatting
+  elements</span>, but since that list has only one element in it,
+  the bookmark won't have much effect.</p>
+
+  <p>As the algorithm progresses, <var title="">node</var> ends up set
+  to the formatting element (<code>b</code>), and <var title="">last
+  node</var> ends up set to the <var title="">furthest block</var>
+  (<code>p</code>).</p>
+
+  <p>The <var title="">last node</var> gets appended (moved) to the
+  <var title="">common ancestor</var>, so that the DOM looks like:</p>
+
+  <ul class="domTree"><li class="t1"><code>html</code><ul><li class="t1"><code>head</code></li><li class="t1"><code>body</code><ul><li class="t1"><code>b</code><ul><li class="t3"><code>#text</code>: <span title="">1</span></li></ul></li><li class="t1"><code>p</code><ul><li class="t3"><code>#text</code>: <span title="">2</span></li></ul></li></ul></li></ul></li></ul>
+
+  <p>A new <code>b</code> element is created, and the children of the
+  <code>p</code> element are moved to it:</p>
+
+  <ul class="domTree"><li class="t1"><code>html</code><ul><li class="t1"><code>head</code></li><li class="t1"><code>body</code><ul><li class="t1"><code>b</code><ul><li class="t3"><code>#text</code>: <span title="">1</span></li></ul></li><li class="t1"><code>p</code></li></ul></li></ul></li></ul>
+  <ul class="domTree"><li class="t1"><code>b</code><ul><li class="t3"><code>#text</code>: <span title="">2</span></li></ul></li></ul>
+
+  <p>Finally, the new <code>b</code> element is appended to the
+  <code>p</code> element, so that the DOM looks like:</p>
+
+  <ul class="domTree"><li class="t1"><code>html</code><ul><li class="t1"><code>head</code></li><li class="t1"><code>body</code><ul><li class="t1"><code>b</code><ul><li class="t3"><code>#text</code>: <span title="">1</span></li></ul></li><li class="t1"><code>p</code><ul><li class="t1"><code>b</code><ul><li class="t3"><code>#text</code>: <span title="">2</span></li></ul></li></ul></li></ul></li></ul></li></ul>
+
+  <p>The <code>b</code> element is removed from the <span>list of
+  active formatting elements</span> and the <span>stack of open
+  elements</span>, so that when the "3" is parsed, it is appended to
+  the <code>p</code> element:</p>
+
+  <ul class="domTree"><li class="t1"><code>html</code><ul><li class="t1"><code>head</code></li><li class="t1"><code>body</code><ul><li class="t1"><code>b</code><ul><li class="t3"><code>#text</code>: <span title="">1</span></li></ul></li><li class="t1"><code>p</code><ul><li class="t1"><code>b</code><ul><li class="t3"><code>#text</code>: <span title="">2</span></li></ul></li><li class="t3"><code>#text</code>: <span title="">3</span></li></ul></li></ul></li></ul></li></ul>
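The walkthrough above can be replayed as a small sketch. It is not the normative adoption agency algorithm: it follows only the steps the prose lists for this one input, and the Node class and its append/dump helpers are hypothetical names introduced for illustration.

```python
class Node:
    """Minimal DOM node: an element (name only) or a text node (name "#text")."""

    def __init__(self, name, text=None):
        self.name, self.text = name, text
        self.parent, self.children = None, []

    def append(self, child):
        # Appending a node that already has a parent moves it, as in the DOM.
        if child.parent is not None:
            child.parent.children.remove(child)
        child.parent = self
        self.children.append(child)
        return child

    def dump(self):
        if self.text is not None:
            return '"%s"' % self.text
        return "%s(%s)" % (self.name, " ".join(c.dump() for c in self.children))


def text(data):
    return Node("#text", data)


# State just before the "b" end tag token: html > body > b > ("1", p > "2").
html, head, body = Node("html"), Node("head"), Node("body")
html.append(head)
html.append(body)
b = body.append(Node("b"))
b.append(text("1"))
p = b.append(Node("p"))
p.append(text("2"))
stack = [html, body, b, p]   # stack of open elements
formatting = [b]             # list of active formatting elements

# The steps described in the prose, specialised to this input.
formatting_element = b
furthest_block = p
common_ancestor = stack[stack.index(formatting_element) - 1]   # body
last_node = furthest_block
common_ancestor.append(last_node)            # p moves out of b, to after it
new_b = Node("b")
for child in list(furthest_block.children):
    new_b.append(child)                      # "2" moves into the new b
furthest_block.append(new_b)
formatting.remove(formatting_element)
stack.remove(formatting_element)

stack[-1].append(text("3"))                  # the current node is now p
print(html.dump())
```

The printed structure should match the final tree shown above, with the "3" text node as the last child of the p element.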
+
+
+  <h5>Unexpected markup in tables</h5>
+
+  <p><em>This section is non-normative.</em></p>
+
+  <p>Error handling in tables is, for historical reasons, especially
+  strange. For example, consider the following markup:</p>
+
+  <pre>&lt;table&gt;<strong>&lt;b&gt;</strong>&lt;tr&gt;&lt;td&gt;aaa&lt;/td&gt;&lt;/tr&gt;<strong>bbb</strong>&lt;/table&gt;ccc</pre>
+
+  <p>The highlighted <code>b</code> element start tag is not allowed
+  directly inside a table like that, and the parser handles this case
+  by placing the element <em>before</em> the table. (This is called <i
+  title="foster parent">foster parenting</i>.) This can be seen by
+  examining the DOM tree as it stands just after the
+  <code>table</code> element's start tag has been seen:</p>
+
+  <ul class="domTree"><li class="t1"><code>html</code><ul><li class="t1"><code>head</code></li><li class="t1"><code>body</code><ul><li class="t1"><code>table</code></li></ul></li></ul></li></ul>
+
+  <p>...and then immediately after the <code>b</code> element start
+  tag has been seen:</p>
+
+  <ul class="domTree"><li class="t1"><code>html</code><ul><li class="t1"><code>head</code></li><li class="t1"><code>body</code><ul><li class="t1"><code>b</code></li><li class="t1"><code>table</code></li></ul></li></ul></li></ul>
+
+  <p>At this point, the <span>stack of open elements</span> has on it
+  the elements <code>html</code>, <code>body</code>,
+  <code>table</code>, and <code>b</code> (in that order, despite the
+  resulting DOM tree); the <span>list of active formatting
+  elements</span> just has the <code>b</code> element in it; the
+  <span>insertion mode</span> is "<span title="insertion mode: in
+  table">in table</span>"; and the <code>table</code> element is
+  <span>tainted</span>.</p>
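A minimal sketch of foster parenting as described above, assuming the common case where the table is on the stack and has a parent node. The Element class and the foster_parent function are hypothetical names, and the fallback cases (no table on the stack, or a table without a parent) are deliberately left out.

```python
class Element:
    """Minimal element: a name, a parent and an ordered list of children."""

    def __init__(self, name, parent=None):
        self.name, self.parent, self.children = name, parent, []
        if parent is not None:
            parent.children.append(self)


def foster_parent(node, stack):
    """Insert node immediately before the last table on the stack."""
    table = next(el for el in reversed(stack) if el.name == "table")
    parent = table.parent
    parent.children.insert(parent.children.index(table), node)
    node.parent = parent


# State just after the table start tag.
html = Element("html")
head = Element("head", html)
body = Element("body", html)
table = Element("table", body)
stack = [html, body, table]

# The b start tag: the element lands before the table in the tree,
# but is still pushed onto the stack above the table.
b = Element("b")
foster_parent(b, stack)
stack.append(b)

print([child.name for child in body.children])
print([el.name for el in stack])
```

The two printed lists differ in order, which is the mismatch between the tree and the stack of open elements that the paragraph above points out.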
+
+  <p>The <code>tr</code> start tag causes the <code>b</code> element
+  to be popped off the stack and a <code>tbody</code> start tag to be
+  implied; the <code>tbody</code> and <code>tr</code> elements are
+  then handled in a rather straightforward manner, taking the parser
+  through the "<span title="insertion mode: in table body">in table
+  body</span>" and "<span title="insertion mode: in row">in
+  row</span>" insertion modes, after which the DOM looks as
+  follows:</p>
+
+  <ul class="domTree"><li class="t1"><code>html</code><ul><li class="t1"><code>head</code></li><li class="t1"><code>body</code><ul><li class="t1"><code>b</code></li><li class="t1"><code>table</code><ul><li class="t1"><code>tbody</code><ul><li class="t1"><code>tr</code></li></ul></li></ul></li></ul></li></ul></li></ul>
+
+  <p>Here, the <span>stack of open elements</span> has on it the
+  elements <code>html</code>, <code>body</code>, <code>table</code>,
+  <code>tbody</code>, and <code>tr</code>; the <span>list of active
+  formatting elements</span> still has the <code>b</code> element in
+  it; the <span>insertion mode</span> is "<span title="insertion mode:
+  in row">in row</span>"; and the <code>table</code> element is still
+  <span>tainted</span>.</p>
+
+  <p>The <code>td</code> element start tag token, after putting a
+  <code>td</code> element on the tree, puts a marker on the <span>list
+  of active formatting elements</span> (it also switches to the "<span
+  title="insertion mode: in cell">in cell</span>" <span>insertion
+  mode</span>).</p>
+
+  <ul class="domTree"><li class="t1"><code>html</code><ul><li class="t1"><code>head</code></li><li class="t1"><code>body</code><ul><li class="t1"><code>b</code></li><li class="t1"><code>table</code><ul><li class="t1"><code>tbody</code><ul><li class="t1"><code>tr</code><ul><li class="t1"><code>td</code></li></ul></li></ul></li></ul></li></ul></li></ul></li></ul>
+
+  <p>The marker means that when the "aaa" character tokens are seen,
+  no <code>b</code> element is created to hold the resulting text
+  node:</p>
+
+  <ul class="domTree"><li class="t1"><code>html</code><ul><li class="t1"><code>head</code></li><li class="t1"><code>body</code><ul><li class="t1"><code>b</code></li><li class="t1"><code>table</code><ul><li class="t1"><code>tbody</code><ul><li class="t1"><code>tr</code><ul><li class="t1"><code>td</code><ul><li class="t3"><code>#text</code>: <span title="">aaa</span></li></ul></li></ul></li></ul></li></ul></li></ul></li></ul></li></ul>
+
+  <p>The end tags are handled in a straightforward manner; after
+  handling them, the <span>stack of open elements</span> has on it the
+  elements <code>html</code>, <code>body</code>, <code>table</code>,
+  and <code>tbody</code>; the <span>list of active formatting
+  elements</span> still has the <code>b</code> element in it (the
+  marker having been removed by the "td" end tag token); the
+  <span>insertion mode</span> is "<span title="insertion mode: in
+  table body">in table body</span>"; and the <code>table</code>
+  element is still <span>tainted</span>.</p>
+
+  <p>It is in this state that the "bbb" character tokens are seen. When <span
+  title="reconstruct the active formatting elements">the active
+  formatting elements are reconstructed</span>, a <code>b</code>
+  element is created and <span title="foster parent">foster
+  parented</span>, and then the "bbb" text node is appended to it:</p>
+
+  <ul class="domTree"><li class="t1"><code>html</code><ul><li class="t1"><code>head</code></li><li class="t1"><code>body</code><ul><li class="t1"><code>b</code></li><li class="t1"><code>b</code><ul><li class="t3"><code>#text</code>: <span title="">bbb</span></li></ul></li><li class="t1"><code>table</code><ul><li class="t1"><code>tbody</code><ul><li class="t1"><code>tr</code><ul><li class="t1"><code>td</code><ul><li class="t3"><code>#text</code>: <span title="">aaa</span></li></ul></li></ul></li></ul></li></ul></li></ul></li></ul></li></ul>
+
+  <p>The <span>stack of open elements</span> has on it the elements
+  <code>html</code>, <code>body</code>, <code>table</code>,
+  <code>tbody</code>, and the new <code>b</code> (again, note that
+  this doesn't match the resulting tree!); the <span>list of active
+  formatting elements</span> has the new <code>b</code> element in it;
+  the <span>insertion mode</span> is still "<span title="insertion
+  mode: in table body">in table body</span>"; and the
+  <code>table</code> element is still <span>tainted</span>.</p>
+
+  <p>Had the character tokens been <span title="space character">space
+  characters</span> instead of "bbb", the result would have been the
+  same, but only because the table is <span>tainted</span>. Had the
+  <code>b</code> element's start tag come before the
+  <code>table</code> element's start tag instead of after it, the table
+  wouldn't have been <span>tainted</span>, and such <span title="space
+  character">space characters</span> would just be appended to the
+  <code>tbody</code> element.</p>
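The rule in the paragraph above can be condensed into a small decision function. This is only a sketch of the behaviour the prose describes: table_text_destination and its return strings are hypothetical, and a real parser handles character tokens one at a time rather than as a run.

```python
# The five HTML space characters: tab, line feed, form feed,
# carriage return and space.
SPACE_CHARACTERS = set("\t\n\f\r ")


def table_text_destination(data, tainted):
    """Say where text seen directly inside table structure ends up."""
    only_spaces = all(ch in SPACE_CHARACTERS for ch in data)
    if only_spaces and not tainted:
        return "current node"    # e.g. appended to the tbody element
    return "foster parent"       # handled like "bbb": placed before the table


print(table_text_destination("bbb", tainted=True))
print(table_text_destination("  \n", tainted=True))
print(table_text_destination("  \n", tainted=False))
```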
+
+  <p>Finally, the <code>table</code> is closed by a "table" end
+  tag. This pops all the nodes from the <span>stack of open
+  elements</span> up to and including the <code>table</code> element,
+  but it doesn't affect the <span>list of active formatting
+  elements</span>, so the "ccc" character tokens after the table
+  result in yet another <code>b</code> element being created, this
+  time after the table:</p>
+
+  <ul class="domTree"><li class="t1"><code>html</code><ul><li class="t1"><code>head</code></li><li class="t1"><code>body</code><ul><li class="t1"><code>b</code></li><li class="t1"><code>b</code><ul><li class="t3"><code>#text</code>: <span title="">bbb</span></li></ul></li><li class="t1"><code>table</code><ul><li class="t1"><code>tbody</code><ul><li class="t1"><code>tr</code><ul><li class="t1"><code>td</code><ul><li class="t3"><code>#text</code>: <span title="">aaa</span></li></ul></li></ul></li></ul></li></ul></li><li class="t1"><code>b</code><ul><li class="t3"><code>#text</code>: <span title="">ccc</span></li></ul></li></ul></li></ul></li></ul>
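The way the list of active formatting elements outlives the stack of open elements in this example can be sketched as follows. The sketch tracks only the two lists, not the tree, so foster parenting is left out. Element, MARKER, reconstruct and names are hypothetical, and reconstruct is a simplified reading of "reconstruct the active formatting elements", not the normative steps.

```python
MARKER = object()   # scope marker, as pushed by the td start tag


class Element:
    def __init__(self, name):
        self.name = name


def reconstruct(formatting, stack):
    """Re-open listed formatting elements that are no longer on the stack.

    Returns the newly created elements; inserting them into the tree
    is outside the scope of this sketch.
    """
    created = []
    if not formatting or formatting[-1] is MARKER or formatting[-1] in stack:
        return created
    # Rewind to the earliest entry after the last marker or still-open element.
    i = len(formatting) - 1
    while i > 0 and formatting[i - 1] is not MARKER and formatting[i - 1] not in stack:
        i -= 1
    # Advance, replacing each entry with a freshly created element.
    for j in range(i, len(formatting)):
        clone = Element(formatting[j].name)
        stack.append(clone)
        formatting[j] = clone
        created.append(clone)
    return created


def names(items):
    return [el.name for el in items]


html, body, table, tbody, tr, td, b = map(
    Element, ["html", "body", "table", "tbody", "tr", "td", "b"])

# "aaa": inside the cell, the marker shields the b entry.
stack = [html, body, table, tbody, tr, td]
formatting = [b, MARKER]
at_aaa = reconstruct(formatting, stack)

# "td" and "tr" end tags: pop the elements and clear the list back to the marker.
del stack[stack.index(tr):]
formatting.pop()
at_bbb = reconstruct(formatting, stack)
stack_at_bbb = names(stack)

# "table" end tag: pop up to and including table; the list is left alone.
del stack[stack.index(table):]
at_ccc = reconstruct(formatting, stack)

print(names(at_aaa), names(at_bbb), stack_at_bbb)
print(names(at_ccc), names(stack))
```

No element is created for "aaa", while "bbb" and "ccc" each get their own new b element, matching the three b elements in the final tree.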
+
+
+
+
   <h3>Namespaces</h3>
 
   <p>The <dfn>HTML namespace</dfn> is: <code>http://www.w3.org/1999/xhtml</code></p>



