[html5] r1907 - [t] (0) Provide a way to mutate the DOM into an infoset. (Bug 5808) (credit: hs)
whatwg at whatwg.org
Tue Jul 22 19:02:21 PDT 2008
Author: ianh
Date: 2008-07-22 19:02:20 -0700 (Tue, 22 Jul 2008)
New Revision: 1907
Modified:
index
source
Log:
[t] (0) Provide a way to mutate the DOM into an infoset. (Bug 5808) (credit: hs)
Modified: index
===================================================================
--- index 2008-07-23 01:04:16 UTC (rev 1906)
+++ index 2008-07-23 02:02:20 UTC (rev 1907)
@@ -1947,6 +1947,9 @@
</ul>
<li><a href="#the-end"><span class=secno>8.2.6 </span>The end</a>
+
+ <li><a href="#coercing"><span class=secno>8.2.7 </span>Coercing an
+ HTML DOM into an infoset</a>
</ul>
<li><a href="#namespaces"><span class=secno>8.3 </span>Namespaces</a>
@@ -50998,6 +51001,130 @@
/parser/htmlparser/src/nsElementTable.cpp, line 1901 - // Ex: <H1><LI><H1><LI>. Inner LI has the potential of getting nested
-->
+ <h4 id=coercing><span class=secno>8.2.7 </span>Coercing an HTML DOM into an
+ infoset</h4>
+
+ <p>When an application uses an <a href="#html-0">HTML parser</a> in
+ conjunction with an XML pipeline, it is possible that the constructed DOM
+ is not compatible with the XML toolchain in certain subtle ways. For
+ example, an XML toolchain might not be able to represent attributes with
+ the name <code title="">xmlns</code>, since they conflict with the
+ Namespaces in XML syntax. <a href="#refsXMLNS">[XMLNS]</a>
+
+ <p>There is also some data that the <a href="#html-0">HTML parser</a>
+ generates that isn't included in the DOM itself.
+
+ <p>So that tools can apply a consistent set of adjustments to the output of
+ their <a href="#html-0">HTML parser</a> and remain compatible with
+ the rest of their XML toolchain, this section documents a set of mutations
+ and conventions that will convert the output of the <a href="#html-0">HTML
+ parser</a> for any arbitrary input into an XML Infoset that doesn't have
+ any problematic characteristics.
+
+ <p>Tools that cannot convey the out-of-band information using out-of-band
+ mechanisms, or that cannot convey the DOM exactly as prescribed by this
+ specification, may either ignore the offending information or DOM feature,
+ or may represent it internally in the DOM using the conventions described
+ below.
+
+ <p>These conventions are not conforming HTML, and user agents must not
+ output such syntax outside of their XML pipeline.
+
+ <dl>
+ <dt>The <code>DocumentType</code> node's <code title="">name</code>, <code
+ title="">publicId</code>, and <code title="">systemId</code> attributes
+
+ <dd>If the XML API being used doesn't support DOCTYPEs, tools may drop
+ DOCTYPEs altogether or create a set of three attributes on the root
+ element, named <code title="">__doctype_name__</code>, <code
+ title="">__doctype_publicid__</code>, and <code
+ title="">__doctype_systemid__</code>, respectively, whose values are the
+ values that would have been put on the <code>DocumentType</code> node.
+
+ <dt>The document being set to <i><a href="#no-quirks">no quirks
+ mode</a></i>, <i><a href="#limited1">limited quirks mode</a></i>, or
+ <i><a href="#quirks">quirks mode</a></i>
+
+ <dd>To convey this information, create an attribute <code
+ title="">__mode__</code> on the root element, with values "noquirks",
+ "limitedquirks", or "quirks" respectively.
+
+ <dt>Elements that have a namespace without appropriate <code
+ title="">xmlns</code> attributes being in scope
+
+ <dd>Construct the DOM as if appropriate namespace declarations were in
+ scope.
+
+ <dt>Elements whose names contain U+003A COLON (:) characters or characters
+ that cannot be represented in XML element names
+
+ <dd>Drop the element and all its children, or replace any offending
+ characters with a U+005F LOW LINE (_) character.
+
+ <dt>Attributes named <code title="">xmlns</code> whose values match the
+ namespace of the element node
+
+ <dd>Construct the DOM as if these were default namespace declarations.
+
+ <dt>Attributes named <code title="">xmlns:xlink</code> whose values match
+ the <a href="#xlink">XLink namespace</a>, on elements whose namespace is
+ not the <a href="#html-namespace0">HTML namespace</a>
+
+ <dd>Construct the DOM as if these were namespace prefix declarations.
+
+ <dt>Other attributes whose names are <code title="">xmlns</code> or start
+ with <code title="">xmlns:</code>
+
+ <dd>Drop the attributes or add two U+005F LOW LINE (_) characters to the
+ start of the attributes' names and replace any U+003A COLON (:)
+ characters with a U+005F LOW LINE (_) character.
+
+ <dt>Other attributes in no namespace whose names contain U+003A COLON (:)
+ characters
+
+ <dt>Attributes whose names contain characters that cannot be represented
+ in XML attribute names
+
+ <dd>Drop the attributes or replace any offending characters with a U+005F
+ LOW LINE (_) character, dropping any attributes where doing this would
+ cause an attribute name clash.
+
+ <dt>Form controls being associated with forms that aren't their nearest
+ ancestor (use of the <a href="#form-element"><code>form</code> element
+ pointer</a>)
+
+ <dd>Create an attribute <code title="">__formid__</code> on the form, with
+ a value unique amongst <code title="">__formid__</code> attributes in the
+ document, and create an attribute <code title="">__form__</code> on the
+ form control, whose value matches the unique identifier given to the
+ form.
+
+ <dt>Any U+000C FORM FEED (FF) character
+
+ <dd>Replace the character with a U+0020 SPACE character.
+
+ <dt>Any other literal non-XML character
+
+ <dd>Replace the character with a U+FFFD REPLACEMENT CHARACTER.
+
+ <dt>A comment that contains two adjacent U+002D HYPHEN-MINUS characters
+ (--)
+
+ <dd>Insert a U+0020 SPACE character between them.
+ </dl>
+
+ <p>Tools that use these conventions should guard against documents that
+ include markup that clashes with them by always dropping all attributes in
+ the document that start with two U+005F LOW LINE (_) characters.
+
+ <p class=note>These conventions apply <em>after</em> the <a
+ href="#html-0">HTML parser</a>'s rules have been applied. For example, a
+ <code title=""><a::></code> start tag will be closed by a <code
+ title=""></a::></code> end tag, and never by a <code
+ title=""></a__></code> end tag, even if the user agent is using the
+ rules above to then generate an actual element in the DOM with the name
+ <code title="">a__</code> for that start tag.
+
<h3 id=namespaces><span class=secno>8.3 </span>Namespaces</h3>
<p>The <dfn id=html-namespace0>HTML namespace</dfn> is:
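
The attribute rules in the new section 8.2.7 can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the commit: the function name coerce_attributes and its (name, value) list input are hypothetical, it handles only attributes in no namespace, it treats U+003A COLON as the only offending character (a full check against the XML Name production is omitted), and it always renames rather than drops where the text allows either.

```python
XLINK_NS = "http://www.w3.org/1999/xlink"
HTML_NS = "http://www.w3.org/1999/xhtml"


def coerce_attributes(attrs, element_ns):
    """Return (namespace_declarations, attributes) for one element.

    attrs is a list of (name, value) pairs for attributes in no
    namespace, as produced by the HTML parser (hypothetical input shape).
    """
    declarations = {}  # prefix -> namespace URI; "" is the default namespace
    kept = {}          # attributes that end up on the coerced element
    renamed = []       # attributes whose names had to change

    for name, value in attrs:
        if name.startswith("__"):
            # Guard: drop names that clash with the conventions themselves.
            continue
        if name == "xmlns" and value == element_ns:
            declarations[""] = value
        elif (name == "xmlns:xlink" and value == XLINK_NS
              and element_ns != HTML_NS):
            declarations["xlink"] = value
        elif name == "xmlns" or name.startswith("xmlns:"):
            renamed.append(("__" + name.replace(":", "_"), value))
        elif ":" in name:
            renamed.append((name.replace(":", "_"), value))
        else:
            kept[name] = value

    # Renamed attributes are dropped when the new name would clash.
    for name, value in renamed:
        if name not in kept:
            kept[name] = value
    return declarations, kept
```

Attributes that need no renaming are collected first so that an untouched attribute always wins over one whose coerced name would collide with it; that ordering is a design choice of this sketch, since the text only requires that the clash be resolved by dropping.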
Modified: source
===================================================================
--- source 2008-07-23 01:04:16 UTC (rev 1906)
+++ source 2008-07-23 02:02:20 UTC (rev 1907)
@@ -48089,6 +48089,149 @@
/parser/htmlparser/src/nsElementTable.cpp, line 1901 - // Ex: <H1><LI><H1><LI>. Inner LI has the potential of getting nested
-->
+
+ <h4>Coercing an HTML DOM into an infoset</h4>
+
+ <p>When an application uses an <span>HTML parser</span> in
+ conjunction with an XML pipeline, it is possible that the
+ constructed DOM is not compatible with the XML toolchain in certain
+ subtle ways. For example, an XML toolchain might not be able to
+ represent attributes with the name <code title="">xmlns</code>,
+ since they conflict with the Namespaces in XML syntax. <a
+ href="#refsXMLNS">[XMLNS]</a></p>
+
+ <p>There is also some data that the <span>HTML parser</span>
+ generates that isn't included in the DOM itself.</p>
+
+ <p>So that tools can apply a consistent set of adjustments to the
+ output of their <span>HTML parser</span> and remain compatible
+ with the rest of their XML toolchain, this section documents a set
+ of mutations and conventions that will convert the output of the
+ <span>HTML parser</span> for any arbitrary input into an XML Infoset
+ that doesn't have any problematic characteristics.</p>
+
+ <p>Tools that cannot convey the out-of-band information using
+ out-of-band mechanisms, or that cannot convey the DOM exactly as
+ prescribed by this specification, may either ignore the offending
+ information or DOM feature, or may represent it internally in the
+ DOM using the conventions described below.</p>
+
+ <p>These conventions are not conforming HTML, and user agents must
+ not output such syntax outside of their XML pipeline.</p>
+
+ <dl>
+
+ <dt>The <code>DocumentType</code> node's <code
+ title="">name</code>, <code title="">publicId</code>, and <code
+ title="">systemId</code> attributes</dt>
+
+ <dd>If the XML API being used doesn't support DOCTYPEs, tools may
+ drop DOCTYPEs altogether or create a set of three attributes on the
+ root element, named <code title="">__doctype_name__</code>, <code
+ title="">__doctype_publicid__</code>, and <code
+ title="">__doctype_systemid__</code>, respectively, whose values
+ are the values that would have been put on the
+ <code>DocumentType</code> node.</dd>
+
+
+ <dt>The document being set to <i>no quirks mode</i>, <i>limited
+ quirks mode</i>, or <i>quirks mode</i></dt>
+
+ <dd>To convey this information, create an attribute <code
+ title="">__mode__</code> on the root element, with values
+ "noquirks", "limitedquirks", or "quirks" respectively.</dd>
+
+
+ <dt>Elements that have a namespace without appropriate <code
+ title="">xmlns</code> attributes being in scope</dt>
+
+ <dd>Construct the DOM as if appropriate namespace declarations were
+ in scope.</dd>
+
+
+ <dt>Elements whose names contain U+003A COLON (:) characters or
+ characters that cannot be represented in XML element names</dt>
+
+ <dd>Drop the element and all its children, or replace any offending
+ characters with a U+005F LOW LINE (_) character.</dd>
+
+
+ <dt>Attributes named <code title="">xmlns</code> whose values match
+ the namespace of the element node</dt>
+
+ <dd>Construct the DOM as if these were default namespace
+ declarations.</dd>
+
+
+ <dt>Attributes named <code title="">xmlns:xlink</code> whose values
+ match the <span>XLink namespace</span>, on elements whose namespace
+ is not the <span>HTML namespace</span></dt>
+
+ <dd>Construct the DOM as if these were namespace prefix
+ declarations.</dd>
+
+
+ <dt>Other attributes whose names are <code title="">xmlns</code> or
+ start with <code title="">xmlns:</code></dt>
+
+ <dd>Drop the attributes or add two U+005F LOW LINE (_) characters
+ to the start of the attributes' names and replace any U+003A COLON
+ (:) characters with a U+005F LOW LINE (_) character.</dd>
+
+
+ <dt>Other attributes in no namespace whose names contain U+003A
+ COLON (:) characters</dt>
+ <dt>Attributes whose names contain characters that cannot be
+ represented in XML attribute names</dt>
+
+ <dd>Drop the attributes or replace any offending characters with a
+ U+005F LOW LINE (_) character, dropping any attributes where doing
+ this would cause an attribute name clash.</dd>
+
+
+ <dt>Form controls being associated with forms that aren't their
+ nearest ancestor (use of the <span><code>form</code> element
+ pointer</span>)</dt>
+
+ <dd>Create an attribute <code title="">__formid__</code> on the
+ form, with a value unique amongst <code title="">__formid__</code>
+ attributes in the document, and create an attribute <code
+ title="">__form__</code> on the form control, whose value matches
+ the unique identifier given to the form.</dd>
+
+
+ <dt>Any U+000C FORM FEED (FF) character</dt>
+
+ <dd>Replace the character with a U+0020 SPACE character.</dd>
+
+
+ <dt>Any other literal non-XML character</dt>
+
+ <dd>Replace the character with a U+FFFD REPLACEMENT CHARACTER.</dd>
+
+
+ <dt>A comment that contains two adjacent U+002D HYPHEN-MINUS
+ characters (--)</dt>
+
+ <dd>Insert a U+0020 SPACE character between them.</dd>
+
+ </dl>
+
+ <p>Tools that use these conventions should guard against documents
+ that include markup that clashes with them by always dropping all
+ attributes in the document that start with two U+005F LOW LINE (_)
+ characters.</p>
+
+ <p class="note">These conventions apply <em>after</em> the
+ <span>HTML parser</span>'s rules have been applied. For example, a
+ <code title=""><a::></code> start tag will be closed by a <code
+ title=""></a::></code> end tag, and never by a <code
+ title=""></a__></code> end tag, even if the user agent is using
+ the rules above to then generate an actual element in the DOM with
+ the name <code title="">a__</code> for that start tag.</p>
+
+
+
<h3>Namespaces</h3>
<p>The <dfn>HTML namespace</dfn> is: <code>http://www.w3.org/1999/xhtml</code></p>
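
The character and comment rules added in this revision (FORM FEED, other non-XML characters, and adjacent hyphens in comments) can be sketched in Python as follows. This is an illustration only and not part of the commit: the names coerce_text and coerce_comment are hypothetical, and "non-XML character" is assumed here to mean anything outside the Char production of XML 1.0.

```python
import re

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NON_XML_CHAR = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def coerce_text(text):
    """Make character data representable in an XML infoset."""
    # U+000C FORM FEED becomes U+0020 SPACE.
    text = text.replace("\u000C", " ")
    # Any other non-XML character becomes U+FFFD REPLACEMENT CHARACTER.
    return _NON_XML_CHAR.sub("\uFFFD", text)


def coerce_comment(data):
    """Make comment data representable as an XML comment."""
    # Insert a space after every hyphen that is directly followed by
    # another hyphen, so that no "--" pair remains.
    return re.sub(r"-(?=-)", "- ", coerce_text(data))
```

The lookahead in coerce_comment lets a run of three or more hyphens be separated pairwise in a single pass, which a plain replacement of "--" with "- -" would not do.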