[whatwg] several messages about XML syntax and HTML5

Mon Dec 4 15:26:07 PST 2006

Sam Ruby wrote:
> James Graham wrote:
>> As I understand it, the full chain of events should look like this:
>>
>>  [Internal data model in server]
>>                 |
>>                 |
>>        HTML 5 Serializer
>>                 |
>>                 |
>>             {Network}
>>                 |
>>                 |
>>           HTML 5 Parser
>>                 |
>>                 |
>>  [Whatever client tools you like]
>
> This only works if the internal-data-model to HTML5 conversion is 
> lossless.

The potentially-lossy-conversion argument is rather pointless when you 
consider that reserialising XHTML as HTML is, for all practical 
purposes, almost exactly the same as, or better than, serving XHTML as 
text/html.

The main difference is that instead of the conversion to HTML5 happening 
on the server side, as in that diagram, the browser receives XHTML which 
it then attempts to treat as HTML anyway.  What practical difference is 
there?  The following example illustrates this.

Say the following was your XHTML document.  I'm only including the 
doctype because it's necessary for the example, not because it's useful 
to have in XHTML at all.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
   <title>Example</title>
</head>
<body>
   <p>This document cannot be converted losslessy because:
     <ul>
       <li>A paragraph cannot contain a ul in HTML</li>
     </ul>
     and they will become siblings instead.</p>
</body>
</html>

There are 3 scenarios.  In scenario 1, it's sent unchanged as XML. In 
scenario 2, the XHTML is serialised to HTML on the server side.  In 
scenario 3, it's sent unchanged as text/html.

*Scenario 1: XHTML as XML*

When parsed by the browser using an XML parser, it produces the 
following DOM (whitespace nodes omitted; all elements are in the XHTML 
namespace):

* #DOCTYPE html
* html
     - ("http://www.w3.org/2000/xmlns/", "xmlns")
     - ("http://www.w3.org/XML/1998/namespace", "xml:lang")
   * head
     * title
       * #text: Example
   * body
     * p
       * #text: This document cannot be converted losslessy because:
       * ul
         * li
           * #text: A paragraph cannot contain a ul in HTML
       * #text: and they will become siblings instead.
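As a rough check of the tree above, here is a minimal Python sketch that feeds the same document to a namespace-aware XML parser. It uses the standard library's xml.etree.ElementTree rather than a browser, so treat it as an approximation: ElementTree does not expose the xmlns declaration as an attribute, and it stores the trailing text as the "tail" of the ul rather than as a separate text node. The helper name q is my own.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
XML_NS = "http://www.w3.org/XML/1998/namespace"

XHTML = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
   <title>Example</title>
</head>
<body>
   <p>This document cannot be converted losslessy because:
     <ul>
       <li>A paragraph cannot contain a ul in HTML</li>
     </ul>
     and they will become siblings instead.</p>
</body>
</html>"""


def q(name):
    """Hypothetical helper: qualify a local name with the XHTML namespace."""
    return "{%s}%s" % (XHTML_NS, name)


root = ET.fromstring(XHTML)
body = root.find(q("body"))
p = body.find(q("p"))
ul = p.find(q("ul"))

# An XML parser keeps the ul nested inside the p.
body_children = [child.tag for child in body]
p_children = [child.tag for child in p]

# xml:lang is reported in the XML namespace, not as a plain "xml:lang" name.
lang = root.get("{%s}lang" % XML_NS)

print(body_children, p_children, lang)
```

The point to notice is that body has a single p child, and the ul (with the trailing sentence after it) lives inside that p.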

*Scenario 2: Reserialising as HTML*

Because a <p> cannot contain a <ul>, the document gets converted into 
the following:

<!DOCTYPE html>
<html lang="en">
<head>
   <title>Example</title>
</head>
<body>
   <p>This document cannot be converted losslessy because:
     </p><ul>
       <li>A paragraph cannot contain a ul in HTML</li>
     </ul><p>
     and they will become siblings instead.</p>
</body>
</html>

In this simple example, there were 4 changes:
* Removed the xmlns attribute
* Changed xml:lang to lang
* Ended the <p> element immediately before the <ul>
* Created a new paragraph after the <ul> for the remaining sentence
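The changes listed above could be sketched in Python roughly as follows. This is only an illustration under loud assumptions, not a real XHTML-to-HTML converter: the names BLOCKS, VOID, local, serialise and xhtml_to_html are all mine, the list of elements that cannot appear inside a p is deliberately tiny, and namespaced attributes other than xml:lang are not handled.

```python
import xml.etree.ElementTree as ET
from html import escape

XHTML_NS = "{http://www.w3.org/1999/xhtml}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumption: a small, incomplete set of elements that HTML does not
# allow inside <p>, and of void elements that take no end tag.
BLOCKS = {"ul", "ol", "div", "table", "p"}
VOID = {"br", "hr", "img", "input", "link", "meta"}


def local(tag):
    """Strip the XHTML namespace from an ElementTree tag name."""
    return tag[len(XHTML_NS):] if tag.startswith(XHTML_NS) else tag


def serialise(el, out):
    """Append an HTML serialisation of el (without its tail) to out."""
    name = local(el.tag)
    # The xmlns declaration never shows up in el.attrib, so it is dropped
    # for free; xml:lang is renamed to lang.
    attrs = "".join(
        ' %s="%s"' % ("lang" if key == XML_LANG else key, escape(value))
        for key, value in el.attrib.items()
    )
    out.append("<%s%s>" % (name, attrs))
    if name in VOID:
        return
    out.append(escape(el.text or "", quote=False))
    for child in el:
        # End the paragraph before a block child and start a new one after.
        split = name == "p" and local(child.tag) in BLOCKS
        if split:
            out.append("</p>")
        serialise(child, out)
        if split:
            out.append("<p>")
        out.append(escape(child.tail or "", quote=False))
    out.append("</%s>" % name)


def xhtml_to_html(source):
    """Parse an XHTML string and return a simplified HTML serialisation."""
    out = ["<!DOCTYPE html>\n"]
    serialise(ET.fromstring(source), out)
    return "".join(out)


result = xhtml_to_html(
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">'
    "<body><p>a<ul><li>b</li></ul>c</p></body></html>"
)
print(result)
```

Note that this sketch always reopens a paragraph after the block element, even when no text follows it; a more careful serialiser would skip the empty paragraph. That is exactly the sort of decision you get to make when you do the conversion yourself.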

When parsed, the browser will produce a DOM that looks like this:

* #DOCTYPE html
* html
     - ("", "lang")
   * head
     * title
       * #text: Example
   * body
     * p
       * #text: This document cannot be converted losslessy because:
     * ul
       * li
         * #text: A paragraph cannot contain a ul in HTML
     * p
       * #text: and they will become siblings instead.

*Scenario 3: XHTML as text/html*

This relies on browser error recovery.  The document is sent unchanged 
and produces the following DOM:

* #DOCTYPE html
* html
     - ("", "xmlns")
     - ("", "xml:lang")
   * head
     * title
       * #text: Example
   * body
     * p
       * #text: This document cannot be converted losslessy because:
     * ul
       * li
         * #text: A paragraph cannot contain a ul in HTML
     * #text: and they will become siblings instead.

In this final case, the DOM is similar to scenario 2, except for the 
following:

* The "xmlns" and "xml:lang" attributes are present in the null namespace.
* There is no "lang" attribute in the null namespace.
* The final text node has become a child of body instead of a p element.
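The attribute differences can be illustrated with Python's standard html.parser module. This is a hedged sketch only: html.parser is a plain tokeniser, not a browser parser, so it shows how the start tag's attributes are read as text/html but says nothing about tree construction or error recovery. The class name AttrCollector is my own.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: record the attributes seen on the html element."""

    def __init__(self):
        super().__init__()
        self.html_attrs = None

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            self.html_attrs = dict(attrs)


collector = AttrCollector()
collector.feed(
    "<!DOCTYPE html>\n"
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">'
)
collector.close()

# With no namespace processing, the names are just literal strings.
names = sorted(collector.html_attrs)
print(names)
```

Here "xmlns" and "xml:lang" come through as ordinary attribute names, and nothing named "lang" exists unless the author also wrote it.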

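You've ended up with a lossy conversion of your XHTML in both text/html 
cases.  In fact, it's marginally better when you perform the 
reserialisation yourself because you get to make smarter decisions, 
such as keeping the final sentence inside a p element instead of 
leaving it as a bare text node in body.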

The point is that complaining about the inability to perform lossless 
conversion in some cases has little practical relevance for anyone 
who's willing to serve their XHTML documents as text/html anyway.  The 
end result is practically the same, if not better, when you reserialise 
it yourself.

This issue has been around for years, ever since XHTML 1.0 began 
wreaking havoc on the world, yet it doesn't seem to have particularly 
bothered anyone trying to use it, or even promoting it.

You just need to realise that, if you wish to have your documents 
reserialised as HTML or even wrongly serve XHTML as text/html, you need 
to take care to avoid features which will result in a lossy conversion, 
or put up with the minor discrepancies.

-- 
Lachlan Hunt
http://lachy.id.au/