[whatwg] several messages about XML syntax and HTML5
Lachlan Hunt
lachlan.hunt at lachy.id.au
Mon Dec 4 15:26:07 PST 2006
Sam Ruby wrote:
> James Graham wrote:
>> As I understand it, the full chain of events should look like this:
>>
>> [Internal data model in server]
>> |
>> |
>> HTML 5 Serializer
>> |
>> |
>> {Network}
>> |
>> |
>> HTML 5 Parser
>> |
>> |
>> [Whatever client tools you like]
>
> This only works if the internal-data-model to HTML5 conversion is
> lossless.
The potentially-lossy-conversion argument is rather pointless when you
consider that reserialising XHTML as HTML is, for all practical
purposes, almost exactly the same as, or better than, serving XHTML as
text/html.
The main difference is that instead of the conversion to HTML5 happening
on the server side, as in that diagram, the browser receives XHTML which
it then attempts to treat as HTML anyway. What practical difference is
there? The following example illustrates this.
Say the following was your XHTML document. I'm only including the
doctype because it's necessary for the example, not because it's useful
to have in XHTML at all.
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>Example</title>
</head>
<body>
<p>This document cannot be converted losslessy because:
<ul>
<li>A paragraph cannot contain a ul in HTML</li>
</ul>
and they will become siblings instead.</p>
</body>
</html>
There are 3 scenarios. In scenario 1, it's sent unchanged as XML. In
scenario 2, the XHTML is serialised to HTML on the server side. In
scenario 3, it's sent unchanged as text/html.
*Scenario 1: XHTML as XML*
When parsed by the browser using an XML parser, it produces the
following DOM:
(whitespace nodes omitted and all elements are in the XHTML namespace)
* #DOCTYPE html
* html
  - ("http://www.w3.org/2000/xmlns/", "xmlns")
  - ("http://www.w3.org/XML/1998/namespace", "xml:lang")
  * head
    * title
      * #text: Example
  * body
    * p
      * #text: This document cannot be converted losslessy because:
      * ul
        * li
          * #text: A paragraph cannot contain a ul in HTML
      * #text: and they will become siblings instead.
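For anyone who wants to check this nesting without a browser, here is a rough sketch using Python's standard xml.etree.ElementTree. That library is my own choice for illustration, not what any browser uses. ElementTree drops the doctype and does not expose xmlns as an attribute, so the sketch only inspects the element nesting and xml:lang.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

DOC = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>Example</title>
</head>
<body>
<p>This document cannot be converted losslessy because:
<ul>
<li>A paragraph cannot contain a ul in HTML</li>
</ul>
and they will become siblings instead.</p>
</body>
</html>"""

root = ET.fromstring(DOC)

# Find the paragraph; element names carry the XHTML namespace.
p = root.find(".//{%s}p" % XHTML_NS)

# Local names of the element children of <p>.
child_names = [child.tag.split("}")[1] for child in p]

# ElementTree stores the text following an element as that element's "tail",
# so the last sentence still belongs to the <p> subtree.
trailing_text = p[0].tail.strip()

lang = root.get(XML_LANG)

print(child_names, repr(trailing_text), lang)
```

An XML parser keeps the ul inside the p, with the final sentence following it in the same paragraph.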
*Scenario 2: Reserialising as HTML*
Because a <p> cannot contain a <ul>, the document gets converted into
the following:
<!DOCTYPE html>
<html lang="en">
<head>
<title>Example</title>
</head>
<body>
<p>This document cannot be converted losslessy because:
</p><ul>
<li>A paragraph cannot contain a ul in HTML</li>
</ul><p>
and they will become siblings instead.</p>
</body>
</html>
In this simple example, there were 4 changes:
* Removal of xmlns
* Changed xml:lang to lang
* The <p> element had to end immediately before the <ul>
* Created a new paragraph after the UL for the remaining sentence.
When parsed, the browser will produce a DOM that looks like this:
* #DOCTYPE html
* html
  - ("", "lang")
  * head
    * title
      * #text: Example
  * body
    * p
      * #text: This document cannot be converted losslessy because:
    * ul
      * li
        * #text: A paragraph cannot contain a ul in HTML
    * p
      * #text: and they will become siblings instead.
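The four changes listed for this scenario can be sketched as a tiny reserialiser. This is only an illustration under stated assumptions: the function name to_html, the CLOSES_P set and the use of Python's ElementTree are all mine, the set of paragraph-closing elements is deliberately incomplete, only direct children of a <p> are handled, and void elements such as br are not treated specially.

```python
import xml.etree.ElementTree as ET
from html import escape

XHTML = "{http://www.w3.org/1999/xhtml}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumption: an incomplete list of elements that cannot sit inside <p>.
CLOSES_P = {"ul", "ol", "div", "table", "p"}

DOC = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>Example</title>
</head>
<body>
<p>This document cannot be converted losslessy because:
<ul>
<li>A paragraph cannot contain a ul in HTML</li>
</ul>
and they will become siblings instead.</p>
</body>
</html>"""


def to_html(elem, parent_is_p=False):
    """Serialise an XHTML element as HTML, splitting <p> around block children."""
    name = elem.tag.replace(XHTML, "")
    # xml:lang becomes lang; xmlns never shows up in elem.attrib at all.
    attrs = "".join(
        ' {}="{}"'.format("lang" if key == XML_LANG else key, escape(value))
        for key, value in elem.attrib.items()
    )
    split = parent_is_p and name in CLOSES_P
    # End the enclosing paragraph before a block element.
    out = ("</p>" if split else "") + "<{}{}>".format(name, attrs)
    out += escape(elem.text or "", quote=False)
    for child in elem:
        out += to_html(child, parent_is_p=(name == "p"))
        out += escape(child.tail or "", quote=False)
    out += "</{}>".format(name)
    # Reopen a paragraph for whatever text follows the block element
    # (this yields an empty <p></p> if nothing follows).
    if split:
        out += "<p>"
    return out


html_out = "<!DOCTYPE html>\n" + to_html(ET.fromstring(DOC))
print(html_out)
```

Doing the split on the server means you decide where the paragraph ends and that the trailing sentence gets its own paragraph, rather than leaving it to error recovery.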
*Scenario 3: XHTML as text/html*
This relies on browser error recovery. The document is sent unchanged
and produces the following DOM:
* #DOCTYPE html
* html
  - ("", "xmlns")
  - ("", "xml:lang")
  * head
    * title
      * #text: Example
  * body
    * p
      * #text: This document cannot be converted losslessy because:
    * ul
      * li
        * #text: A paragraph cannot contain a ul in HTML
    * #text: and they will become siblings instead.
In this final case, the DOM is similar to scenario 2, except for the
following:
* The "xmlns" and "xml:lang" attributes are in the null namespace.
* There is no "lang" attribute in the null namespace.
* The final text node has become a child of body, instead of a p element.
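The recovery behaviour described for scenario 3 can be imitated with a toy tree builder on top of Python's html.parser tokenizer. To be clear, this is not how any browser is implemented: the class TinyTreeBuilder is hypothetical, it knows exactly one recovery rule (a <ul> start tag closes an open <p>), and it simply ignores the stray </p>, which real parsers may handle differently.

```python
from html.parser import HTMLParser

DOC = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>Example</title>
</head>
<body>
<p>This document cannot be converted losslessy because:
<ul>
<li>A paragraph cannot contain a ul in HTML</li>
</ul>
and they will become siblings instead.</p>
</body>
</html>"""


class TinyTreeBuilder(HTMLParser):
    """Toy builder: nodes are dicts, text nodes are plain strings."""

    def __init__(self):
        super().__init__()
        self.root = {"name": "#document", "attrs": [], "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # The single recovery rule: <ul> implicitly ends an open <p>.
        if tag == "ul" and self.stack[-1]["name"] == "p":
            self.stack.pop()
        node = {"name": tag, "attrs": attrs, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Only close the current node; unmatched end tags are dropped.
        if self.stack[-1]["name"] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Whitespace-only text is skipped, as in the trees above.
        if data.strip():
            self.stack[-1]["children"].append(data.strip())


builder = TinyTreeBuilder()
builder.feed(DOC)
builder.close()

html_node = builder.root["children"][0]
body = html_node["children"][1]

# Attribute names are kept verbatim, with no namespace processing.
attr_names = [name for name, _ in html_node["attrs"]]
body_kinds = [c["name"] if isinstance(c, dict) else "#text" for c in body["children"]]

print(attr_names, body_kinds)
```

The tokenizer hands over "xmlns" and "xml:lang" as ordinary attribute names, and the last sentence ends up as a bare text child of body.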
You've ended up with a lossy conversion of your XHTML in both text/html
cases. In fact, it's marginally better when you perform the
reserialisation yourself because you get to make smarter decisions.
The point is that complaining about the inability to perform lossless
conversion in some cases is not practically relevant for anyone who's
willing to serve their XHTML documents as text/html anyway – the end
result is practically the same, if not better, when you reserialise it
yourself.
This issue has been around for years, ever since XHTML 1.0 began
wreaking havoc on the world, yet it doesn't seem to have particularly
bothered anyone trying to use it, or even promoting it.
You just need to realise that, if you wish to have your documents
reserialised as HTML or even wrongly serve XHTML as text/html, you need
to take care to avoid features which will result in a lossy conversion,
or put up with the minor discrepancies.
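One way to "take care" is to check documents before they are published. The following is a hedged sketch of such a check for the one feature used in the example, block elements nested inside a paragraph. The function name lossy_paragraphs and the BLOCK set are hypothetical, and the set is far from a complete list.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

# Assumption: an incomplete list of elements that end an open <p> in HTML.
BLOCK = {"ul", "ol", "div", "table", "p", "blockquote", "pre"}


def lossy_paragraphs(xhtml_source):
    """Return the names of block elements found nested inside <p> elements."""
    root = ET.fromstring(xhtml_source)
    found = []
    for p in root.iter(XHTML + "p"):
        for node in p.iter():
            name = node.tag.replace(XHTML, "")
            # p.iter() also yields p itself, so skip it.
            if node is not p and name in BLOCK:
                found.append(name)
    return found


risky = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<p>a<ul><li>b</li></ul>c</p></body></html>"
)
safe = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<p>a</p><ul><li>b</li></ul><p>c</p></body></html>"
)

risky_result = lossy_paragraphs(risky)
safe_result = lossy_paragraphs(safe)
print(risky_result, safe_result)
```

A non-empty result means the document will not survive a trip through text/html unchanged.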
--
Lachlan Hunt
http://lachy.id.au/