[Imps] [whatwg] Standard DOM Serialization? [was :Common Subset]
hsivonen at iki.fi
Mon Dec 11 07:29:36 PST 2006
On Dec 10, 2006, at 02:57, Sam Ruby wrote:
> Henri Sivonen wrote:
>> On Dec 10, 2006, at 02:09, Sam Ruby wrote:
>>> I am asking whether there is interest in identifying ONE standard
>>> serialization that everybody who wishes to comply with could do so.
>> Why? For digital signatures? For comparing parse trees from
>> different parsers?
> My train of thought started with the sharing of test cases, and
> when coupled with the discussion on the common subset; when put
> together I was wondering if there would be a relation between the two.
> I (obviously) hadn't considered innerHTML. *IF* there were
> interest in changing this (something which I presume is *NOT* the
> case) and *IF* a common subset between XHTML5 and HTML5 was viable
> (plausible but not certain) *THEN* the confusing difference in
> meaning between innerHTML in an XML and HTML context could be
I am interested in defining a format for dumping the document tree as
a sequence of bytes that can be compared against a reference sequence
of bytes for testing. I wouldn't expect test case writers to
necessarily write the reference bytes by hand, but the first parser
implementor who gets a particular test case right could share the
tree dump with other implementors.
Making the format a subset of HTML5 instead of defining something
like ESIS would be nice, because checking if the reference dumps map
onto themselves could also be run as a test.
However, I don't think that authors should in any way be
encouraged to make an effort to use such Canonical HTML5 on the Web.
If such suggestions started to float around, next I'd be asked to add
an option to the conformance checker to check if a document is a
Canonical HTML5 document. If I didn't, people would keep pestering me
about it. If I did, some authors would think that making their
documents Canonical HTML5 documents had some value. Then someone
would come around and explain to them that Canonical HTML5 doesn't
really provide any value over HTML5 in general, because UAs have to
accept non-canonical HTML5 anyway. Then those who made the effort
would hate me for inducing them to waste their time on ensuring that
they target a subset.
Here's a quick draft:
* Comments are not serialized (rationale: bogus comments may
contain "-->" which would break the assumption of canonical documents
mapping onto themselves)
* Everything is encoded as UTF-8 *with* the BOM for easy sniffing.
* Start with "<!DOCTYPE html>" followed by LF.
* For element start, write "<" followed by the element name,
followed by attributes, followed by ">".
- Sort attributes lexicographically according to the attribute
name. (By code point or by UTF-16 code unit?)
- For each attribute, write: space, attribute name, equals sign
and the value in double quotes with <, >, & and " escaped.
- xmlns is not written even if permitted.
* Write character data with <, > and & escaped.
* Write a single LF after the end tag of the root element.
* Character data (white space) outside the root element is ignored.
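
To make the draft above concrete, here is a minimal Python sketch of the dump. It is an illustration under stated assumptions, not a reference implementation: the Element and Comment classes and the function names are hypothetical, escapes are written as named character references (the draft only says "escaped"), and every element gets an end tag (the draft only mentions the root's end tag and says nothing about void elements). Attribute sorting defaults to code point order; utf16_key shows the UTF-16 code unit alternative, since the draft leaves that question open.

```python
from dataclasses import dataclass, field

BOM = b"\xef\xbb\xbf"  # UTF-8 byte order mark, written first for sniffing


@dataclass
class Comment:
    # Hypothetical comment node; never serialized.
    text: str


@dataclass
class Element:
    # Hypothetical tree node; children are Element, Comment or str.
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def escape_text(s):
    # Character data: escape &, < and > (ampersand first).
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_attr(s):
    # Attribute values: additionally escape the double quote.
    return escape_text(s).replace('"', "&quot;")


def utf16_key(name):
    # Sort key giving UTF-16 code unit order instead of code point order.
    return name.encode("utf-16-be")


def write_node(node, out, sort_key=None):
    if isinstance(node, Comment):
        return  # comments are dropped
    if isinstance(node, str):
        out.append(escape_text(node))
        return
    out.append("<" + node.name)
    for name in sorted(node.attrs, key=sort_key):
        if name == "xmlns":
            continue  # xmlns is not written even if permitted
        out.append(' %s="%s"' % (name, escape_attr(node.attrs[name])))
    out.append(">")
    for child in node.children:
        write_node(child, out, sort_key)
    # Assumption: an end tag for every element, void or not.
    out.append("</" + node.name + ">")


def dump_tree(root, sort_key=None):
    # Only the root element is passed in, so character data outside
    # the root is ignored by construction.
    out = ["<!DOCTYPE html>\n"]
    write_node(root, out, sort_key)
    out.append("\n")  # single LF after the root's end tag
    return BOM + "".join(out).encode("utf-8")
```

The two orderings only disagree when an attribute name contains characters outside the Basic Multilingual Plane, because surrogate code units (D800 to DFFF) sort below code units such as FF5E. For ordinary attribute names either choice gives the same bytes.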