[Imps] [whatwg] Standard DOM Serialization? [was :Common Subset]

Mon Dec 11 07:29:36 PST 2006

On Dec 10, 2006, at 02:57, Sam Ruby wrote:

> Henri Sivonen wrote:
>> On Dec 10, 2006, at 02:09, Sam Ruby wrote:
>>> I am asking whether there is interest in identifying ONE standard  
>>> serialization that everybody who wishes to comply with could do so.
>> Why? For digital signatures? For comparing parse trees from  
>> different parsers?
>
> My train of thought started with the sharing of test cases, and  
> when coupled with the discussion on the common subset; when put  
> together I was wondering if there would be a relation between the two.
>
> I (obviously) hadn't considered innerHTML.  *IF* there were  
> interest in changing this (something which I presume is *NOT* the  
> case) and *IF* a common subset between XHTML5 and HTML5 was viable  
> (plausible but not certain) *THEN* the confusing difference in  
> meaning between innerHTML in an XML and HTML context could be  
> resolved.

I am interested in defining a format for dumping the document tree as  
a sequence of bytes that can be compared against a reference sequence  
of bytes for testing. I wouldn't expect test case writers to  
necessarily write the reference bytes by hand, but the first parser  
implementor who gets a particular test case right could share the  
tree dump with other implementors.

Making the format a subset of HTML5 instead of defining something  
like ESIS would be nice, because checking that each reference dump,  
when parsed and serialized again, maps onto itself could also be run  
as a test.

However, I don't think that authors should in any way be  
encouraged to make an effort to use such Canonical HTML5 on the Web.  
If such suggestions started to float around, next I'd be asked to add  
an option to the conformance checker to check if a document is a  
Canonical HTML5 document. If I didn't, people would keep pestering me  
about it. If I did, some authors would think that making their  
documents Canonical HTML5 documents had some value. Then someone  
would come around and explain to them that Canonical HTML5 doesn't  
really provide any value over HTML5 in general, because UAs have to  
accept non-canonical HTML5 anyway. Then those who made the effort  
would hate me for inducing them to waste their time on ensuring that  
they target a subset.

Here's a quick draft:
  * Comments are not serialized (rationale: bogus comments may  
contain "-->" which would break the assumption of canonical documents  
mapping onto themselves)
  * Everything is encoded as UTF-8 *with* the BOM for easy sniffing.
  * Start with "<!DOCTYPE html>" followed by LF.
  * For element start, write "<" followed by the element name,  
followed by attributes, followed by ">".
    - Sort attributes lexicographically according to the attribute  
name. (By code point or by UTF-16 code unit?)
    - For each attribute, write: space, attribute name, equals sign  
and the value in double quotes with <, >, & and " escaped.
    - xmlns is not written even if permitted.
  * Write character data with <, > and & escaped.
  * Write a single LF after the end tag of the root element.
  * Character data (white space) outside the root element is ignored.
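
The draft rules above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not a reference implementation: the Element and Comment classes and the canonical_dump function are hypothetical names invented for the example, attributes are sorted by code point (the draft leaves code point versus UTF-16 code unit open), and an end tag "</name>" is written for every element, which the draft implies but does not spell out. Void elements are not given any special treatment here.

```python
import codecs
from dataclasses import dataclass, field


@dataclass
class Comment:
    """Hypothetical comment node; the draft says these are not serialized."""
    data: str


@dataclass
class Element:
    """Hypothetical minimal tree node: children are Element, Comment or str."""
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def escape_text(s):
    # Character data: escape <, > and & (ampersand first to avoid double escaping).
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_attr(s):
    # Attribute values: escape <, >, & and the double quote.
    return escape_text(s).replace('"', "&quot;")


def write_element(el, out):
    out.append("<" + el.name)
    # Python compares str by code point; this choice is an assumption.
    for name in sorted(el.attrs):
        if name == "xmlns":
            continue  # xmlns is not written even if permitted
        out.append(' %s="%s"' % (name, escape_attr(el.attrs[name])))
    out.append(">")
    for child in el.children:
        if isinstance(child, Element):
            write_element(child, out)
        elif isinstance(child, str):
            out.append(escape_text(child))
        # Comment nodes fall through here and are skipped.
    # Assumption: every element gets an explicit end tag.
    out.append("</" + el.name + ">")


def canonical_dump(root):
    """Return the dump as bytes: UTF-8 with BOM, doctype, tree, final LF."""
    out = ["<!DOCTYPE html>\n"]
    # Only the root element is passed in, so character data outside
    # the root element is never written.
    write_element(root, out)
    out.append("\n")
    return codecs.BOM_UTF8 + "".join(out).encode("utf-8")
```

Two implementations that produce such dumps for the same test case could then compare the resulting byte strings directly for equality.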

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/