[whatwg] DOM serialization: A new level perhaps?

Fri Nov 16 09:09:21 PST 2012

I've been looking at the new DOM serialization spec, and I have a concern.
The serialization algorithms assume that the document being serialized was
never parsed from existing source text, so they don't preserve the author's
original formatting.

For reference:  http://domparsing.spec.whatwg.org/#concept-serialize-xml

Example:  If I parse this:

<class name="Computer Science"
       code="CSCI"
      >
  <student id="X1234"/>
</class>

The resulting DOM is fine.  When I reserialize it, I get:

<class name="Computer Science" code="CSCI">
  <student id="X1234"/>
</class>

When I run a diff in SVN or Mercurial before committing, it shows changes
that I did not intend.  Forget the fact that the latter is 'prettier' - the
authoring style has been altered.  This can matter when an element has
several attributes, which on a single line push the line length well above
80 characters.

I'm starting some experiments using Mozilla's SAX parser and JavaScript to
work out some solutions.  First, the DOMParser may elect to cache the
original source of each element's start and end tags (and the original
source of non-element nodes) as a private property of the node, which the
DOMSerializer may reuse if the node has not mutated.  Second, I'm going to
try defining a "MarkupIndent" data type in JS to capture the source
document's indentation style(s), so that when a node changes (gaining a new
attribute, for example), the DOMSerializer has enough information to make
an educated guess about how the author wants it formatted.
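
To make the two ideas concrete, here is a minimal sketch in plain
JavaScript.  It is not the implementation promised below and does not use
the SAX parser: the element is built by hand, and SketchElement,
guessAttributeIndent and serializeStartTag are hypothetical names invented
for illustration.  It only shows reusing a cached start tag when nothing
has changed, and guessing attribute indentation when something has.

```javascript
// Hypothetical sketch: SketchElement, guessAttributeIndent and
// serializeStartTag are illustration-only names, not part of any spec.

// Guess the whitespace the author put before attributes on continuation lines.
function guessAttributeIndent(startTagSource) {
  for (const line of startTagSource.split("\n").slice(1)) {
    const match = /^([ \t]+)[^\s>\/]/.exec(line);
    if (match) return match[1];
  }
  return null; // every attribute was on the first line
}

class SketchElement {
  constructor(name, attributes, startTagSource) {
    this.name = name;
    this.attributes = new Map(attributes);
    this.startTagSource = startTagSource; // cached by the (assumed) parser
    this.attrIndent = guessAttributeIndent(startTagSource);
    this.mutated = false;
  }

  setAttribute(name, value) {
    this.attributes.set(name, value);
    this.mutated = true; // the cached source is no longer trustworthy
  }
}

function escapeAttribute(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/"/g, "&quot;");
}

function serializeStartTag(element) {
  // Unmutated node: hand back the author's original text verbatim.
  if (!element.mutated) return element.startTagSource;

  // Mutated node: rebuild the tag, imitating the author's attribute layout.
  const pairs = [...element.attributes].map(
    ([name, value]) => `${name}="${escapeAttribute(value)}"`
  );
  const separator =
    element.attrIndent === null ? " " : "\n" + element.attrIndent;
  const close = /\/\s*>$/.test(element.startTagSource) ? "/>" : ">";
  const attrs = pairs.length ? " " + pairs.join(separator) : "";
  return "<" + element.name + attrs + close;
}

const source =
  '<class name="Computer Science"\n' +
  '       code="CSCI"\n' +
  '      >';
const cls = new SketchElement(
  "class",
  [["name", "Computer Science"], ["code", "CSCI"]],
  source
);
console.log(serializeStartTag(cls) === source); // unmutated: source reused
cls.setAttribute("year", "2012");
console.log(serializeStartTag(cls)); // rebuilt with the guessed indentation
```

In this sketch the rebuilt tag keeps one attribute per line but does not
try to restore the lone ">" on its own line; a fuller MarkupIndent type
would have to record that as well.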

I expect to deliver a working implementation, with documentation and tests,
within a month or so.  I hope we can then add a chapter to the DOMParsing
spec, making this an optional feature.

--
"The first step in confirming there is a bug in someone else's work is
confirming there are no bugs in your own."
-- Alexander J. Vincent, June 30, 2001