[whatwg] Problems with the Atom Conversion algorithm.

Fri Jun 11 16:48:40 PDT 2010

On http://www.詹姆斯.com/blog/2010/06/html5-atom-gone-wrong, a comparison
is made between an example Atom feed (presumably constructed from blog
metadata) and one constructed by the HTML algorithm reading over the
example blog page.  Not all of these differences are valid, but some
are, and should be fixed in the HTML algo.

1. The HTML algo puts the URL for atom:link elements in the content of
the <link>.  It should be in the @href of the <link>. (Issue 1 in the
blog post)

2. The <title> of Atom entries is constrained to contain text only,
but in practice this "text" can include properly-escaped markup.  The
HTML algo strips that markup out and just uses the textContent of the
appropriate heading.  Some practices, such as using a "sarcastic
<del>" in a heading, are adversely affected by this: the meanings of
"I <del>don't</del> like HTML" and "I don't like HTML" are
completely opposite.  The HTML algo should use the escaped innerHTML
of the appropriate heading instead. (Issue 3 in the blog post)

3. The HTML algo sets the @type attribute on atom:content to "xml" in
some circumstances.  It should be "xhtml". (Issue 4 in the blog post)

4. The HTML algo should include an xml:base attribute in the produced
feed so that relative links resolve correctly.  Alternatively, it
should make all links absolute.  (Issue 8 in the blog post)

5. I'm not 100% certain on this one, but I think that, in the current
step 15.8 of the HTML algo, it should produce a <div> element in the
XHTML namespace.  The algo currently doesn't specify a namespace for
the element. (Issue 5 in the blog post)
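
To make items 1, 3, 4 and 5 concrete, here is a rough Python sketch of
the output shape being asked for.  It is not the HTML algo itself;
build_feed and every value passed to it are made up for illustration,
and putting xml:base on the feed element (rather than on each entry)
is just one possible choice.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
XHTML = "http://www.w3.org/1999/xhtml"
XML = "http://www.w3.org/XML/1998/namespace"

# Only affects which prefixes appear in the serialized output.
ET.register_namespace("", ATOM)
ET.register_namespace("xhtml", XHTML)


def build_feed(base_url, permalink, body_text):
    """Build a one-entry feed; hypothetical helper, not the spec's algorithm."""
    feed = ET.Element(f"{{{ATOM}}}feed")
    # Item 4: xml:base is an attribute, so relative URLs can be resolved.
    feed.set(f"{{{XML}}}base", base_url)

    entry = ET.SubElement(feed, f"{{{ATOM}}}entry")

    # Item 1: the URL goes in @href, and the link element has no text content.
    link = ET.SubElement(entry, f"{{{ATOM}}}link")
    link.set("rel", "alternate")
    link.set("href", permalink)

    # Item 3: inline XHTML content is marked type="xhtml", not "xml".
    content = ET.SubElement(entry, f"{{{ATOM}}}content")
    content.set("type", "xhtml")

    # Item 5: the wrapper div is explicitly in the XHTML namespace.
    div = ET.SubElement(content, f"{{{XHTML}}}div")
    div.text = body_text
    return feed


feed = build_feed("http://example.com/blog/", "2010/06/post", "Hello")
print(ET.tostring(feed, encoding="unicode"))
```

The relative permalink here only makes sense because xml:base is
present; without it, the algo would have to emit absolute URLs.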

Issues 2, 6, and 7 in the blog post appear to be a result of the post
author either reading the spec incorrectly or writing a bad page to
begin with.  There are potential problems around Issue 2, but this
blog post did not run into them.

Issue 9 in the blog post is true, but can't be fixed simply.  In most
circumstances this won't matter, since most blogs are written by a
single author.

The issues listed in the blog post:
1. The URLs for <link> elements should be stored in the @href
attribute and not in the link content.
2. The values used for <id> should be both stable and unique. Using a
copy of the permalink meets neither requirement.
3. Stripping the markup from <title> elements has resulted in one
title changing its meaning entirely.
4. The @type attribute on <content> elements should be "xhtml" for
XHTML content, and not "xml".
5. The XHTML <div> element that is an immediate child of <content> is
not correctly namespaced.
6. The dates in the <published> elements are incorrectly formatted and
in the wrong timezone.
7. The <updated> elements are merely duplicates of the <published>
elements, failing to detect the correct update times.
8. Without an @base attribute, relative URLs inside the <content>
elements will not be correctly resolved.
9. The <author> elements are missing altogether since the algorithm is
only capable of recognising feed-level authors, at best.
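
For the title problem (issue 3), the difference between the two
approaches can be sketched in Python as below.  This is only an
illustration under assumptions: the function names are invented, the
heading markup is passed in as a string, and a real implementation
would read textContent and innerHTML from the DOM.

```python
import html
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects character data only, approximating DOM textContent."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def text_content(inner_html):
    # What the algo does today: drop the markup, keep the text.
    parser = _TextCollector()
    parser.feed(inner_html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def escaped_inner_html(inner_html):
    # What is proposed: keep the markup, escaped, for an html-typed title.
    return html.escape(inner_html, quote=False)


heading = "I <del>don't</del> like HTML"
print(text_content(heading))
print(escaped_inner_html(heading))
```

The first result reads as the opposite of what the author wrote; the
second keeps the <del> so a consumer can still render or interpret it.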

~TJ