[whatwg] Removing the need for separate feeds

Wed Jun 10 15:15:13 PDT 2009

On Fri, 22 May 2009, Dan Brickley wrote:
> On 22/5/09 09:21, Ian Hickson wrote:
> > On Fri, 22 May 2009, Henri Sivonen wrote:
> > > On May 22, 2009, at 09:01, Ian Hickson wrote:
> > > >    USE CASE: Remove the need for feeds to restate the content of HTML
> > > > pages
> > > >    (i.e. replace Atom with HTML).
> > > Did you do some kind of "Is this Good for the Web?" analysis on this
> > > one? That is, do things get better if there's yet another feed format?
> > 
> > As far as I can tell, things get better if the feed format and the default
> > output format are the same, yes. Generally, redundant information has
> > tended to lead to problems.
> 
> Would this include having a mechanism (microdata? xml islands?) that preserves
> extension markup from Atom feeds? eg. see
> http://www.ibm.com/developerworks/xml/library/x-extatom1/

Actually the algorithm to convert HTML to Atom doesn't even support all of 
Atom, let alone extensions. However, it's quite possible to extend HTML 
itself if it is to be used as a native feed format, as described here:

   http://wiki.whatwg.org/wiki/FAQ#HTML5_should_support_a_way_for_anyone_to_invent_new_elements.21

On Fri, 22 May 2009, Adrian Sutton wrote:
> On 22/05/2009 08:21, "Ian Hickson" <ian at hixie.ch> wrote:
> > As far as I can tell, things get better if the feed format and the 
> > default output format are the same, yes. Generally, redundant 
> > information has tended to lead to problems.
> 
> Can you point to examples of this in relation to the use of feeds in 
> particular?

Smylers listed more than I could think of:

On Fri, 22 May 2009, Smylers wrote:
> 
> I can't find examples right now, but I have encountered various problems 
> along these lines in the past, including:
> 
> * The feed suddenly becomes empty.
> * A new blog has a 'feed' link, but it never works.
> * A blog's feed URL changes, but doesn't redirect.
> * A feed is misformatted in a way which causes it to be ignored.
> * The content of a feed is misformatted, such that in a feed reader its
>   display is mangled, such as HTML tags and entities showing, or spaces
>   having been squeezed out from around tags such that linked words don't
>   have spaces around them.
> * The content of a feed has certain critical information, such as an
>   image, stripped from it, such that it makes no sense, or has a
>   different meaning from the full post.
> * The content of a feed has certain critical mark-up stripped from it,
>   such as <sup> around exponents in a mathematical expression rendering
>   "36" where "3 to the power of 6" was intended.
> 
> In all cases the HTML version of the blog had correctly displaying and 
> updating content; only the feed was affected by the issues.  This 
> usually left the author unaware of the problem, as they don't subscribe 
> to their own blog.

On Fri, 22 May 2009, Adrian Sutton wrote:
>
> This feels a lot like jumping the shark and solving a problem that has 
> already been solved at one end (syndicating content) and doesn't exist 
> at the other (syndicated content being out of sync with the HTML 
> version).

It seems like defining how one converts HTML to Atom is useful in general 
even if -- maybe even especially if -- the desire is to use Atom.

On Fri, 22 May 2009, Eduard Pascual wrote:
>
> While redundant *source* information easily leads to problems, from what 
> I have seen, the sites using feeds tend to be almost always dynamic: both 
> the HTML pages and the feeds are generated via server scripts from the 
> *same set of source data*, normally from a database. This is especially 
> true for blogs, and any other CMS-based site, since CMSs normally rely a 
> lot on databases and server-side scripting. So in these cases we don't 
> actually have redundant information, but just multiple ways to retrieve 
> the same information.

That seems plausible, yes.

> For manually authored pages and feeds things would be different; but are 
> there really a significant amount of such cases out there? I can't say 
> I have seen the entire web (who can?), but among what I have seen, I 
> have never encountered any hand authored feed, except for code examples 
> and similar "experimental" stuff.

On Fri, 22 May 2009, Toby Inkster wrote:
> 
> Surely this proves the need for a way of extracting feeds from HTML?

I don't know if it proves it per se, but it certainly indicates that there 
is a possible need.

I added the section on how to convert HTML pages to Atom based on requests 
over the years and most recently specifically in the context of the 
microdata section. It doesn't replace Atom, nor is anyone required to 
author HTML in any particular way because of this; it merely provides a 
migration path if one is desired. I think enabling this kind of 
interoperability between standards can only be good.

On Fri, 22 May 2009, Adrian Sutton wrote:
> On 22/05/2009 11:36, "Toby Inkster" <mail at tobyinkster.co.uk> wrote:
> > 
> > You never see manually written feeds because people can't be bothered 
> > to manually write feeds. So the people who manually author HTML simply 
> > don't bother providing feeds at all.
> > 
> > If an HTML page can *be* a feed, this allows manually authored HTML 
> > pages to be subscribed to in feed readers.
> 
> For this to make sense, these people would also be manually adding new 
> entries to the top of the page and dropping old ones off the bottom all 
> by hand.  Feeds aren't used for checking for updates to a page - they're 
> used to check for updates for a site (or section of a site). There are 
> very few cases where every item in a feed corresponds to the same page, 
> even where the entries may be aggregated into a single index page.

I actually do see this happen from time to time, and it used to be quite 
common; but I agree that CMSes have made this practice rarer over time.

On Fri, 22 May 2009, Philip Taylor wrote:
> 
> Perhaps a page like http://philip.html5.org/data.html - people might 
> want to subscribe in their feed reader to see all the exciting updates, 
> and the markup is all hand-written. It's not at all like a blog, but 
> maybe it's data that could be usefully represented with Atom.
> 
> Currently the markup looks like:
> 
>   <ol>
>     <li><a href="http://philip.html5.org/data/abbr-acronym.txt"><code>abbr</code>,
> <code>acronym</code> titles and contents.</a> <!-- 2008-02-03 -->
>     <li><a href="http://philip.html5.org/data/spaced-uris.txt">URIs
> containing spaces.</a> <!-- 2008-02-02 -->
>     ...
>   </ol>
> 
> If I understand the spec correctly, I would have to write something like:
> 
>   <ol>
>     <li>
>       <article pubdate="2008-02-03T00:00:00Z">
>         <h1><a href="http://philip.html5.org/data/abbr-acronym.txt"
> rel="bookmark"><code>abbr</code>, <code>acronym</code> titles and
> contents.</a></h1>
>       </article>
>     <li>
>       <article pubdate="2008-02-02T00:00:00Z">
>         <h1><a href="http://philip.html5.org/data/spaced-uris.txt"
> rel="bookmark">URIs containing spaces.</a></h1>
>       </article>
>     ...
>   </ol>
> 
> and then it would hopefully work.

Sounds right.
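
For illustration only, here is a rough Python sketch of how a convertor
might pick up markup like the above. It is not the algorithm from the spec:
the class name TopLevelArticles, the helper to_atom(), and the choice to map
pubdate onto atom:updated and the rel=bookmark link onto atom:id are all
assumptions made for this example. It also omits the feed-level title, id
and updated elements that a valid Atom feed needs.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


class TopLevelArticles(HTMLParser):
    """Collect pubdate, bookmark link and <h1> text of each top-level <article>."""

    def __init__(self):
        super().__init__()
        self.depth = 0           # current <article> nesting depth
        self.in_heading = False  # inside the <h1> of a top-level article
        self.current = None
        self.entries = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article":
            self.depth += 1
            if self.depth == 1:
                self.current = {"updated": attrs.get("pubdate"),
                                "link": None, "title": ""}
        elif self.depth == 1 and self.current is not None:
            if tag == "h1":
                self.in_heading = True
            elif tag == "a" and "bookmark" in (attrs.get("rel") or "").split():
                if self.current["link"] is None:
                    self.current["link"] = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_heading = False
        elif tag == "article" and self.depth > 0:
            if self.depth == 1 and self.current is not None:
                # Collapse runs of whitespace left over from source line breaks.
                self.current["title"] = " ".join(self.current["title"].split())
                self.entries.append(self.current)
                self.current = None
            self.depth -= 1

    def handle_data(self, data):
        if self.in_heading and self.depth == 1 and self.current is not None:
            self.current["title"] += data


def to_atom(entries):
    """Serialise the collected entries as a (partial) Atom feed string."""
    feed = ET.Element("feed", xmlns=ATOM_NS)
    for e in entries:
        entry = ET.SubElement(feed, "entry")
        ET.SubElement(entry, "title").text = e["title"]
        if e["link"]:
            ET.SubElement(entry, "id").text = e["link"]
            ET.SubElement(entry, "link", rel="alternate", href=e["link"])
        if e["updated"]:
            ET.SubElement(entry, "updated").text = e["updated"]
    return ET.tostring(feed, encoding="unicode")


SAMPLE = """
<ol>
  <li>
    <article pubdate="2008-02-03T00:00:00Z">
      <h1><a href="http://philip.html5.org/data/abbr-acronym.txt"
rel="bookmark"><code>abbr</code>, <code>acronym</code> titles and
contents.</a></h1>
    </article>
  <li>
    <article pubdate="2008-02-02T00:00:00Z">
      <h1><a href="http://philip.html5.org/data/spaced-uris.txt"
rel="bookmark">URIs containing spaces.</a></h1>
    </article>
</ol>
"""

parser = TopLevelArticles()
parser.feed(SAMPLE)
parser.close()
atom = to_atom(parser.entries)
print(atom)
```

Each top-level article becomes one entry; anything outside an <article>
is ignored by this sketch.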

On Fri, 22 May 2009, Adrian Sutton wrote:
> 
> Given the arguments for justifying the cost of additional attributes 
> I've seen go by on this list, this is probably the weakest I've seen and 
> somehow it made it into the draft.

The difference is that this particular feature doesn't require any support 
from any tools other than HTML to Atom convertors.

> HTML 5 doesn't need to solve every possible problem, nor should it try 
> to.

Agreed.

On Fri, 22 May 2009, Brett Zamir wrote:
>
> I also wonder if feeds being accessible in HTML might give rise, as with 
> stylesheets and scripts contained in the head (convenient as those can 
> be too), to excessive bandwidth, as agents repeatedly request updates to 
> a whole HTML page containing a lot of other data.
> 
> (If we had external entities working though, that might be different for 
> XHTML at least, as the file could be included easily as well as reside 
> in its own independently discoverable location (via <link/>)...)

I would imagine that high-traffic sites would continue using dedicated 
low-bandwidth feeds.

On Sun, 24 May 2009, Eduard Pascual wrote:
> 
> Now, having seen some of the cases, I must say that this addition
> looks like a good idea, but it still needs some work (some issues and
> shortcomings have already been highlighted).

If there are any issues still outstanding, please let me know.

> There are cases where keeping a separate feed is still a good idea, most 
> prominently for site-wide feeds (because it's not possible to put all 
> the relevant stuff into a single HTML document, unless such document is 
> made for that purpose, but that would be a separate feed in itself 
> then), and for cases where the traffic on the feed is significantly 
> higher than for the document and/or the size of the document is 
> significantly bigger than the feed. These cases, however, are just 
> unaffected by the addition, and shouldn't prevent the relevant ones from 
> benefiting from it.

Agreed.

On Sat, 23 May 2009, Kornel Lesinski wrote:
> On Fri, 22 May 2009 07:01:51 +0100, Ian Hickson <ian at hixie.ch> wrote:
> > 
> > It doesn't collect the blogroll or the blog post tags yet, mostly 
> > because I'm not sure how to do that. Any suggestions of improvements 
> > are naturally welcome.
> 
> There's hAtom that solves this problem already, and appears to have been 
> widely adopted by popular blogging software:
>
> http://search.yahoo.com/search?p=searchmonkeyid%3Acom.yahoo.page.uf.hatom
> 
> but I doubt that many users take advantage of it. Almost all of these 
> pages have standard feeds as well (and all of them can provide them via 
> hAtom2Atom proxy).
> 
> Maybe a better approach would be to extend hAtom or define extraction in 
> terms of hAtom? (e.g. make <div class="hentry"> and <article> 
> interchangeable?)

The HTML-to-Atom convertor algorithm ended up being based exclusively on 
native HTML semantics, but I agree that it could be merged with hAtom or 
something like hAtom to provide an even higher-fidelity conversion.

> # For each article element article that does not have an ancestor 
> # article element
> 
> That excludes the possibility of syndicating an article's comments from 
> markup like this:
> 
> <body>
> <article>
> 	post
> 	<article>comment</article>
> 	<article>comment</article>
> </article>
> </body>
> 
> A feed with only the single entry "post" or "post comment comment" would 
> not be useful.

Yes. I couldn't figure out how to include comments in an Atom feed along 
with the articles to which they apply.

We could update the HTML-to-Atom algorithm to support extracting 
subarticles if that would be helpful, though.
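
As a rough idea of what extracting subarticles could look like (a sketch
only, not a proposal for the spec text), the Python below keeps the nesting
instead of discarding it: each top-level article records its own text and a
list of child articles. The class name NestedArticles and the dictionary
layout are made up for this example, and how the children would then be
expressed in Atom is left open.

```python
from html.parser import HTMLParser


class NestedArticles(HTMLParser):
    """Build a tree of <article> elements, keeping each one's own text."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open <article> nodes, outermost first
        self.top_level = []  # articles with no <article> ancestor

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            node = {"text": "", "children": []}
            if self.stack:
                # Nested article: attach it to its nearest article ancestor.
                self.stack[-1]["children"].append(node)
            else:
                self.top_level.append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag == "article" and self.stack:
            node = self.stack.pop()
            node["text"] = " ".join(node["text"].split())

    def handle_data(self, data):
        # Text belongs to the innermost open article only.
        if self.stack:
            self.stack[-1]["text"] += data


SAMPLE = """
<body>
<article>
	post
	<article>comment</article>
	<article>comment</article>
</article>
</body>
"""

parser = NestedArticles()
parser.feed(SAMPLE)
parser.close()

for article in parser.top_level:
    print(article["text"], [child["text"] for child in article["children"]])
```

With the markup quoted above, this yields one top-level entry whose text is
just "post", with the two comments kept separately as its children.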

> Another problem is that the algorithm cannot create <summary>. Perhaps 
> <summary> could be assumed if there's an alternate link and the article 
> doesn't contain more than one header? Or has its entire contents wrapped 
> in <blockquote>?

I'm not sure those really match the <summary> concept. The algorithm 
definitely doesn't support everything Atom does.

> I haven't noticed any way to exclude articles from the feed (except the hack 
> <article><article>...</article></article>). I may have news that's not 
> important enough to justify notification of all subscribers. Are 
> trackbacks and tweets appropriate for <article>? I might want to show 
> them on my page, but wouldn't want to repost them in my feeds.

This is the kind of thing that would need more fine-grained control, and 
for which I'd recommend extending this algorithm to work with hAtom.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'