[whatwg] Trying to work out the problems solved by RDFa

Fri Jan 9 13:00:11 PST 2009

On Fri, Jan 9, 2009 at 2:17 PM, Ben Adida <ben at adida.net> wrote:
> Tab Atkins Jr. wrote:
>> Actually, SearchMonkey is an excellent use case, and provides a
>> problem statement.
>
> I'm surprised, but very happily so, that you agree.
>
> My confusion stems from the fact that Ian clearly mentioned SearchMonkey
> in his email a few days ago, then proceeded to say it wasn't a good use
> case.

I apologize; looking back into my archives, it appears there was an
entire subthread specifically about SearchMonkey!  Also, Ian did
indeed mention it in his first email in this thread; in fact, he gave
it more attention than any other single use case.  I'll quote the
relevant part:

> On Tue, 26 Aug 2008, Ben Adida wrote:
> >
> > Here's one example. This is not the only way that RDFa can be helpful,
> > but it should help make things more concrete:
> >
> >   http://developer.yahoo.com/searchmonkey/
> >
> > Using semantic markup in HTML (microformats and, soon, RDFa), you, as a
> > publisher, can choose to surface more relevant information straight into
> > Yahoo search results.
>
> This doesn't seem to require RDFa or any generic data syntax at all. Since
> the system is site-specific anyway (you have to list the URLs you wish to
> act against), the same kind of mechanism could be done by just extracting
> the data straight out of the page. This would have the advantage of
> working with any Web page without requiring the page to be written using a
> particular syntax.
>
> However, if SearchMonkey is an example of a use case, then we should
> determine the requirements for this feature. It seems, based on reading
> the documentation, that it basically boils down to:
>
>  * Pages should be able to expose nested lists of name-value pairs on a
>   page-by-page basis.
>
>  * It should be possible to define globally-unique names, but the syntax
>   should be optimised for a set of predefined vocabularies.
>
>  * Adding this data to a page should be easy.
>
>  * The syntax for adding this data should encourage the data to remain
>   accurate when the page is changed.
>
>  * The syntax should be resilient to intentional copy-and-paste authoring:
>   people copying data into the page from a page that already has data
>   should not have to know about any declarations far from the data.
>
>  * The syntax should be resilient to unintentional copy-and-paste
>   authoring: people copying markup from the page who do not know about
>   these features should not inadvertently mark up their page with
>   inapplicable data.
>
> Are there any other requirements that we can derive from SearchMonkey?

I agree with Ian that SearchMonkey does not *necessarily* argue in
favor of RDFa; that may be what caused you to think he was dismissing
it.  In truth, Ian is merely trying to take current examples of RDFa
use and distill them into their essence.  (To reuse my earlier
example, it is similar to looking at what all the various
rounded-corners hacks were doing, without necessarily implying that
the final solution will be anything like them.  It's important to
separate the actual problems that users are solving from the details
of the particular solutions they are using.)

Like I said, I think SearchMonkey sounds absolutely awesome, and
genuinely useful on a level I haven't yet seen any apps of similar
nature reach.  I'm exclusively a Google user, but that's something I'd
love to have ported over.  It's similar in nature to IE8's
Accelerators, in that it's an opt-in application for users that
reduces clicks to get to information they actively decide they want.

However, Ian has a point in his first paragraph.  SearchMonkey does
*not* do auto-discovery; it relies entirely on site owners telling it
precisely what data to extract, where it's allowed to extract it from,
and how to present it.  It is likely that this can be done entirely
within the confines of current HTML, and the fact that SearchMonkey
can use Microformats suggests that this is true.  One possible
approach is for a site owner to produce an ad-hoc microformat (little
m) that the crawler can match against pages, indexing the matched
information and then offering it to the SearchMonkey application for
presentation as the developer wills.  This would require specified
parsing rules for such things (which, as mentioned in an earlier
email, the big-m Microformats community is working on).
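
To make the ad-hoc microformat idea a bit more concrete, here is a
rough sketch of what the crawler side might look like.  Everything in
it is an assumption for illustration only: the class names, the
TEMPLATE mapping, and the extract() helper are hypothetical, and none
of it reflects how SearchMonkey actually works.  It also assumes
reasonably well-nested markup, which a real crawler could not rely on.

```python
from html.parser import HTMLParser

# Hypothetical template a site owner might hand to the crawler:
# class name used in the page markup -> field name to index.
TEMPLATE = {"biz-name": "name", "biz-phone": "phone"}

# Elements that never have an end tag, so they must not be stacked.
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class TemplateExtractor(HTMLParser):
    """Collect the text of elements whose class matches the template."""

    def __init__(self, template):
        super().__init__()
        self.template = template
        self.fields = {}   # extracted field name -> text
        self._stack = []   # one entry per open element: field name or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        classes = (dict(attrs).get("class") or "").split()
        field = next(
            (self.template[c] for c in classes if c in self.template), None
        )
        self._stack.append(field)
        if field is not None:
            self.fields.setdefault(field, "")

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Text counts toward every enclosing element that maps to a field.
        for field in {f for f in self._stack if f is not None}:
            self.fields[field] += data


def extract(html, template=TEMPLATE):
    """Return the name-value pairs the template finds in one page."""
    parser = TemplateExtractor(template)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so values are easy to compare and index.
    return {name: " ".join(text.split()) for name, text in parser.fields.items()}


page = (
    '<div class="listing">'
    '<span class="biz-name">Joe\'s <b>Pizza</b></span><br>'
    '<span class="biz-phone"> 555-0100 </span>'
    '<p class="other">ignored</p>'
    '</div>'
)
print(extract(page))
```

The point of the sketch is only that the mapping lives outside the
page: the markup carries ordinary class attributes, and the site owner
tells the crawler separately which of them matter.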

The question is, would this be sufficient?  Are other approaches
easier for authors?  RDFa, as noted, already has a specified parsing
model.  Does this make it easier for authors to design data templates?
Easier to communicate templates to a crawler?  Easier to deploy in a
site?  Easier to parse for a crawler?

SearchMonkey makes mention of developers producing SearchMonkey apps
without the explicit permission of site owners.  This use would almost
certainly be better served with a looser data discovery model than
RDFa, so that a site owner doesn't have to explicitly comply in order
for others to extract useful data from their pages.  How important is
this?

These are precisely the sort of questions I think Ian wants and needs
asked.  SearchMonkey is an awesome app; do we need to do anything to
support it and similar apps?  *Can* anything we do support it, or is
it best served by solutions that ignore us completely?  Yes,
SearchMonkey operates on metadata, and the problem space doesn't allow
natural-language processing to stand in for it; it is not clear,
though, that a strict markup approach is best for authors or users.
Nevertheless, it is an excellent use-case to distill requirements from
so we *can* determine if a spec-based solution is desirable.

~TJ