[whatwg] Trying to work out the problems solved by RDFa

Fri Jan 9 14:13:43 PST 2009

On Fri, Jan 9, 2009 at 3:22 PM, Ben Adida <ben at adida.net> wrote:
> Tab Atkins Jr. wrote:
>> However, Ian has a point in his first paragraph.  SearchMonkey does
>> *not* do auto-discovery; it relies entirely on site owners telling it
>> precisely what data to extract, where it's allowed to extract it from,
>> and how to present it.
>
> That's incorrect.
>
> You can build a SearchMonkey infobar that is set to function on all URLs
> (just use "*" in your URL field.)
>
> For example, the Creative Commons SearchMonkey application:
>
> http://gallery.search.yahoo.com/application?smid=kVf.s
>
> (currently broken because of a recent change in the SearchMonkey PHP API
> that we need to address, so here's a photo:
>
> http://www.flickr.com/photos/ysearchblog/2869419185/
> )
>
> By adding the CC RDFa markup to your page, it will show up with the
> infobar in Yahoo searches.

Ah, hadn't considered a net-wide SearchMonkey script.  Interesting.

This brings up different issues, however.  Something I see
immediately: Say I'm a scammer.  I know that the CC SearchMonkey app
is in wide use (pretend, here).  I start putting CC-RDF data in spam
blog comments, with my own spammy stuff in the relevant fields.  Now
people don't even have to click on the blog link in the search results
and read my obviously spammy comment to be introduced to my offers for
discount Viagra!  They'll just see a little CC bar, click on it to
have it open in-place, and there I am.  I could even hide my link in
legitimate license data, so that people only hit my malicious site
when they click the link to see more information about the license.

Issues like these make wide-scale auto-trusted use of metadata
difficult.  They also make me more reluctant to want it in the spec
yet.  I'd rather see the community work out these problems first.  It
may be that there's a relatively simple solution.  It may be that the
crawlers can reliably distinguish between ham and spam CC data.  But
then, it may be that there *is* no good solution enabling us to use
this approach, and this kind of metadata on arbitrary sites just can't
be trusted.

I, personally, don't know the answer to this yet.  I suspect that you
don't, either; if the arbitrary-site CC infobar works at all, it's
because few people *use* CC RDF yet, and so it's still limited to a
community with implicit trust.

> So site-specific microformats are clearly less powerful. And
> vocabulary-specific microformats, while useful, are also not as useful
> here (consider a SearchMonkey application that picks up CC-licensed
> items, be they video, audio, books, scientific data, etc... Different
> microformats = development hell.)

Indeed, they are less powerful.  As I explored above, though, too much
power can be damning. It may be that the site-specific little-m
microformat (or something equivalent, allowing a developer to extract
metadata through actively targeting site structure) is powerful enough
to be useful, but weak enough to *remain* useful in the face of abuse.
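
To make the contrast concrete, here's a rough sketch in Python of the
difference between an app scoped to URL patterns its developer chose
and one registered for "*".  All the names, URLs, and the registry
shape are made up for illustration; this is not the SearchMonkey API,
just the general idea of gating extraction on an explicit URL list.

```python
from fnmatch import fnmatchcase

# Hypothetical app registrations (not the real SearchMonkey format).
# Each app lists the URL patterns whose metadata it is willing to use.
SITE_SPECIFIC_APP = {
    "name": "reviews-infobar",
    "patterns": ["http://reviews.example.com/*"],
}
NET_WIDE_APP = {
    "name": "cc-infobar",
    "patterns": ["*"],  # matches every URL, trusted or not
}


def app_applies(app, url):
    """Return True if any of the app's URL patterns matches the page URL."""
    return any(fnmatchcase(url, pattern) for pattern in app["patterns"])


def extract_if_targeted(app, url, page_metadata):
    """Hand the page's metadata to the app only for URLs it targets.

    Returns a copy of the metadata, or None when the URL is out of scope.
    """
    if not app_applies(app, url):
        return None
    return dict(page_metadata)
```

With this, spammy metadata planted on an arbitrary blog never reaches
the site-specific app, but goes straight through the "*" one.  The
scoped version is weaker, and that is exactly why it holds up better.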

(Also, I know CC is sort of the darling of the RDFa community, but
the debate over in-band vs. out-of-band licensing info is significant
enough to distract from the core issues we're trying to discuss here,
so it's probably not the best example to use.)

> Have you read the RDFa Primer?
> http://www.w3.org/TR/xhtml-rdfa-primer/
>
> It describes (pre-SearchMonkey) the kind of applications that can be
> built with RDFa. SearchMonkey is an ideal example, but it's by no means
> the only one.

Yup; I was an active participant in this discussion when it started
last August.  The example applications discussed in the Primer,
unfortunately, are precisely the kind where trusting metadata is
likely a *bad* idea.  For example, finding reviews of shows produced
by friends of Alice, using FOAF and hReview, is rife with opportunity
for spamming.  SearchMonkey seems to avoid this for the most part;
when designing applications for particular URLs, at least, you are
relying on relatively trustworthy data, not arbitrary data scattered
across the web.  Perhaps something similar has application within
trusted networks, but that is a completely different use case from
the one SearchMonkey addresses, with possibly different
requirements.

~TJ