[whatwg] Trying to work out the problems solved by RDFa

Fri Jan 9 20:35:27 PST 2009

Ben Adida ha scritto:
> Ian Hickson wrote:
>   
>> We have to make sure that whatever we specify in HTML5 actually is going 
>> to be useful for the purpose it is intended for. If a feature intended for 
>> wide-scale automated data extraction is especially susceptible to spamming 
>> attacks, then it is unlikely to be useful for wide-scale automated data 
>> extraction.
>>     
>
> It's no more susceptible to spam than existing HTML, as per my previous
> response.
>
>   

Perhaps this is why general purpose search engines do not rely 
(entirely) on metadata and markup semantics to classify content, nor 
does Yahoo with SearchMonkey. SearchMonkey documentation points out that 
metadata never affects page ranks, nor is semantics interpreted for any 
purpose; metadata only affects additional informations presented to the 
user at the user will, and if the user chose to get informations of a 
certain kind (gathered by a certain data service), thus spammy metadata 
can be thought as circumscribed in this case, they might corrupt 
SearchMonkey additional data, but not the user's overall experience with 
the search engine. From this point of view, SearchMonkey is some kind of 
wide-range but small-scale use case (with respect to each tool and each 
site the user might enable), because the user can easily choose which 
sources to trust (e.g. which data services to use, or which sites to 
look for additional infos), and in any case he can get enough infos 
without metadata.

On the other hand, a client UA implementing a feature entirely based on 
metadata couldn't easily circumscribe abused metadata and bring valid 
informations to the user attention, nor could the average user take 
easily trusted and spammy sites apart, because he wouldn't understand 
the problem (and a site with spammy metadata might still contain 
informations users were interested in previously, or in a different 
context), whereas in SearchMonkey the average user would notice 
something doesn't work in enhanced results, but he'd also get the basic 
infos he was looking for. Thus there are different requirements to be 
taken into account for different scenarios (SearchMonkey and client UA 
are such different scenarios)

Moreover, SearchMonkey is a kind of centralised service based on 
distributed metadata, it doesn't need collaboration by any other UA 
(that is, it doesn't need support for metadata in other software) by 
default (whereas it allows custom data services to autonomously extract 
metadata, but always for the purposes of SearchMonkey), it only requires 
that web sites adhering to the project (or just willing to provide 
additional infos) embed some kind of metadata only for the purpose of 
making them available to SearchMonkey services, or at least that authors 
create appropriate metadata and send them to Yahoo (in the form of 
dataRSS embedded in a Atom document). That is, SearchMonkey seems to me 
a clear example of a use case for metadata not requiring any changes to 
html5 spec, since any kind of supported metadata are used by 
SearchMonkey as if they were custom, private metadata; whatever happens 
to such metadata client-side, even if they're just stripped by a 
browser, doesn't really matter.

Furthermore, SearchMonkey supports several kinds of metadata, not only 
RDFa, but also eRDF, microformats and dataRSS external to the document. 
So, why should SearchMonkey be the reason to introduce explicit support 
to RDFa and not also for eRDF, which doesn't require new attributes, but 
just a parser? One might think one solution is better than the other, 
and this might be true in theory, but what really counts is what people 
do find easier to use, and this might be determined by experience with 
SearchMonkey (that is, let's see what people use more often, then decide 
what's more needed).

Moreover, RDFa is thought for xhtml, thus it can't be introduced in html 
serialization just by defining a few new attributes: a processor would 
or might need some knowledge over /namespaces/, thus the whole "family" 
of *xmlns* attributes (with and without prefixes) should be specified 
for use with the html serialization, unless an alternative mechanism, 
similar to the one chosen for eRDF, were defined, and maybe such would 
result in a new, hybrid mechanism (stitching together pieces from eRDF 
and RDFa). Buf if we introduce xmlns and xmlns:<prefix> into html 
serialization, why not also prefixed attributes? That is, can RDFa be 
introduced into html serialization "as is", without resorting to the 
whole xml extensibility? This should be taken into account as well, 
because just adding new attributes to the language might work fine for 
xml-serialized documents, but might not for html-serialized ones. This 
means RDFa support might be more difficult than it may seem at first 
glance, whereas it might not be needed for custom and/or small scale use 
cases (and I think SearchMonkey is one such case).

>> Nobody is suggesting that user agents derive any behavior from <title>, so 
>> it doesn't matter if <title> is spammed or not.
>>     
>
> And RDFa does not mandate any specific behavior, only the ability to
> express structure. The power lies in products like SearchMonkey that
> make use of this structure with innovative applications.
>
> Can one imagine tools that make poor use of this structured data so that
> they incentivize spam? Absolutely. Is this the bar for HTML5? If bad or
> poorly conceived applications can be imagined, then it's not in the
> standard?
>
>   

I think the right question should be whether there are effective counter 
measures to circumscribe bad uses and make possible damages less 
significant then advantages from good uses. When a feature in the 
standard is thought to be a possible security (or privacy) issue, 
counter-measures are proposed. Since spam is a possible immediate issue 
for abused metadata, especially in wide-scale and automated data 
extraction, we should also think to possible counter-measures to be 
specc'ed out along with RDFa attributes.

WBR, Alex

 --
 Caselle da 1GB, trasmetti allegati fino a 3GB e in piu' IMAP, POP3 e SMTP autenticato? GRATIS solo con Email.it http://www.email.it/f

 Sponsor:
 Innammorarsi è facile con Meetic, milioni di single si sono iscritti, si sono conosciuti e hanno riscoperto l'amore. Tutto con Meetic, prova anche tu!
 Clicca qui: http://adv.email.it/cgi-bin/foclick.cgi?mid=8292&d=10-1