[whatwg] Trying to work out the problems solved by RDFa

Calogero Alex Baldacchino alex.baldacchino at email.it
Sat Jan 3 11:22:25 PST 2009


Dan Brickley wrote:
> On 3/1/09 14:02, Julian Reschke wrote:
>> Tab Atkins Jr. wrote:
>>> The most successful alternative is nothing at all. ^_^ We can
>>> extract copious data from web pages reliably without metadata, either
>>> using our human senses (in personal use) or natural-language-based
>>> processing (in search engine use). It has not yet been established
>>> that sufficient and significant enough problems *exist* to justify a
>>> solution, let alone one that requires an addition to html. That is
>>> what Ian is specifically looking for.
>>
>> That's what you and Ian claim. Many disagree.
>
> My main problem with the natural language processing option is that it 
> feels too close to waiting for Artificial Intelligence. I'd rather add 
> 6 attributes to HTML and get on with life.
>
> But perhaps a more practical concern is that it unfairly biases things 
> towards popular languages - lucky English, lucky Spanish, etc., and 
> those that lend themselves more to NLP analysis. *The Web is for 
> everyone*, and people shouldn't be forced to read and write English to 
> enjoy the latest advances in *Web automation*. Since HTML5 is going 
> through W3C, such considerations need to be taken pretty seriously.
>

My concern is: is RDFa really suitable for everyone and for Web 
automation? My own answer, at first glance, is no. RDF(a) can perhaps 
address very niche needs nicely, where determining how much data can 
be trusted is not a problem; but in general, both misuse and deliberate 
abuse may harm automation heavily, since an automaton is unlikely to be 
able to tell whether metadata express the real meaning of a web page 
or not (without a certain degree of AI).

If an external mechanism is needed to determine the trust level of 
metadata, that is, to establish when the results of an automated 
process are good or bad, such a mechanism may involve human beings at 
some stage, thus breaking automation (this is somewhat similar to the 
problem of the "oracle machine" described by Turing, which, according 
to him, isn't an automaton).

On the other hand, a custom model designed for very specific needs (and 
not requiring wide support) may be less prone to abuse, since someone 
is unlikely to cheat himself. Thus, having third parties agree on a 
certain model and related APIs, and implement those APIs on their own 
sides, might be more reliable in some cases (though the third parties 
would still have to agree that their respective metadata are reliable, 
and find a way to verify that they really are).

Dan Brickley wrote:
> On 3/1/09 16:54, Håkon Wium Lie wrote:
>> Also sprach Dan Brickley:
>>
>>   >  My main problem with the natural language processing option is
>>   >  that it feels too close to waiting for Artificial Intelligence.
>>   >  I'd rather add 6 attributes to HTML and get on with life.
>>
>> :-)
>
> Another thought re NLP. RDFa (and similar, ...) are formats that can 
> be used for writing down the conclusions of NLP analysis. For example 
> here see the BBC's recent Muddy Boots experiment, using DBPedia 
> (Wikipedia in RDF) data to drive autoclassification / named entity 
> recognition. So here we can agree with Ian and others that text 
> analysis has much to offer, and still use RDFa (or other semantic 
> markup - i'll sidestep that debate for now) as a notation for marking 
> up the words with a machine-friendly indicator of their NLP-guessed 
> meaning.
>
> http://www.bbc.co.uk/blogs/journalismlabs/2008/12/muddy_boots.html
>
>> Personally, I think the 'class' attribute may still be a more
>> compelling option in a less-is-more way. It already exists and can
>> easily be used for styling purposes. Styling is bait for authors to
>> disclose semantics.
>
> I'm sure there's mileage to be had there. I'm somehow incapable of 
> writing XSLT so GRDDL hasn't really charmed me, but 'class' certainly 
> corresponds to a lot of meaningful markup. Naturally enough it is 
> stronger at tagging bits of information with a category than at 
> defining relationships amongst the things defined when they're 
> scattered around the page. But that's no reason to dismiss it entirely.
>
> Did you see the RDF-EASE draft, 
> http://buzzword.org.uk/2008/rdf-ease/spec? From which comes: "Ten 
> second sales pitch: CSS is an external file that specifies how your 
> document should look; *RDF-EASE is an external file that specifies 
> what your document means.*"
>
> RDF-EASE uses CSS-based syntax. More discussion here, 
> http://lists.w3.org/Archives/Public/semantic-web/2008Dec/0148.html 
> including question of whether it ought to be expressed using 
> css3-namespace, 
> http://lists.w3.org/Archives/Public/semantic-web/2008Dec/0175.html
>
> cheers,
>
> Dan
>
> -- 
> http://danbri.org/
>

My question is: how often can I trust that such a file specifies what 
your document really means, without evaluating its content?

I'd distinguish two cases (without pretending to make a complete 
classification):

- The semantics described by the metadata is used for server-side 
computations: there's no need to evaluate content (since I'm already 
trusting you when navigating your site, and you're unlikely to be 
purposely messing with yourself), nor is there any need for client-side 
support for such metadata (by the UA). This is the case of a 
centralised database.

For instance, a *pedia page may send queries to the server, which 
processes them and sends results back to the user (a small sketch 
follows).
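Something like this, purely as an illustration of mine (the endpoint, 
the query and the SPARQLWrapper Python library are an example setup, 
not part of any proposal): the client only ever talks to one endpoint 
it has already chosen to trust, so trust never needs to be computed 
per triple.

    # Sketch only: querying a single, already-trusted endpoint
    # (DBpedia's public SPARQL service). Endpoint and query are
    # illustrative assumptions.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?label WHERE {
            <http://dbpedia.org/resource/Palermo> rdfs:label ?label .
            FILTER (lang(?label) = "en")
        }
    """)
    sparql.setReturnFormat(JSON)

    # One source: trusting the answers reduces to trusting the site,
    # exactly as when navigating it by hand.
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["label"]["value"])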

- The UA must understand the metadata and automatically gather 
information mashed up in a page from several sources: each source must 
be actively evaluated and trusted (which a bot can't do). This is the 
case of a decentralised database.

For instance, it's easy to think of a spamming advertiser who 
apparently puts honest content into your pages (which maybe take 
reliable content from DBpedia), while using fake metadata to cheat my 
browser and feed me irrelevant information (or information I'm not 
interested in) when I ask for related content [1], perhaps without you 
even guessing what's going on (and you may be losing visitors because 
of that).
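To show where automation stops here, another rough sketch of mine (all 
the names are made up): even once the triples are extracted from the 
mashed-up page, the best a bot can do is apply a whitelist that some 
human has already compiled; it cannot work out by itself that the 
advertiser's metadata is lying.

    # Sketch with made-up names: triples already extracted from a
    # mashed-up page, each tagged with the source it came from.
    TRUSTED_SOURCES = {"dbpedia.org"}  # compiled by a human, not by the bot

    def filter_triples(triples):
        """Keep only triples whose source a human has already vetted."""
        return [(s, p, o) for source, s, p, o in triples
                if source in TRUSTED_SOURCES]

    page_triples = [
        ("dbpedia.org", "ex:Palermo", "ex:locatedIn", "ex:Sicily"),
        ("ads.example", "ex:Palermo", "ex:relatedTo", "ex:BuyPillsNow"),
    ]

    # The bot can apply the whitelist, but it cannot build it: deciding
    # that ads.example is cheating means reading the actual content.
    print(filter_triples(page_triples))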

For obvious reasons, a trust evaluation mechanism can't be as easy as 
getting/creating a signature to be used in a secure connection, because 
someone must actively evaluate at least two things:
- that the metadata really reflects the resource's content, and
- that the metadata is used properly with respect to the external 
schema used to model the data (sketched below -- otherwise, no 
relationship would be reliable; however, this might be a minor concern 
from a certain angle, since misused metadata might be less harmful than 
deliberately abused metadata).
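The second check, at least, can be mechanised to some extent; here's a 
rough sketch of mine (made-up names and a toy schema). The first check 
is the one no bot can perform, because it means comparing the metadata 
with what the page actually says.

    # Sketch only: does a predicate respect the (toy) schema it comes from?
    SCHEMA = {  # predicate -> (expected subject type, expected object type)
        "ex:author": ("ex:Document", "ex:Person"),
    }

    def schema_violations(triples, types):
        """Report triples whose subject/object types contradict the schema."""
        problems = []
        for s, p, o in triples:
            expected = SCHEMA.get(p)
            if expected and (types.get(s), types.get(o)) != expected:
                problems.append((s, p, o))
        return problems

    triples = [("ex:page1", "ex:author", "ex:ACME_Corp")]
    types = {"ex:page1": "ex:Document", "ex:ACME_Corp": "ex:Organisation"}
    print(schema_violations(triples, types))  # flags the mismatched object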

The result can be very expensive (like certifying a driver or an 
application for a certain platform), or it can lead to the free choice 
of avoiding any evaluation and simply trusting whatever third parties 
say. Both solutions may work, perhaps, for niche/limited cases, but I 
don't think either is a good basis for "global", general-purpose 
automation.

[1] That's not the same as using the @rel attribute without any 
relationship to other metadata: a UA may just provide a link described 
as pointing to a resource related to the surrounding content, so that I 
can choose whether or not to follow it; but if the @rel attribute is 
used by an automated mechanism in response to a query and in 
combination with other metadata, the UA must decide on its own whether 
a link is worth following, and I don't think there is any easy way to 
make automated decisions involving trust.

Best regards,
Alex
 
 