[whatwg] RDFa

Tue Aug 26 03:55:53 PDT 2008

Ian Hickson wrote:
> On Tue, 26 Aug 2008, Dan Brickley wrote:
[...]
>> Sketch of a scenario:
>>
>> 1. Alice deploys <class="creationDate.info">1979</class> to describe a 
>> museum artifact. She calls it this because it marks up some information 
>> about the creation date of some real world thing, and because 
>> 'creationDate' is already in use for describing page creation dates, in 
>> the CSS library she's using.
>>
>> 2. Bob buys himself the Internet domain creationDate.info and wires up a 
>> webserver to respond with an RDFa schema defining creationDate as a 
>> sub-property of http://ecommerce.example.com/vocab#priceInEuros.
> 
> I have no idea what this means or why anyone would want to do that, but 
> let's continue:

Bob is being mischievous.

>> 3. Charlie's code downloads Alice's markup, parses out the RDFa and, 
>> noticing that creationDate.info seems to be dereferenceable, goes to 
>> fetch the schema.
> 
> Step 3 seems totally crazy on several levels, but let's continue:

Charlie is working in the RDF style, which assumes each vocabulary term 
is associated with a real URI documenting some useful bits of 
information about the term. For example the things it makes sense to 
apply it to (its "domain"), or the kinds of value it can take (its 
"range"). Or other classes it is a sub-class of; or other properties 
that it is a sub-property of.

In this scenario Charlie is making a mistake, and treating strings that 
look like URIs as if they had other properties of URIs.
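
A minimal Python sketch of the check Charlie skipped: deciding whether a property name is an actual absolute URI before attempting any lookup. The function name is hypothetical, and accepting only http and https is an assumption made for illustration, not something RDFa requires.

```python
from urllib.parse import urlsplit


def is_dereferenceable_uri(name: str) -> bool:
    """Return True only if `name` is an absolute http(s) URI with a host.

    Hypothetical helper: a bare string such as "creationDate.info" has no
    scheme, so it is treated as an opaque name, never as something to fetch.
    """
    try:
        parts = urlsplit(name)
    except ValueError:
        # Malformed input (for example a broken IPv6 literal) is not a URI.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Alice's class value only looks like a domain name; it is not a URI.
print(is_dereferenceable_uri("creationDate.info"))
# A real property URI passes the check.
print(is_dereferenceable_uri("http://ecommerce.example.com/vocab#priceInEuros"))
```

With a guard like this in place, Alice's "creationDate.info" stays an uninterpreted string and Bob's domain purchase has no effect on Charlie.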

>> For every triple "x creationDate y" in the document, it also generates 
>> "x ecom:priceInEuros y" too. Perhaps Bob is selling other museum 
>> artifacts and wants to make Alice's look more expensive. Or cheaper. Or 
>> to make her data look corrupted so that certain consumers won't include 
>> her listing. Or maybe he wants to buy the item cheaply and is probing 
>> for bugs in Alice's online shopping system.
> 
> Why would Charlie ever depend on Bob for anything to do with Alice's site? 
> That seems like a disaster waiting to happen.

Yes. The scenario here was intended to illustrate the danger of a system 
in which URI-esque strings are treated as an acceptable stand-in for 
real URIs.  In any healthy picture, Bob should be unable to intrude so 
easily; or Charlie should be aware that he can't do URI-style lookups to 
figure out what Alice's property names mean.
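
The inference step in the scenario can be sketched in a few lines of Python. This is only an illustration of the sub-property rule applied once, using plain tuples rather than a real RDF library; the subject name "artifact" and the function name are hypothetical.

```python
SUBPROPERTY_OF = "rdfs:subPropertyOf"


def apply_subproperty_rule(data, schema):
    """Return `data` plus the triples entailed by sub-property statements.

    Single pass only: chains of sub-properties are not followed.
    Triples are (subject, property, value) tuples.
    """
    # Map each property to the set of properties it is declared a sub-property of.
    supers = {}
    for s, p, o in schema:
        if p == SUBPROPERTY_OF:
            supers.setdefault(s, set()).add(o)

    inferred = set(data)
    for s, p, o in data:
        for super_property in supers.get(p, ()):
            inferred.add((s, super_property, o))
    return inferred


# Alice's claim, as Charlie parsed it (subject name is hypothetical).
alice = {("artifact", "creationDate.info", "1979")}

# What Bob serves from the domain he bought.
bob_schema = {
    ("creationDate.info", SUBPROPERTY_OF,
     "http://ecommerce.example.com/vocab#priceInEuros"),
}

result = apply_subproperty_rule(alice, bob_schema)
for triple in sorted(result):
    print(triple)
```

Once Charlie merges Bob's schema, the artifact acquires a priceInEuros of 1979 that Alice never stated, which is exactly the intrusion described above.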

> For that matter, why would Charlie trust Alice _or_ Bob? Bob could easily 
> just lie on his own prices, or, if Charlie is busy downloading things from 
> Bob's site, could just feed up bogus data about Alice directly, without 
> having to go through the indirection layer of defining what Alice is doing 
> to mean something when it doesn't really mean anything.

Yup, I should have been clearer above: Bob is trying to influence 
Charlie's interpretation of Alice's data.

> Similarly, Alice
> could just include totally bogus data on her site, about either her own 
> stuff or about Bob's.

Quite right. We are dealing in claims here, not absolute truth. Alice is 
trying to publish some claims, and Charlie is trying to figure out what 
she's claiming. Either party may have mischief, marketing or other 
agendas on their mind. Bob is trying to influence Charlie's 
interpretation of Alice's claims, by exploiting an RDFish desire to look 
up property definitions in the Web.

> If Charlie wants to work with Alice's site, he should agree with Alice 
> about what vocabularies they're going to use, and then only use that. 

Are you suggesting pairwise contracts between producers and consumers of 
HTML-embedded structured data? How would you suggest we structure and 
document these agreements?

> That's how standards work, you agree on common vocabularies and then use 
> those for interoperability. For example, everyone agrees on HTML's 
> vocabulary as a way to describe documents (and now applications).

Yup. Question is how we extend that to cover data properties like 
'creationDate' and 'priceInEuros' in a way that allows for diversity 
while supporting incremental progress towards shared vocabulary.

> Anyway. I assume that I'm missing something that is part of the problem 
> that is being solved, so maybe this will make more sense after I've read 
> Manu's e-mail.

Yup, I think it's the idea of actually doing something with URIs 
for property names, rather than their just being long ugly strings. 
We'll get more written up on that.

>> In other words, the fact that Alice's markup only *appears* to be using 
>> an Internet domain opens her up to risk that someone will go buy that 
>> domain, and put a fake schema there which affects the likely 
>> interpretation of her markup.
> 
> This same problem exists with URIs. What happens if everyone is pointing 
> to w3.org for their definition of "price", and then someone hacks the W3C 
> servers and suddenly the whole Web's meaning changes for whoever is using 
> this magic "follow your nose" principle?

This is certainly a concern, and consumers of data will need to be 
cautious. PGP-signed schemas may be part of the answer. Using https:// 
for namespaces may be part of the answer. The ability to consult 
additional services to see how they think a property is used in practice 
(e.g. crawler stats) may be part of the answer. Having sites provide an 
emergencies-only list of compromised schema hosts could be another.
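
Two of these ideas, https:// namespaces and a list of compromised schema hosts, could be combined into a simple pre-fetch check. The Python below is a sketch under assumptions: the function name, the example hosts and the idea of refusing every non-https namespace are all illustrative, and no such list format exists today.

```python
from urllib.parse import urlsplit


def may_fetch_schema(namespace_uri, compromised_hosts):
    """Decide whether a cautious consumer should dereference a namespace.

    Hypothetical policy: require https and refuse any host that appears
    on a (hypothetical) published list of compromised schema hosts.
    """
    parts = urlsplit(namespace_uri)
    if parts.scheme != "https":
        return False
    host = parts.hostname  # already lower-cased by urlsplit
    return host is not None and host not in compromised_hosts


# Example hosts are made up for illustration.
compromised = {"bad.example.net"}
print(may_fetch_schema("https://vocab.example.org/terms#price", compromised))
print(may_fetch_schema("http://vocab.example.org/terms#price", compromised))
print(may_fetch_schema("https://bad.example.net/terms#price", compromised))
```

A real consumer would likely choose a less strict policy for some applications, in line with the different levels of caution different apps will need.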

A motivator for using URIs is that at least we can be clear what we're 
disagreeing about. I could ask swoogle.com (an RDF index) for real stats 
on how http://www.w3.org/2003/01/geo/wgs84_pos#lat, #long and #alt are 
used in practice. Actually I did just this a while back for other 
reasons, see the report "How the W3C geo vocabulary is used" from the 
Swoogle folks, 
http://ebiquity.umbc.edu/blogger/2006/07/27/how-the-w3c-geo-vocabulary-is-used/ 

-> 
http://web.archive.org/web/20070220095502/http://ebiquity.umbc.edu/blogger/how-the-w3cs-geo-vocabulary-is-being-used/

I think in practice, different levels of caution will be used by 
different kinds of consuming app. I don't believe there is yet any 
history of W3C's site being hacked for the purpose of altering a namespace. 
At least for RDF, schema expressivity is fairly limited; shared .js 
libraries on other sites are a much richer target. But this should be no 
cause for complacency.

> Anyway, I don't think you should ever dereference something that isn't an 
> actual URI. That's what URIs are for.

I agree wholeheartedly. Where we seem to disagree is on whether there is 
sufficient value in the use of URIs for naming the classes and 
properties in structured data. We agree I think on the nature of some of 
the costs, but apparently not on the nature and extent of the benefits. 
We perhaps also disagree on the costs associated with not using URI 
names for such properties.

cheers,

Dan

--
http://danbri.org/