[whatwg] RDFa

Dan Brickley danbri at danbri.org
Tue Aug 26 00:32:21 PDT 2008

Ian Hickson wrote:
> On Sat, 23 Aug 2008, Julian Reschke wrote:
>> Again you're confusing HTTP URLs with URIs.
>> Using URIs as identifiers allows lots of identification schemes other 
>> than HTTP, in particular ones that are not based on DNS, or that use 
>> DNS, but include a timestamp to address the concern of "losing" a domain 
>> name (tag URI scheme).
> Sure, but most people use HTTP URIs anyway for namespaces.
> You can use any URI or any system you want with class="". The key is just 
> to make it unique enough that clashes won't happen. In practice, names 
> like "dc:title" are actually quite unique enough. But people can use much 
> more unique ones if desired, all the way to full URIs.

I'm certainly in favour of making mainstream namespace names prettier. 
But this design worries me, since it requires guesswork and heuristics 
on the part of consumer code to figure out if class = "info.age" or 
"museum.acquisitionDate" is intended as a URI or not. I'll air the worry 
first, and then sketch an approach that makes me worry less and which 
might have some of the characteristics that you value (such as not 
depending on separate xmlns-like declarations of abbreviations, and not 
being too ugly to look at).
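
To illustrate the guesswork, here is a minimal Python sketch of the kind 
of heuristic a consumer might end up writing. The KNOWN_TLDS set and the 
possible_readings function are hypothetical names invented for this 
illustration, and the TLD list is deliberately tiny; nothing here is 
taken from a spec.

```python
from typing import List

# Hypothetical, hard-coded TLD list (an assumption for illustration only).
KNOWN_TLDS = {"com", "org", "net", "info", "museum"}


def possible_readings(class_value: str) -> List[str]:
    """Return every plausible interpretation of a dotted class value."""
    labels = class_value.lower().split(".")
    readings = ["plain class name"]  # always a possible reading
    if len(labels) > 1:
        if labels[-1] in KNOWN_TLDS:
            readings.append("domain name (TLD last)")
        if labels[0] in KNOWN_TLDS:
            readings.append("reverse-domain name (TLD first)")
    return readings


for value in ("info.age", "museum.acquisitionDate", "title"):
    print(value, "->", possible_readings(value))
```

With this list, "info.age" and "museum.acquisitionDate" each get more 
than one reading, and the markup alone gives the consumer no way to pick 
between them.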

You mentioned earlier that the RDFish practices around downloading and 
interpreting schemas from the Web are news to you. I'll take up an action 
to document some of the things we do in that area (eg. with SPARQL for 
data merging), probably as a blog post.

Doing so would help as background on my next point, which is that making 
it ambiguous whether a URI was declared is something that would need 
careful security review, to ensure that data consumers are aware that 
they should not expect property definitions found at the domain to be 
consistent with the intended meaning of the markup.

Sketch of a scenario:

1. Alice deploys <span class="creationDate.info">1979</span> to describe 
a museum artifact. She calls it this because it marks up some 
information about the creation date of some real world thing, and 
because 'creationDate' is already in use for describing page creation 
dates, in the CSS library she's using.

2. Bob buys himself the Internet domain creationDate.info and wires up a 
webserver to respond with an RDFa schema defining creationDate as a 
sub-property of http://ecommerce.example.com/vocab#priceInEuros.

3. Charlie's code downloads Alice's markup, parses out the RDFa and, 
noticing that creationDate.info seems to be dereferenceable, goes to 
fetch the schema. For every triple "x creationDate y" in the document, 
it also generates "x ecom:priceInEuros y". Perhaps Bob is selling 
other museum artifacts and wants to make Alice's look more expensive. Or 
cheaper. Or to make her data look corrupted so that certain consumers 
won't include her listing. Or maybe he wants to buy the item cheaply and 
is probing for bugs in Alice's online shopping system.
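
Step 3 can be sketched in a few lines of Python. This is only an 
illustration under stated assumptions: the function apply_subproperties, 
the subject name "artifact42" and the dictionary standing in for Bob's 
downloaded schema are all made up here, and no real RDF library or 
network fetch is involved.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, property, value)


def apply_subproperties(
    triples: List[Triple], sub_property_of: Dict[str, str]
) -> List[Triple]:
    """Add 'x parent y' for every 'x p y' where the schema says p is a
    sub-property of parent (a single, non-transitive inference step)."""
    inferred = list(triples)
    for subject, prop, value in triples:
        parent = sub_property_of.get(prop)
        if parent is not None:
            inferred.append((subject, parent, value))
    return inferred


# Alice's markup, as Charlie's parser might see it (hypothetical subject).
alice_data = [("artifact42", "creationDate.info", "1979")]

# What Bob's server claims about the property (stand-in for a fetched schema).
bob_schema = {
    "creationDate.info": "http://ecommerce.example.com/vocab#priceInEuros"
}

for triple in apply_subproperties(alice_data, bob_schema):
    print(triple)
```

The consumer never checks whether Alice meant to point at Bob's domain, 
so her 1979 creation date also shows up as a price in euros.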

In other words, the fact that Alice's markup only *appears* to be using 
an Internet domain opens her up to the risk that someone will buy that 
domain and put a fake schema there, one which changes the likely 
interpretation of her markup. This exposure is increased by our 
uncertainty about ICANN strategy: we can't rely on the assumption that 
there are only a tiny handful of TLDs. We can probably rely on them 
being expensive at the top level, but not on having a hardcoded list 
enumerating them.

Icann has announced it will allow the creation of any new top-level 
domains, albeit at a considerable cost.

As well as opening the door to an influx of new web addresses, Icann has 
also said that it will allow Japanese, Chinese, Arabic and Cyrillic 
characters to be used in registrations for the first time.

"It's a massive increase in the real estate of the internet. It will 
allow groups, communities and businesses to express their identities 
online," says Paul Twomey, chief executive of Icann, speaking to the Times.

The RDF approach generally has been to make it very clear which chunks 
of data contain URIs, and whether they can be relative or not. Other 
markup systems have adopted a similar approach. These share the merit 
that it makes such ambiguity much less of a problem (although there are 
other attacks of course).

Lately I've been thinking that perhaps we can get something less ugly 
than "http://" in the markup, yet specify rules that allow expansion to 
http:// or https:// while keeping it clear whether the markup author 
really intends to cite some domain/page as vocabulary documentation.

For example <p>I'm <span property="info.foaf/age">1979</span> years old</p>
(if FOAF was documented at http://foaf.info/age and we specified the 
property attribute to use java-style names, and be declared relative to 
the http:// scheme).

Or <p>I'm <span property="foaf/age">1979</span> years old</p>
(if I spend $100k at ICANN to buy a tld 'foaf')

or <p>I'm <span property="com.xmlns.foaf.age">1979</span> years old</p>
(if I did some Apache config sysadmin on xmlns.com)

<p>I'm <span property="http://xmlns.com/foaf/0.1/age">1979</span> years old</p>
(if this was written out in fullest form, and if the 'age' property 
existed yet in FOAF).
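
One possible reading of these examples, as a minimal Python sketch. The 
function name expand_property and the exact rule are assumptions made 
for illustration: reverse the dot-separated labels before the first "/", 
treat whatever follows as the path, default to the http:// scheme, and 
pass full http:// or https:// URIs through untouched. None of this is 
specified anywhere.

```python
def expand_property(name: str, scheme: str = "http") -> str:
    """Expand a java-style property name into a full URI (assumed rule)."""
    # Fully written-out URIs are left exactly as the author wrote them.
    if name.startswith(("http://", "https://")):
        return name
    # Everything before the first "/" is the reversed host name;
    # everything after it is the path.
    host_part, _, path = name.partition("/")
    host = ".".join(reversed(host_part.lower().split(".")))
    return f"{scheme}://{host}/{path}"


for prop in (
    "info.foaf/age",
    "foaf/age",
    "com.xmlns.foaf.age",
    "http://xmlns.com/foaf/0.1/age",
):
    print(prop, "->", expand_property(prop))
```

Under this rule a name with no "/" is read entirely as a host name, so 
the third example expands to a URI on the host age.foaf.xmlns.com.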

Such a design would open things to a marketplace in a real sense. 
Parties who wanted nice short URLs for their properties could beg, 
borrow or buy the appropriate domain names. The reverse-domain format 
from Java would be a bit unusual for people used to the HTTP/browser 
way. Perhaps property="age.foaf.xmlns.com" is equally readable?

The main cost here is that our prettification strategy is syntactically 
indistinguishable from relative URIs. So we could only reliably use it 
in attributes where we know we don't have a relative URI. For 
properties, that seems fine. For the subjects and objects of statements 
(ie. the things the properties apply to, or take as values) this would 
require further thought.

Am I making any sense here? (regardless of whether you agree...)



