[whatwg] RDFa

Sun Aug 24 12:17:23 PDT 2008

On Aug 23, 2008, at 18:16, Dan Brickley wrote:
> It may not be obvious to those who haven't followed the history, or  
> who were at school at the time, but many of us did indeed invest a  
> lot of time and effort using name/value metadata structures in HTML.  
> For example, the Dublin Core project began with this technology base  
> beginning back in 1994/5, and the experience of metadata  
> implementors using it was one of the drivers for the creation of  
> RDF. At the time there was no WHATWG to talk to, but the metadata  
> community *did* talk to W3C.

I don't doubt that there's metadata that doesn't fit into name-value  
pairs nicely. However, the title of the work, the license for the work  
as a whole, attribution wish (a natural-language string with  
potentially multiple names, commas and "and"s) and a single  
attribution URL all fit into name-value pairs, so for CC licensing, a  
graph seems like overkill.
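
As a rough sketch of what I mean by name-value pairs here (Python, standard library only): the page below is made up, and the key names "title", "attribution" and "attribution-url" are placeholders for illustration, not a proposed vocabulary. Only rel="license" is an existing convention.

```python
from html.parser import HTMLParser


class NameValueExtractor(HTMLParser):
    """Collects <meta name content> and <link rel href> as flat pairs."""

    def __init__(self):
        super().__init__()
        self.pairs = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and "name" in a and "content" in a:
            self.pairs[a["name"]] = a["content"]
        elif tag == "link" and "rel" in a and "href" in a:
            self.pairs[a["rel"]] = a["href"]


# Hypothetical page; the key names other than "license" are assumptions.
PAGE = """
<meta name="title" content="Sunset over Helsinki">
<meta name="attribution" content="Alice Example, Bob Example and Carol Example">
<link rel="attribution-url" href="http://example.org/sunset">
<link rel="license" href="http://creativecommons.org/licenses/by/3.0/">
"""


def extract_pairs(html):
    """Returns every name-value pair found in the markup as a plain dict."""
    parser = NameValueExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs


pairs = extract_pairs(PAGE)
print(pairs["license"])
```

The attribution wish stays a single natural-language string; nothing in the licensing use case seems to require splitting it into a graph of people.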

Of course, there's the issue of conveying that data for each subwork  
of a larger work that remixes many works. But can we expect John Q.  
Public to convey that data so that there's something to be DRY with in  
a case where the subworks aren't independent files that could carry  
their own metadata? That is, if the larger work remixes multiple  
photos in a single Theora video stream or into one large JPEG file,  
can we really expect tools (or John Q. Public manually) to be able to  
address into the larger work in such a way that any syntax other than  
natural language identifies which subwork had which license and  
attribution requirement?

> Does the very loosely defined Dublin Core really qualify as a  
> "standard" that can be read and processed programmatically?

Thanks for the pointers to history. I wasn't aware that the Dublin  
Core community had itself documented this fundamental problem with  
Dublin Core so early on. I have run into this problem myself when in a  
past project I inherited a metadata spec that my predecessors had  
modeled after Dublin Core without having experience of developing  
software.

> DC.creator.phone.1
> 	+44 227 462062

In this particular instance, it seems to me that the main problem  
isn't that the metadata doesn't fit into key-value pairs but that the  
metadata that doesn't fit probably doesn't *really* need to be recorded as  
metadata. If you are creating a document search engine, does the user  
ever want to search documents by the authors' phone numbers? If the  
user searches by other criteria, does the phone number *really* need  
to be extractable for display in search results?

I realize it sounds offensive to suggest that someone doesn't need the  
metadata they say they need, but when I worked (briefly) on metadata  
for long-term preservation of digital files in the National Archives  
of Finland, it became apparent pretty quickly that at least some  
metadata specs aren't driven by considering what absolutely *must* be  
there to satisfy realistic use cases but by modeling what *could* be  
said about the domain and inventing fields for everything *just in  
case*.

> Looking at this example,
>
>          <div id="license" about="#license" typeof="rdf:Property">
>              <h4>cc:license</h4>
>              A <a rel="rdfs:domain" href="#Work">Work</a>
>              <span property="rdfs:label">has license</span>
>              a <a rel="rdfs:range" href="#License">License</a>. <br />
>
>              (a <a rel="rdfs:subPropertyOf" href="http://purl.org/dc/terms/license">subproperty of dc:license</a>,
>              <a rel="owl:sameAs" href="http://www.w3.org/1999/xhtml/vocab#license">the same as xhtml:license</a>)
>          </div>
>
>
> Actually we can do a fair bit more than simply have human readable  
> strings. For example from the CC case, we've got a sub-property  
> relationship between cc:license and dc:license.
[...]
> So while it is useful to have human readable strings (including  
> translations) we also get simple relationships between independently  
> defined vocabulary terms.

And in www-archive:
On Aug 23, 2008, at 23:59, Ben Adida wrote:

> Henri Sivonen wrote:
>> Also, in this case, the prefix cc is actually more persistent than  
>> the
>> URI, since Creative Commons has changed the namespace URI of its RDF
>> vocabulary without changing the canonical prefix (from
>> http://web.resource.org/cc/ to http://creativecommons.org/ns#).
>
> Highly misleading statement, since we are also creating equivalences
> between the old and new namespace. That's the power of RDF.

How common is it that user-facing applications that use RDF metadata  
dereference namespace URIs, load declarations of equivalence or  
subclass relationships between properties and successfully map  
vocabularies created after the creation of the application to the  
vocabulary understood by the application? Are there known instances of  
applications that were programmed to process  
http://web.resource.org/cc/ metadata in an XML-wise correct way (i.e.  
not using regular expressions matching on "cc:") and that automatically  
processed http://creativecommons.org/ns# metadata right by  
autodiscovering the equivalence? (These are not rhetorical questions. I  
really don't know and am curious. My intuition suggests that this  
wouldn't be a common occurrence.)

Where some see "the power of RDF", others see "the RDF tax". There's a  
tradeoff between making the common case simple and making things  
powerful for the less common and more complex cases. The simple case  
is finding out what license a document is under. Compared to looking  
up a string value by unstructured opaque string key from within the  
file, it's very different to extract an RDF graph from a file,  
dereference all namespace URIs using a network connection relying on  
hosts being reachable, load data describing equivalence and subclass  
relations--perhaps recursively--and simplify until the application  
sees a value connected to a property it is programmed to know about.
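
To make the difference concrete, here is a minimal sketch in Python. It is not how any real RDF toolkit works: FAKE_WEB is a hypothetical stand-in for fetching namespace documents over the network, the equivalence triple in it is assumed for illustration, and owl:sameAs is only followed in one direction.

```python
OLD = "http://web.resource.org/cc/license"
NEW = "http://creativecommons.org/ns#license"
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
SUB_PROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"

# Hypothetical stand-in for the network: namespace URI -> triples that
# dereferencing it might return. A real fetch can fail or time out.
FAKE_WEB = {
    "http://creativecommons.org/ns#": [(NEW, SAME_AS, OLD)],
    "http://web.resource.org/cc/": [],
}


def namespace_of(uri):
    """Namespace part of a property URI: up to '#', else up to the last '/'."""
    cut = uri.rfind("#")
    if cut == -1:
        cut = uri.rfind("/")
    return uri[: cut + 1]


def maps_to(prop, known, seen=None):
    """True if prop is known, or is declared to map to known, recursively."""
    seen = set() if seen is None else seen
    if prop == known:
        return True
    if prop in seen:  # guard against cyclic declarations
        return False
    seen.add(prop)
    for s, p, o in FAKE_WEB.get(namespace_of(prop), []):
        if s == prop and p in (SAME_AS, SUB_PROPERTY_OF):
            if maps_to(o, known, seen):
                return True
    return False


def license_from_graph(triples, subject, known=OLD):
    """The graph route: test every property against the known vocabulary."""
    for s, p, o in triples:
        if s == subject and maps_to(p, known):
            return o
    return None


def license_from_pairs(pairs):
    """The simple route: one lookup by an opaque string key."""
    return pairs.get("license")


BY = "http://creativecommons.org/licenses/by/3.0/"
doc = [("http://example.org/photo", NEW, BY)]
print(license_from_graph(doc, "http://example.org/photo"))
print(license_from_pairs({"license": BY}))
```

Even with the network faked and the graph already extracted, the first route needs namespace splitting, cycle protection and recursion before the application, which only knows the old property, sees the license. The second route is a dictionary lookup.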

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/