[whatwg] RDFa Features (was: RDFa Problem Statement)

Tue Aug 26 19:50:25 PDT 2008

Hi Ian,

The second part of the replies to your questions regarding RDFa are
below. Note that the list of technical merits I was affording RDFa was
not meant to be exhaustive and I won't be adding to them in the body of
this particular e-mail. There are additional ones, however, and we can
get a more complete list of RDFa features to this mailing list in time.

Ian Hickson wrote:
>> If one understands web semantics to be an important part of the web's 
>> future, the question then becomes, why RDFa? Why not Microformats?
>>
>> While there are a number of technical merits that speak in favor of RDFa 
>> over Microformats (fully qualified vocabulary terms
> 
> Why is this better?

I am assuming that you are questioning why "fully qualified vocabulary
terms" are better than "non-qualified vocabulary terms", so I will
answer based on that assumption.

There are several possibilities when specifying vocabulary terms that
the Microformats and RDFa communities have explored: strictly
non-prefixed terms, emulated/pseudo-namespace qualified terms, and fully
qualified URLs.

Strictly Non-prefixed vocabulary terms
--------------------------------------

The non-prefixed terms approach is what Microformats do 90% of the time.
For instance, the vocabulary term "photo" can be used to describe a
static image for a particular resource:

<img class="photo" src="/images/sunset-at-sea.jpg" />

The Microformats community has taken this approach because it is the
easiest on web page authors. Having to memorize long strings of
pseudo-namespaced vocabulary terms like
"com.yourcompany.yourproject.package.term" is more difficult than
learning one simple word whose semantics does not change from vocabulary
to vocabulary. While the benefit is a lowered barrier to entry, the
drawback is a guaranteed collision of vocabulary terms as the number of
vocabularies increases. How long would it take to have somebody else from
another community come along and re-use the term "photo" for a
completely different purpose?

The answer, as it relates to the Microformats community, was "not that
long". When we started developing the hAudio Microformat, the first
thing we wanted to do was use the term "title" to refer to the "title of
the audio recording". It seemed to be the most straight-forward
vocabulary term that expressed what we wanted it to describe. We checked
several dictionaries and the English definition for "title" was in-line
with what we intended.

However, it conflicted with one of the 8 other compound Microformats -
hCard. It turns out "title" was defined as "job title" and so could not
be re-used by any other Microformat, to mean anything other than "job
title", no exceptions. From now on "title" will mean "job title" for the
rest of time. If anyone were to use "title" in their semantic vocabulary
outside of the Microformats community, the term would eventually become
semantically meaningless. What happens when both vocabularies are used
on the same page? There are examples of this being an issue on the
Microformats site[5].

When you use non-prefixed vocabulary terms, the chance of a vocabulary
term conflict between two communities rises exponentially as the total
number of vocabularies increases. This approach is not scalable and was
never meant to be scalable.
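
As a rough illustration of the collision, here is a minimal Python
sketch. The two dictionaries are simplified stand-ins for the hCard and
hAudio vocabularies: only the two meanings of "title" come from the
story above, while the other entries and the helper name
find_collisions are assumptions made for the example.

```python
# Simplified stand-ins for two vocabularies that use bare, non-prefixed terms.
# Only the two "title" meanings come from the discussion above; the other
# entries are illustrative assumptions.
hcard = {"title": "job title", "photo": "photo of the person"}
haudio = {
    "title": "title of the audio recording",
    "duration": "length of the recording",
}


def find_collisions(*vocabularies):
    """Return terms defined with different meanings in more than one vocabulary."""
    seen = {}        # term -> first meaning encountered
    collisions = {}  # term -> set of all conflicting meanings
    for vocabulary in vocabularies:
        for term, meaning in vocabulary.items():
            if term in seen and seen[term] != meaning:
                collisions.setdefault(term, {seen[term]}).add(meaning)
            seen.setdefault(term, meaning)
    return collisions


# Reports "title" as the only term with two competing meanings.
print(find_collisions(hcard, haudio))
```

With bare terms, a page that mixes both vocabularies gives a parser no
way to tell which meaning of "title" was intended.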

Emulated-namespace/Pseudo-namespace vocabulary terms
----------------------------------------------------

Emulated-namespace/Pseudo-namespace (EN/PN) vocabulary terms have been
mentioned on this list during the various RDFa discussions over the past
week. Microformats have resorted to emulated namespaces when forced to
do so; one example is the "entry-title" vocabulary term in the hAtom
Microformat[1].

Emulated namespaces can take the form of a set of words separated by a
namespace qualifier, for example:

<img class="hcard.photo" src="/images/sunset-at-sea.jpg" />
<img class="org.microformats.hcard.photo"
                         src="/images/sunset-at-sea.jpg" />
<img class="DC.title" src="/images/sunset-at-sea.jpg" />

Each one of these is an example of an emulated/pseudo-namespace. This
approach has been rejected by the Microformats community because it is
believed that namespaces are more difficult for webpage authors to learn
and are thus best if avoided due to the limited scope of Microformats.
The approach was rejected by the RDFa community because it doesn't
follow core W3C TAG practices and instead invents a new method of
specifying resources on the web that are not dereference-able. In short
- it re-invents the URI wheel unnecessarily.

Fully Qualified URLs
--------------------

The fully-qualified URL approach is the one that has been adopted by the
RDFa community. Fully qualified vocabulary terms specified using URLs
look like the following (take special note of the @typeof and @property
attributes):

<div about="#thunder" typeof="http://purl.org/media/video#Movie">
   <b property="http://purl.org/dc/terms/title">Tropic Thunder</b>
</div>

While this is the most verbose method of vocabulary term expression, it
has a number of benefits that the other two methods do not provide:

1. You are guaranteed to not have any sort of vocabulary term collision.
2. It re-uses the concept of a URL, a method of namespace expression
   familiar to all Web users (web page authors, developers, and most
   importantly, regular folks using the web).
3. The link is directly dereference-able, meaning that one can put any
   vocabulary term expression into a web browser and get the definition
   of the term. For example, go to the following URL, which defines the
   concept of a "Movie" in the Video RDF vocabulary:
   http://purl.org/media/video#Movie
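
A small Python sketch of the first benefit, under stated assumptions:
the Dublin Core URL is the one used in the example above, while the
example.org URL is a hypothetical vocabulary invented for illustration,
and the definition strings are paraphrases.

```python
# Terms are identified by their full URLs rather than by bare words.
DC_TITLE = "http://purl.org/dc/terms/title"           # from the example above
JOB_TITLE = "http://example.org/vocab/person#title"   # hypothetical vocabulary

definitions = {
    DC_TITLE: "a name given to the resource",
    JOB_TITLE: "job title",
}


def local_name(term_url):
    """Return the part of a term URL after the last '#' or '/'."""
    return term_url.replace("#", "/").rsplit("/", 1)[-1]


# Both terms share the local name "title", yet remain distinct keys.
print(local_name(DC_TITLE), local_name(JOB_TITLE))  # title title
print(len(definitions))                             # 2
```

Two communities can both pick the word "title" and still be told apart,
because the full URL, not the word, is the identifier.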

>> prefix short-hand via CURIEs
> 
> This is definitely not better.

I don't know where you're coming from since you haven't elaborated on
that statement nor given a link to a document explaining your thought
process. Since you haven't done so, all I can do is shoot in the dark as
to what your issue with CURIEs might be...

Let me start with why we have this URL short-hand (aka: CURIEs) in
the first place. It is a feature that helps web authors and others who
write this markup by hand to refer to long URLs in an easy way.
This means that the following:

<div about="#thunder" typeof="http://purl.org/media/video#Movie">
   <b property="http://purl.org/dc/terms/title">Tropic Thunder</b>
</div>

can be written like so, when using CURIEs:

<div about="#thunder" typeof="video:Movie">
   <b property="dcterms:title">Tropic Thunder</b>
</div>

All one must do to enable CURIEs is to define the prefixes at any DOM
element that is higher in the tree, like so:

<div xmlns:video="http://purl.org/media/video#"
     xmlns:dcterms="http://purl.org/dc/terms/"
     ...

I can already hear the screams of protest on this list, as I understand
this to be one of the most evil things that you can do in the WHATWG
group. :)
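
For those who prefer code to prose, CURIE expansion can be sketched in
a few lines of Python. This is a simplified illustration, not the
processing rules from the RDFa specification: it ignores prefix scoping
in the DOM and other details, and the function name expand_curie is an
assumption. The prefix mappings are the two xmlns declarations shown
above.

```python
# Prefix mappings taken from the xmlns declarations in the example above.
PREFIXES = {
    "video": "http://purl.org/media/video#",
    "dcterms": "http://purl.org/dc/terms/",
}


def expand_curie(curie, prefixes):
    """Expand 'prefix:reference' into a full URL by simple concatenation."""
    prefix, separator, reference = curie.partition(":")
    if not separator or prefix not in prefixes:
        raise KeyError(f"no prefix mapping for {curie!r}")
    return prefixes[prefix] + reference


print(expand_curie("video:Movie", PREFIXES))    # http://purl.org/media/video#Movie
print(expand_curie("dcterms:title", PREFIXES))  # http://purl.org/dc/terms/title
```

The short form and the long form name exactly the same term; the prefix
is only a convenience for the person typing the markup.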

We have been discussing an alternate way of expressing prefixes, like so:

<div prefix="video=http://purl.org/media/video#"

The @prefix attribute above would take a space-separated list of
prefixes as CDATA, which could address one of the issues that the HTML5
community has with the CURIE proposal. However, I believe that we are
far from discussing this at the present time - the WHATWG would have to
acknowledge that web semantics is a problem they are interested in
addressing with HTML5.
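
To show how little machinery such an attribute would need, here is a
hedged Python sketch that parses a value in the name=URL form shown
above. The syntax is only the idea under discussion, not a finished
design, and the function name and the second mapping in the sample
value are assumptions added for illustration.

```python
def parse_prefix_attribute(value):
    """Parse 'name=URL name=URL ...' into a dict of prefix mappings."""
    mappings = {}
    for entry in value.split():
        # Split on the first '=' only, since a URL may itself contain '='.
        name, separator, url = entry.partition("=")
        if not separator or not name or not url:
            raise ValueError(f"malformed prefix entry: {entry!r}")
        mappings[name] = url
    return mappings


attribute = "video=http://purl.org/media/video# dcterms=http://purl.org/dc/terms/"
print(parse_prefix_attribute(attribute))
# {'video': 'http://purl.org/media/video#', 'dcterms': 'http://purl.org/dc/terms/'}
```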

Perhaps you could outline the reasons that the HTML5 community is so
allergic to the concept of URL-short-hand using prefix mapping in HTML
documents? I ask out of curiosity and because I have never heard the
whole story from the WHATWG's perspective.

>> accessibility-friendly
> 
> How is not reusing HTML semantics better than using them? 

Could you explain further and include examples, please? I don't
understand how RDFa's approach is "not reusing HTML semantics".

> With the 
> exception of the now-resolved <time> issue, it seems like Microformats has 
> the better accessibility story.

I was indirectly addressing the problem of semantically embedding
machine-readable data-types that are not human friendly. This includes
data such as dates, times, currency codes, weights, distances and most
other types of ISO-codes that are not meant to be human readable, but
are necessary for proper semantic expression of units of measure.

The example that I have from the Microformats community is the mis-use
of the <abbr> element[2] in order to stuff dates and times into the
hCalendar, hAtom and hReview specifications as well as durations into
the hAudio specification[3]. This approach has resulted in a
rejection[4] of Microformats on several BBC web properties.

The addition of <time> only fixes things for Microformats in HTML5 - it
does nothing for the expression of these machine-readable data-types in
HTML4, XHTML1.1, and XHTML2.

HTML5 is certainly not going to include a <currency> or <weight> or
<force> element for mark-up of those data types for
accessibility-friendly browsers. I am asserting that RDFa can solve this
issue for all HTML communities without causing accessibility concerns,
like the <abbr> design pattern issue that is still unsolved in the
Microformats community.

This semantics and accessibility issue extends outside of the HTML5
community, and I believe that it has been addressed for the most part in
the HTML4, XHTML1.1 and XHTML2 communities through the adoption of RDFa.
This is not to say that it is the only solution - just that it is the
only solution that I am aware of that addresses the accessibility
concerns when embedding semantic data in web pages using legacy elements
such as <abbr>.

>> unified processing rules, etc)
> 
> Microformats could certainly benefit from a more consistent parsing model, 
> but that can be obtained without going to RDFa.

Yes, I agree. If the parsing model were the only thing that needed to be
fixed, then we wouldn't need RDFa. However, there is more that RDFa
addresses.

If you had a fully consistent parsing model for Microformats, you would
still have the issue of erroneous scoping in Microformats:

http://microformats.org/wiki/accepted-limitations-of-microformats#Microformat_Scoping_Issue

Let us assume that the scoping issue is fix-able as well. The only major
issue that remains, then, is the strictly non-prefixed vocabulary term
issue. If you address that, you basically have RDFa.

>> [...] this issue really boils down to one of centralized innovation vs. 
>> distributed innovation.
> 
> I don't see what the syntax has anything to do with whether the formats 
> are developed centrally or not. 

Then I have failed to explain how the syntax of the language affects the
vocabulary development process. Let me try again:

When a language chooses to not use namespaces, the onus of the vocabulary
term development falls onto a central authority that must manage those
vocabulary terms in order to ensure that a collision does not occur.

Imagine the Java programming language without namespaces (packages): all
class names would have to be globally unique. If you created a popular
class called "Parser" and released it to the world, nobody else in the
entire world would be able to create a class named "Parser" for fear
that it would conflict with yours.

However, if you add a namespace to that class, such as
"org.whatwg.Parser" - it becomes possible to create class names in a
decentralized environment.
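
The same point can be sketched in Python with a toy registry. The
Registry class is purely illustrative, "org.whatwg.Parser" is the name
used above, and "com.example.Parser" is a hypothetical second package.

```python
class Registry:
    """A toy name registry that refuses to register the same name twice."""

    def __init__(self):
        self.owners = {}

    def register(self, name, owner):
        if name in self.owners:
            raise ValueError(f"{name!r} is already taken by {self.owners[name]}")
        self.owners[name] = owner


# Flat names: the second author is locked out of the name "Parser".
flat = Registry()
flat.register("Parser", "first author")
try:
    flat.register("Parser", "second author")
    flat_conflict = False
except ValueError:
    flat_conflict = True

# Qualified names: both authors can publish a "Parser" independently.
qualified = Registry()
qualified.register("org.whatwg.Parser", "first author")
qualified.register("com.example.Parser", "second author")  # hypothetical package

print(flat_conflict)          # True
print(len(qualified.owners))  # 2
```

With flat names somebody has to arbitrate who gets "Parser"; with
qualified names nobody needs to ask permission.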

Taking this one step further, due to the reasons stated earlier in this
e-mail, using arbitrary namespace qualifiers in HTML is unnecessary as
we already have a readily available namespace qualifier on the web - the
URL.

> Nothing is stopping anyone from creating 
> another Microformats-like organisation that does the same thing without 
> going through the Microformats.org process. There could be millions of 
> them, in fact. So long as they pick names that are suitably unique (e.g. 
> URIs, or Java-like identifiers), or so long as they don't promote the use 
> of their formats outside of their own site, I don't see a problem.
> 
> In fact, this is happening every day, with each author making up his own 
> class values for use on his own site.

Yes, agreed... but internal website vocabularies are not the issue here
- external website vocabularies are one of the issues that RDFa addresses.

One of the questions we were tasked with answering was "How do you
create vocabulary terms that can be easily inter-mixed with vocabulary
terms from another online community on any HTML page contained on any
site on the Web?"

The key here is that we DO want people to promote the use of their
vocabularies and formats outside of their own site, and we want them to
do so without having to deal with any central authority.

>> The Microformats community, and all communities like it, require a group 
>> of people to come together, collaborate and create a standard vocabulary 
>> to express ALL semantics.
> 
> Well, for any one person to do anything useful with the data on the Web, 
> they have to have a core vocabulary (or set of vocabularies) that they 
> understand. So a set of standard vocabularies to express all the semantics 
> that that one person is interested in is needed, yes.
> 
> It doesn't have to be _all_ semantics, however. I might want to have a 
> format for annotating Stargate analysis Web pages that me and my friends 
> write, but so long as me and my friends agree on it it doesn't have to 
> involve anyone else.

Sure, and that's great and that problem has already been addressed to
some degree. That is not the issue, however. The issue is "How do you
use that vocabulary outside of your friend group (for instance, between
all of the Stargate fan communities on the Web) and ensure that the
vocabulary terms do not conflict with any other vocabulary on the Web?"

>> A somewhat strained analogy would be bringing in representatives from 
>> all of the cultures of the world and having them agree on a universal 
>> vocabulary.
> 
> That's pretty much exactly what Unicode did. Or what we're doing with 
> HTML. That doesn't seem untenable, it seems quite reasonable.

No, what Unicode did was bring in representatives from all of the
cultures of the world and it had them agree on a universal /character
set/ - not a universal /vocabulary/. The difference is that no
mainstream culture had to choose to not have their character set
represented. Unicode can encode a majority of the characters used in all
languages around the world, so nobody had to make any hard decisions on
which character set to leave behind.

My point was that they didn't bring representatives in from around the
world and tell them to decide on a single language - English, Mandarin,
French, Sanskrit, or Esperanto. Pick one.

Those are the types of decisions you are forced to make when you have
centralized vocabulary development. You have centralized vocabulary
development if your semantic expression language does not have the
capability of namespacing vocabulary terms.

Does that train of thought clarify the point I was attempting to make?

>> In short, RDFa addresses the problem of a lack of a standardized 
>> semantics expression mechanism in HTML family languages. RDFa not only 
>> enables the use cases described in the videos listed above, but all use 
>> cases that struggle with enabling web browsers and web spiders to
>> understand the context of the current page.
> 
> I'm not convinced the problem you describe can be solved in the manner you 
> describe. It seems to rely on getting authors to do something that they 
> have shown themselves incapable of doing over and over since the Web 
> started. It seems like a much better solution would be to get computers to 
> understand what humans are doing already.

I agree with your last two points, but not the first. Yes, some authors
have shown themselves incapable of marking up certain types of metadata
in HTML pages since the dawn of the Web. Yes, a better solution would be
to get computers to understand what humans are doing already.

Tools and education are vital, but not sufficient, in addressing the
laziness issue. RDFa is a tool that we can use to address the issue of
machine learning. I won't go into the whole machine learning thing again
other than to state that RDFa and machine-based learning are not
mutually exclusive. They are mutually beneficial.

As for not being convinced, I believe that I am addressing your concerns
in a logical manner and am informing this community as to why some of
the suggestions that have been made over the past week by members of
WHATWG are not the proper approach to the problems that RDFa is
addressing. I trust that you will continue to address the issues I am
raising in the same logical manner and explain why WHATWG doesn't
consider semantics to be an issue that needs to be addressed.

It is unfortunate that it takes so much time to convey over a decade of
experience and work performed by members of the Semantic Web Deployment
workgroup as well as those involved in the RDFa Task Force and
Microformats community. I trust that all on this list will continue to
be receptive to what those that have been involved with these issues
have to say as we continue to answer questions posed by members of WHATWG.

> Even if we ignore that, it doesn't seem like the above discussion would 
> lead one to a set of requirements that would lead one to design a language 
> like RDFa.

You will have to clarify this statement and give some examples, Ian. I
am sure that there are holes in my explanation, however, we should be
careful not to prematurely discount RDFa.

It went through a very rigorous process involving many different people
from many different communities that settled on one method, out of
hundreds of permutations, for semantic expression in HTML. RDFa is going
to be published as a W3C Recommendation in the coming months (it just
got through the Candidate Recommendation phase and will be transitioning
to the Proposed Recommendation phase very shortly).

I need to know your thoughts about this topic in a bit more depth - why
doesn't it seem like the above discussion would lead one to a set of
requirements that would lead to a semantic expression mechanism like RDFa?

> Thanks for the explanation, by the way. This is by far the most useful 
> explanation of RDFa that I have ever seen.

Good, I'm glad it helped as there is much that people do not understand
about RDFa. Thanks for listening thus far, and I'm committing time over
the next two weeks to answer any questions that this community might
have as a result of our work in the Microformats and RDFa communities.
RDFa has changed a great deal in the past two years and what most people
know of RDF and RDFa predates 2005.

For those that are interested in all of the gory details, there is an
RDFa Primer which is a pretty quick read, available here:

http://www.w3.org/TR/xhtml-rdfa-primer/

The full RDFa specification, which is not a quick read, is available here:

http://www.w3.org/TR/rdfa-syntax/

-- manu

[1]http://microformats.org/wiki/namespaces-inconsistency-issue
[2]http://microformats.org/wiki/datetime-design-pattern
[3]http://microformats.org/wiki/haudio#Published
[4]http://www.bbc.co.uk/blogs/radiolabs/2008/06/removing_microformats_from_bbc.shtml
[5]http://microformats.org/wiki/accepted-limitations-of-microformats#Microformat_Scoping_Issue

-- 
Manu Sporny
President/CEO - Digital Bazaar, Inc.