[whatwg] A Selector-based metadata proposal (was: Annotating structured data that HTML has no semantics for)

Thu May 21 08:19:21 PDT 2009

On Thu, 2009-05-21 at 13:26 +0200, Eduard Pascual wrote:

> [... lots ...]

Eduard, thanks for your long and informative reply. I won't go into
every point mentioned in detail, but in summary I'd like to say that
your message reassured me on a few points and perhaps CRDF is not as bad
as I initially thought.

That said, I do think that externalising the semantics of a document
is a mistake. As the author of RDF-EASE, I don't say this without having
thought the matter through. 

CSS was invented as a way to separate out content from styling. Or to
put it another way, to separate out data and presentation, which allows
the same data to be re-presented (or indeed represented) in many
different ways. The unobtrusive scripting "movement" (for want of a
better word) aims to separate out behaviour from data, which I think is
also a worthy ideal. But I consider the information which RDFa carries
to be very strongly part of the document's *data*, so not especially
suitable for separating out.

(This consideration very much affected the design of RDF-EASE. You'll
note that the -rdf-about and -rdf-content properties which it defines do
not allow the author to hard code data into the RDF-EASE file -- they
only allow the author to specify an attribute from the (X)HTML file
where the data can be found.)

That's very much an ideological argument, and I appreciate that not
everyone shares my ideology. But for those who don't, there is also the
more practical argument that separating out an aspect of the document's
meaning from the bulk of the markup increases the fragility of its
meaning. If the external file is lost, then part of the document's
meaning is lost.

Some people might argue that RDF already does this by relying on
external vocabularies, but this is only partly so.

By simply using <span about="#me"
xmlns:foaf="http://xmlns.com/foaf/0.1/" property="foaf:name">...</span>
then I am, to a certain extent, relying on the FOAF project's definition
of "name" to be stable.

(Bear with me here, as this is about to start to seem very abstract,
but I'll bring it back to the more practical eventually.)

Even without RDFa though, I am relying on the usual English definition
of "name" being stable. It might seem unlikely that the standard English
definition of words is going to change especially much, but remember
that some of HTML5's proponents have lofty ambitions that HTML5
documents should still be readable in 1000 years. 

Think not of 1000 years, but consider how, just in our own lifetimes,
the words 'Web', 'surf' and 'browser' have picked up new meanings which
probably surpass their original meanings in terms of day-to-day usage.

Look back at how English was spoken 1000 years ago and you'll appreciate
how much it's changed. Many people have difficulty reading Shakespeare,
who wrote his work a mere ~400 years ago. Chaucer's "The Canterbury
Tales", written only 200 years earlier, is virtually indecipherable
these days. Go back any further and you are effectively
looking at another language.

Some believe that the future will bring an even faster rate of change to
the English language, with new technologies giving us new concepts to
think about and label, and the ever wider spread of English as a second
language leading to an increase in loan words.

A great help in clarifying your usage of terms is the inclusion of a
glossary. For example, I could write:

<dl>
  <dt>name</dt>
  <dd>
    A name is a label for a noun, (human or animal,
    thing, place, product [as in a brand name] and even an
    idea or concept), normally used to distinguish one from
    another.
    (<a href="http://en.wikipedia.org/wiki/Name">source</a>)
  </dd>
</dl>

With RDFa, the idea of a glossary can be used to reduce our reliance on
external vocabularies:

<dl xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <dt about="[foaf:name]" property="rdfs:label">name</dt>
  <dd about="[foaf:name]" property="rdfs:comment" datatype="">
    A name is a label for a noun, (human or animal,
    thing, place, product [as in a brand name] and even an
    idea or concept), normally used to distinguish one from
    another.
    (<a rel="rdfs:seeAlso"
    href="http://en.wikipedia.org/wiki/Name">source</a>)
  </dd>
</dl>

This doesn't completely eliminate the risk, but goes a long way to
mitigating it.

Anyway, that's enough on internal/external data. A few more specific
points...

> The reduced number of attributes in CRDF is not aimed to deal with
> complexity; but with a separate issue: it is easier for a host
> language to add a rel value for <link>s and an extra attribute with no
> predefined name, than the bunch of attributes RDFa defines.

Not just an extra rel value for <link>, but in some languages it would
involve introducing the <link> element to begin with. The cost of
introducing a new element is significantly higher than that of new attributes,
given that in most implementations of XML-like languages, unknown
attributes are generally ignored.

> Actually,
> there have been some complains [1] about why should HTML5 restraint
> itself from using quite useful attribute names such as "content" or
> "resource", just because RDFa decided to use them, without giving
> non-X HTML a thought.

Attribute names are not a scarce commodity. Just using the 26 letters of
the English alphabet (I avoid calling it the "Latin alphabet" given that
three of the letters are post-Roman inventions) you can create nearly 12
million different 5-letter attribute names (26^5 = 11,881,376).
Certainly most of them are
nonsensical, but there are an awful lot of attribute names to choose
from, so it doesn't make sense to introduce potentially harmful clashes
where they could be avoided.
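
As a quick check of that arithmetic, here is a minimal sketch in Python. It
assumes names of exactly five letters drawn from a-z only, with case ignored;
the variable names are illustrative and not taken from any specification.

```python
import string

# Assumption: attribute names use only the 26 letters a-z, case-insensitively.
letters = string.ascii_lowercase

# Names of exactly five letters: one independent choice of letter per position.
five_letter_names = len(letters) ** 5

# Names of one to five letters, for comparison.
up_to_five = sum(len(letters) ** k for k in range(1, 6))

print(five_letter_names)  # 11881376
print(up_to_five)         # 12356630
```

Allowing digits, hyphens or longer names only makes the space larger, so the
point stands whichever way the count is done.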

You beg the question of whether the RDFa task force invented attributes
without giving HTML a thought. Certainly RDFa's XHTML 2.0 heritage is
clear, but the language employed by the RDFa syntax document appears
very carefully chosen to accommodate HTML.

The processing sequence is defined in very DOM-like terms, making it
easy to carry out on any DOM tree without having to worry about the
serialisation that the DOM tree was built from.
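
To illustrate what "DOM-like" buys you, here is a rough Python sketch that
walks a DOM tree and collects literal statements from @about, @property and
@content. It is emphatically not the RDFa processing sequence: it ignores
CURIE expansion, @rel, @rev, @resource, @datatype, chaining and language
handling. The sample markup, the names "Alice Example" and "alice", and the
helper names text_of and literal_statements are all made up for this example.

```python
from xml.dom import minidom

# Hypothetical sample markup, modelled loosely on the foaf:name example.
SAMPLE = """<div xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <span about="#me" property="foaf:name">Alice Example</span>
  <span about="#me" property="foaf:nick" content="alice">her nickname</span>
</div>"""


def text_of(node):
    """Concatenate the text nodes found beneath a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def literal_statements(node, subject=""):
    """Yield (subject, property, literal) tuples for a simplified RDFa subset."""
    if node.nodeType != node.ELEMENT_NODE:
        return
    # Simplification: @about sets the subject for this element and its descendants.
    if node.hasAttribute("about"):
        subject = node.getAttribute("about")
    if node.hasAttribute("property"):
        # @content, when present, overrides the element's text content.
        if node.hasAttribute("content"):
            value = node.getAttribute("content")
        else:
            value = text_of(node)
        yield (subject, node.getAttribute("property"), value)
    for child in node.childNodes:
        yield from literal_statements(child, subject)


document = minidom.parseString(SAMPLE)
statements = list(literal_statements(document.documentElement))
for statement in statements:
    print(statement)
```

The walk only ever touches elements, attributes and text nodes, so the same
function would work on a tree built by an HTML parser that exposes the same
DOM interface; nothing in it depends on the document having been XML.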

As another example of its neutral stance, it says that language
information "can" be provided using xml:lang, but doesn't appear to rule
out other mechanisms for declaring language.

The only aspect of RDFa which doesn't sit especially well in HTML is
CURIE prefix mappings, which use xmlns:* attributes. In practice, it
doesn't seem to have proved a difficulty to those of us who have
implemented support for RDFa in HTML, but there are theoretical and
aesthetic arguments against it. But this is a small issue which is not
especially difficult to fix, and there's no reason to throw the baby out
with the bathwater. Various solutions to it are being discussed both
here and on the public-rdf-in-xhtml-tf at w3.org list.

> In other words: currently, RDFa parsers should have enough to ignore
> non-X HTML content (or, more specifically, documents with no default
> xmlns in <body>, so they can also cope with the XHTML1.1+RDFa served
> as text/html aberration, which is wrong no matter how you look at it).

Personally I think it was a mistake to register a new content-type for
XHTML to begin with - it introduced an unnecessary schism between HTML
and XHTML which should have just been a natural progression.

Any XHTML-family language which doesn't use elements from non-XHTML
namespaces and follows a few simple rules for backwards-compatibility
seems, in practice, to work fine when served as text/html.

> If RDFa was taken into HTML5, then parsers should also care about
> non-X documents, which binds HTML to not use these attribute names for
> any future extension (actually, as pointed on Ian's mail referenced
> above, @content is already used on <meta> since HTML4, so this can't
> even be fulfilled).

RDFa's use of @content is compatible with its use in HTML4. No, they are
not identical uses, but they are not inconsistent either. Much like
saying that "I am a human", and "I am a mammal" are not identical
statements, but are consistent.

In HTML4 @content is used on <meta> to indicate a string that parsers
interested in a particular piece of metadata should use. In RDFa it is used
in the same way, but allowed globally instead of just on <meta>.

-- 
Toby Inkster <mail at tobyinkster.co.uk>