[whatwg] A Selector-based metadata proposal (was: Annotating structured data that HTML has no semantics for)

Fri May 22 03:26:23 PDT 2009

On Thu, May 21, 2009 at 5:19 PM, Toby Inkster <mail at tobyinkster.co.uk> wrote:
> On Thu, 2009-05-21 at 13:26 +0200, Eduard Pascual wrote:
> [... lots ...]
I won't go point by point through your reply either, but there are
some points worth answering.

> CSS was invented as a way to separate out content from styling. Or to
> put it another way, to separate out data and presentation, which allows
> the same data to be re-presented (or indeed represented) in many
> different ways. The unobtrusive scripting "movement" (for want of a
> better word) aims to separate out behaviour from data, which I think is
> also a worthy ideal. But I consider the information which RDFa carries
> to be very strongly part of the document's *data*, so not especially
> suitable for separating out.
The way you describe CSS makes it look very different from CRDF
and similar approaches. But I see it somewhat differently: just as
CSS describes how content should be conveyed to humans, CRDF describes
how it should be conveyed to machines. With this description, they
suddenly look quite parallel; so I'll stay on neutral ground and take
these as just different points of view.
It's important to state that CRDF is *not* intended to take *all* the
semantics *out* of the document. In the most extreme cases, it would
be intended to take *some* *descriptions* of those semantics somewhere
more centralized (an external file if it's to be shared by several
documents, the document header if it's to be widely used across the
document, etc).

> (This consideration very much affected the design of RDF-EASE. You'll
> note that the -rdf-about and -rdf-content properties which it defines do
> not allow the author to hard code data into the RDF-EASE file -- they
> only allow the author to specify an attribute from the (X)HTML file
> where the data can be found.)
This makes a lot of sense. Actually, RDF-EASE is meant to be always
placed on an external file, so it's reasonable to disallow stuff that
just shouldn't go on an external file.
CRDF, on the other hand, is designed to work either as an external
file, an embedded piece of code (a.k.a. a <script>, using HTMLish
terms), or inline within the document; and, most prominently,
combining these forms as appropriate for each case. It also tries to
have a syntax and content model that is consistent across all three
usages. This leads to features that are mostly intended for inline
usage also being allowed when CRDF is used as an external file; but
this doesn't mean that such usage is either intended or advisable.
To give a clearer example, should CSS forbid constructs like this:
"h1:not(h1)"? (hint: they are allowed). Some things just make no
sense, but are allowed because explicitly forbidding them would add
unneeded complexity to the format.
My plan was to follow CSS's good example, adding informative notes on
stuff that is implicitly allowed but makes no sense or is inadvisable,
rather than going for explicit prohibitions.
Keep in mind that, on external files or scripts, the kind of usages
that should be expected would be something like this:
.person { @|subject: blank() }
.person time.dob { foo|birthdate: foo|date(attr(datetime)) }
/* foo|date(...) is the explicit datatype notation */
Rules in the form "prefix|property: literalvalue" are only intended
for inline usages. Actually, trying to use them externally would be
quite hard, unless an author can be sure that all the elements matched
by a selector would actually share the value (and if they do, what'd
be wrong with stating it just once?).
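
To make the two external rules above concrete, here is a rough Python
sketch of what a processor might extract from a sample document. None of
this comes from the CRDF draft: the class name PersonExtractor, the
sample markup, the blank-node labels ("_:b0", ...) and the choice to
attach the property to the nearest enclosing .person are all my own
assumptions, and the two selectors are hard-coded rather than parsed.

```python
from html.parser import HTMLParser


class PersonExtractor(HTMLParser):
    """Hard-coded illustration of two rules (hypothetical, not a CRDF parser):

    .person          { @|subject: blank() }
    .person time.dob { foo|birthdate: foo|date(attr(datetime)) }
    """

    def __init__(self):
        super().__init__()
        self.stack = []    # open elements as (tag, blank-node label or None)
        self.counter = 0   # used to mint blank-node labels
        self.triples = []  # (subject, property, (value, datatype))

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # Nearest enclosing .person element, if any (assumed attachment rule).
        current = next(
            (subj for _, subj in reversed(self.stack) if subj is not None), None
        )
        # Second rule: a time.dob descendant of .person yields a typed literal.
        if (tag == "time" and "dob" in classes
                and current is not None and "datetime" in attrs):
            self.triples.append(
                (current, "foo:birthdate", (attrs["datetime"], "foo:date"))
            )
        # First rule: every .person element introduces a fresh blank node.
        subject = None
        if "person" in classes:
            subject = f"_:b{self.counter}"
            self.counter += 1
        self.stack.append((tag, subject))

    def handle_endtag(self, tag):
        # Close back to the most recent matching open element, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


SAMPLE = (
    '<div class="person"><span>Ann</span> '
    '<time class="dob" datetime="1980-01-01">1 Jan 1980</time></div>'
)

extractor = PersonExtractor()
extractor.feed(SAMPLE)
for triple in extractor.triples:
    print(triple)
```

Note that the value comes from the document's own datetime attribute,
not from the sheet, which is the kind of usage I'd expect for external
files.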

> [... some stuff about how will English change in a thousand years ...]
>
> A great help in clarifying your usage of terms is the inclusion of a
> glossary. For example, I could write:
>
> <dl>
>  <dt>name</dt>
>  <dd>
>    A name is a label for a noun, (human or animal,
>    thing, place, product [as in a brand name] and even an
>    idea or concept), normally used to distinguish one from
>    another.
>    (<a href="http://en.wikipedia.org/wiki/Name">source</a>)
>  </dd>
> </dl>
>
> With RDFa, the idea of a glossary can be used to reduce our reliance on
> external vocabularies:
>
> <dl xmlns:foaf="http://xmlns.com/foaf/0.1/"
>    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
>  <dt about="[foaf:name]" property="rdfs:label">name</dt>
>  <dd about="[foaf:name]" property="rdfs:comment" datatype="">
>    A name is a label for a noun, (human or animal,
>    thing, place, product [as in a brand name] and even an
>    idea or concept), normally used to distinguish one from
>    another.
>    (<a rel="rdfs:seeAlso"
>    href="http://en.wikipedia.org/wiki/Name">source</a>)
>  </dd>
> </dl>
>
> This doesn't completely eliminate the risk, but goes a long way to
> mitigating it.
Agreed. But CRDF would also allow that kind of glossary. What's your
point with it?
Again, let me insist that external file CRDF is only one of its
possible usages. Actually, it only makes sense when it holds rules
that apply to multiple documents (otherwise, <script> or inline uses
would work better). If an author is already caring about keeping
several documents live, then keeping one extra .crdf file live as well
shouldn't be too difficult.
Please, don't be misled by Tab's "favoritism" towards external
.crdf files. While they are a useful tool for some of the cases, they
do not cover all the cases. <script> and inline uses are equally
important and, IMO, one of the strongest points of CRDF is that it
provides a unified syntax for all three usages, rather than having to
rely on different formats for each thing (for example, using RDFa for
inline stuff and EASE for external stuff would be, in the best case,
messy).

>> The reduced number of attributes in CRDF is not aimed to deal with
>> complexity; but with a separate issue: it is easier for a host
>> language to add a rel value for <link>s and an extra attribute with no
>> predefined name, than the bunch of attributes RDFa defines.
>
> Not just an extra rel value for <link>, but in some languages it would
> involve introducing the <link> element to begin with. The cost of
> introducing a new element is significantly higher than new attributes,
> given that in most implementations of XML-like languages, unknown
> attributes are generally ignored.
Please, review "3.1. Linking to CRDF sheets" about this. <link> is
used in X/HTML because: 1) X/HTML already defines it; and 2) it's made
exactly for the kind of job we are doing here. For generic XML, a
processing instruction like <?xml-metadata ...?> is suggested. Besides
these case-specific recommendations, the basic requirement is stated
as "The host language must include a mechanism for linking to external
CRDF sheets." <link> and PIs, where available, are both good
mechanisms to deal with this requirement, but a language can define
any other mechanism it finds appropriate.
Section "3.2. Embedding CRDF sheets", which deals with <script>,
describes this as highly desirable, rather than a requirement:
<script> is reused in X/HTML because it's available and it is ready
for the job; for other languages three cases are possible:
1) The language has something as flexible as <script>, and thus it's
re-used for CRDF.
2) The language defines an element just to deal with this feature.
3) This feature is not available at all from that language.
This is a per-language choice, and all three options would be
perfectly compliant with CRDF's requirements.
In summary, the requirements for a CRDF host language would be:
"a mechanism for linking to external CRDF sheets" and "an attribute
whose content model is “a CRDF inline definition” (other wordings are
acceptable, of course, as long as they mean the same)" (the document also
describes what "a CRDF inline definition" is).

>> Actually,
>> there have been some complaints [1] about why HTML5 should restrain
>> itself from using quite useful attribute names such as "content" or
>> "resource", just because RDFa decided to use them, without giving
>> non-X HTML a thought.
>
> Attribute names are not a scarce commodity. Just using the 26 letters of
> the English alphabet (I avoid calling it the "Latin alphabet" given that
> three of the letters are post-Roman inventions) you can create about 10
> million different 5-letter attribute names. Certainly most of them are
> nonsensical, but there are an awful lot of attribute names to choose
> from, so it doesn't make sense to introduce potentially harmful clashes
> where they could be avoided.
>
> You beg the question of whether the RDFa task force invented attributes
> without giving HTML a thought. Certainly RDFa's XHTML 2.0 heritage is
> clear, but the language employed by the RDFa syntax document appears
> very carefully chosen to accommodate HTML.
Really? It already has some conflicts with HTML4 (@content is already
used in that format; more on this later). The point is that, among the
10 million or more available names, the RDFa group took names that are
highly generic: "content" or "resource", for example, could be used
for lots of things on a web markup language, but the RDFa guys decided
that HTML should abstain from using them for anything, without asking.
Not very polite, IMO.
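
For reference, the figure can be checked with a couple of lines of
Python. This is just a sanity check, assuming names of exactly five
letters drawn from the 26-letter alphabet; names_of_length is an
illustrative helper of mine, nothing more.

```python
def names_of_length(n: int, alphabet_size: int = 26) -> int:
    """Count the distinct n-letter strings over an alphabet of the given size."""
    return alphabet_size ** n


five_letter_names = names_of_length(5)
print(five_letter_names)  # 11881376
```

So "about 10 million" is, if anything, an underestimate, which only
makes the choice of such generic names harder to justify.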

> The only aspect of RDFa which doesn't sit especially well in HTML is
> CURIE prefix mappings, which use xmlns:* attributes. In practice, it
> doesn't seem to have proved a difficulty to those of us who have
> implemented support for RDFa in HTML, but there are theoretical and
> aesthetic arguments against it. But this is a small issue which is not
> especially difficult to fix, and there's no reason to throw the baby out
> with the bathwater. Various solutions to it are being discussed both
> here and on the public-rdf-in-xhtml-tf at w3.org list.
Are you calling the DOM Consistency Principle a "theoretical" or
"aesthetic" argument? That principle is the only thing that allows
migrating documents from X to soup or vice-versa without having to
redo every script; or to have scripts working properly with seamless
frames where XHTML and tag-soup sources are mixed together. Sure, this
is not an issue for script-less documents, but script-based web
applications are a reality, and are growing in both number and
complexity at a quite fast pace. One of the reasons HTML5 exists at
all is that the W3C was quite unwilling to deal with this reality.
The only reason that RDFa in HTML has worked until now is the same
reason <font> worked until browsers were ready for CSS: authors will
normally stick to what works, so they won't be messing with the DOM if
they are putting "xmlns:" stuff in it on an HTML document.
The point is that we need specs that deal with authors' and users'
needs, rather than authors who work around spec flaws.

>> In other words: currently, RDFa parsers should have enough to ignore
>> non-X HTML content (or, more specifically, documents with no default
>> xmlns in <body>, so they can also cope with the XHTML1.1+RDFa served
>> as text/html aberration, which is wrong no matter how you look at it).
>
> Personally I think it was a mistake to register a new content-type for
> XHTML to begin with - it introduced an unnecessary schism between HTML
> and XHTML which should have just been a natural progression.
Personally, I think that XHTML (or, more exactly, trying to bring
draconian error handling to the web) was a mistake itself. XHTML can't
be a natural progression for HTML, for a quite simple reason: most
existing HTML content would be rendered as an XML parsing error notice
if it were processed as XHTML requires a page to be processed.

> Any XHTML-family language which doesn't use elements from non-XHTML
> namespaces and follows a few simple rules for backwards-compatibility in
> practise seems to work fine served as text/html.
Any document that can work properly served as text/html could be
authored in plain HTML, and gains no benefit from XHTML. What's the
point of switching to XHTML if you aren't going to take advantage of it,
and you still have to deal with the compatibility rules?

>> If RDFa was taken into HTML5, then parsers should also care about
>> non-X documents, which binds HTML to not use these attribute names for
>> any future extension (actually, as pointed on Ian's mail referenced
>> above, @content is already used on <meta> since HTML4, so this can't
>> even be fulfilled).
>
> RDFa's use of @content is compatible with its use in HTML4. No, they are
> not identical uses, but they are not inconsistent either. Much like
> saying that "I am a human", and "I am a mammal" are not identical
> statements, but are consistent.
>
> In HTML4 @content is used on <meta> to indicate a string that parsers
> interested in a particular piece of metadata should use. In RDFa it is used
> in the same way, but allowed globally instead of just on <meta>.
At any given moment, the HTML group could have decided to extend the
use of @content to other elements. It would especially make sense if
it was a use comparable to the one on <meta>. RDFa took away this
possibility without even asking the HTML folks whether any extension
of this attribute was expected. Likewise, @typeof could
have lots of usages in future versions of webforms, but RDFa shut that
door for HTML. Again, @resource could also have several potential uses
(for example, to refer to cache or local storage resources by web
applications), but RDFa shut that door for HTML too. RDFa could have
taken a less disruptive approach, for example prefixing "rdfa-" or
even just "r" to attribute names to avoid shutting doors to HTML, but
they didn't. Now, don't be surprised that the HTML guys are so
unwilling to open the doors to RDFa.

Regards,
Eduard Pascual