[whatwg] A Selector-based metadata proposal (was: Annotating structured data that HTML has no semantics for)

Thu May 21 04:26:54 PDT 2009

On Wed, May 20, 2009 at 6:56 PM, Toby Inkster <mail at tobyinkster.co.uk> wrote:
> Given that one of the objections people cite with RDFa is complexity,
> I'm not sure how this resolves things. It seems twice as complicated to
> me. It creates fewer new attributes, true, but number of attributes
> themselves don't create much confusion.
The reduced number of attributes in CRDF is not aimed to deal with
complexity; but with a separate issue: it is easier for a host
language to add a rel value for <link>s and an extra attribute with no
predefined name, than the bunch of attributes RDFa defines. Actually,
there have been some complains [1] about why should HTML5 restraint
itself from using quite useful attribute names such as "content" or
"resource", just because RDFa decided to use them, without giving
non-X HTML a thought.
In other words: currently, RDFa parsers should have enough to ignore
non-X HTML content (or, more specifically, documents with no default
xmlns in <body>, so they can also cope with the XHTML1.1+RDFa served
as text/html aberration, which is wrong no matter how you look at it).
If RDFa was taken into HTML5, then parsers should also care about
non-X documents, which binds HTML to not use these attribute names for
any future extension (actually, as pointed on Ian's mail referenced
above, @content is already used on <meta> since HTML4, so this can't
even be fulfilled).
CRDF takes a *less intrussive* approach: it minimizes number of
attributes, and even lets the host language to chose the name for
them; with the only requirement of them being defined in the spec as
"CRDF inline stuff" (the document suggest one wording for this, but
explicitly allows for any equivalent wording).
The goal of the fewer attributes, hence, is not to be *simpler*, but
to be *less intrussive*; and the referenced mail is the main reason to
want to be so.

On the simplicity/complexity debate, I'll point out that the isn't a
goal to make things too much simpler with CRDF (although there is a
goal to not make them more complex than needed).
Also, keep in mind that the document is in a quite early stage: many
things are too vaguely defined yet, and will become clearer once it
matures. Again, let me insist that the goal of these early versions is
to describe the idea and concept, and to draw feedback about it: it is
*not* a spec, and there are many details that are just left implicit
or even undefined yet. I'll make sure to clearly state the design
goals of CRDF on the next iteration of the document, to avoid
confusion.

> e.g. which is a simpler syntax:
>
> <a href="http://foo.example.com/"
>   ping="http://tracker.example.com/">Foo</a>
>
> or:
>
> <a href="primary:url('http://foo.example.com/');
>         secondary:url('http://tracker.example.com/');">Foo</a>
Like Tab, I think this example is completely unrelated. Fortunately, I
could understand your point without it ;-)

> Stuffing multiple discrete pieces of information makes things harder for
> parsing, harder for authoring tools and harder for authors.
Parsing a @crdf (or whatever it gets named) attribute shouldn't be
much harder than parsing a @style attribute. Furthermore, a good deal
of CSS parsing code can be reused to build CRDF parsers. Similarly,
authoring tools that already handle CSS styling may reuse a good deal
of the code to enable handling of CRDF metadata; and authors may apply
most concepts from CSS (such as Selectors, or the property:value
syntax) to CRDF.
> In RDFa, each attribute performs a simple role - e.g. @rel specifies the
> relationship between two resources; @rev specifies the relationship in
> the reverse direction; @content allows you to override the
> human-readable text of an element. Combining these into a single
> attribute would not make things simpler.
Nop, it doesn't. It doesn't have to make things too much more complex
either. But it makes the format easier to integrate in the host
language, since it requires less changes and such changes are more
flexible.

Now, let's speak about simplicity: CRDF gives you two ways to define
the values of properties (or, actually, two variants of the same way):
the CSSish property:value syntax and the short-hand syntax that just
omits the value and defaults it to the "contents" keyword. RDFa can be
defining a value with href, or with contents, or have it implicit;
which depends on whether the rel/rev or property attributes are used,
and whether the contents attribute is present or not. That makes three
ways of defining property values, which depend on up to four different
attributes, against CRDF two ways which only vary on the value being
given or not.
Next, RDFa may be defining the property itself on @property, on @rel,
or on @rev; while CRDF always define them the same way.
For subjects, OMG, both @about and @src may be defining them, or they
may be "inherited" from parent elements; while on CSS they are always
defined via @|subject or inherited through the well-defined CSS
cascading rules (actually, they may also be re-defined for "reversed"
properties, but that part is being considered for removal since it's
unclear that it is really needed and it's too complex).
On resources, I'd prefer to hold this, because implicit and explicit
types will completely handle that on CRDF (the resource() notation
will disappear on the next iteration of the document), but they are
still to vaguely defined on the document. In summary, anything that is
defined as a URI (either by the implicit type rules or by user's
explicit typing) will be a resource; after all a URI is a Universal
*Resource* Identifier.
On datatypes, it seems quite like a tie: RDFa's typeof and datatype
vs. CRDF's @|typeof and explicit type syntax; maybe a bit more complex
in CRDF due to the implicit type rules, but this is just a
convenience: you can explicitly define the type of everything if you
want to, and most cases will be quite obvious anyways. OTOH, RDFa also
has its own default type rules (for example, values are normally
string literals, but when they are taken from @href they are
URIs/resources).

Simplicity is actually just a secondary goal for CRDF, but it's
pursued whenever it can be achieved without compromising the primary
goals.

> Looking at the comparison given in section 4.2, CRDF appears to suffer
> from several disadvantages compared to RDFa:
>
> 1. It's pretty ugly.
This is completely subjective, so I won't go any deeper into it. If
you are capable of translating that into specific, objective issues,
then I'll be eager to hear about and try to address them.

> 2. It's more verbose - though only by eleven bytes by my reckoning, so
> this isn't a major issue.
This is a small cost. It is small because it's only eleven characters
(the actual ammount of bytes depends on the actual encoding chosen);
and because it only applies in quite specific scenarios: when all
metadata needs to be inlined. In exchange for that cost, we get other
benefits (primarily the iguana collection case: take the example in
4.3 and try to write the RDFa equivalent; then check how much the RDFa
code grows when adding the 4th, 5th, etc specimens to the collection).

> 3. It divorces the CURIE prefix definitions from the use of CURIEs in
> the markup. This makes it more vulnerable to copy-paste problems. (As I
> understand <link rel="metadata"> in the proposal, CURIE prefix
> definitions can even be separated out into an external file. This
> obscures them greatly and will certainly be a cause of copy-paste
> issues!)
You either missunderstood the way namespaces are handled in CRDF, or
are deliberately missrepresenting it. I prefer to asume the former
rather than the later, so let me clarify some things:
1st: RDFa allows you to define all the prefixes in <body> if you want.
This is not too different from CRDF prefixes defined within <head>.
2nd: CRDF allows you to put some prefixes in a <script> as close as
you need to the code that will use them. There is a whole subsection
in the document (3.4. Additional considerations, very likely to be
renamed in further versions to something more descriptive) that deals
with the scoping of such <script> elements so, if you were copying
content from multiple sources, it would be unlikely for

> 4. It's ugly. I'm sorry, I just can't emphasise that enough.
That's subjective. I just can't emphasize that enough. Again, if you
can turn that into specific issues, I'll do my best to give them the
propper attention.

> Apart from the fact that *sometimes* RDFa involves a bit of repetition,
> I don't see what problems this proposal is actually supposed to solve.
I'll make sure to clearly state the goals and use cases CRDF attempts
to address in the next version of the document. Until then, here is a
summary:
RDFa *sometimes* involves *a bit* of repetition; but on several cases,
it involves *a lot* of *error-prone* repetition. I want to emphasize
the "error-prone" aspect because, IMO, it is critical: if there are
errors in the metadata, then automated extraction is not reliable, so
manual extraction is required and the whole purpose of
machine-readable metadata is lost. In other words, RDFa's
error-prowness for some scenarios make it a too brittle solution.
Also, this makes maintenance of those pages terribly painful: a minor
change in structure may require dozens of identical changes across the
document, for example. There is also an issue of overbloating the
markup, even if that's more secundary. After reviewing many examples
of RDFa vs CRDF vs many other options, I have come up with an
abstraction that I think is quite accurate:
- When the semantics are conveyed to the user via prose, the best way
to convey these semantics to the machine is with inline metadata. On
these cases, RDFa does a good enough job because it's entirely focused
on inline metadata.
- When semantics are conveyed to the user via structure/layout (for
example, with tables or even lists), inline metadata becomes
inadequate, due to the error-prowness and brittleness issue described
above. If a table only needs to state its column headers (which are
the element that conveys the semantics to the user) once per column,
it seems obvious that the best approach would be to allow the author
to state these semantics also only once. A selector-based approach
like CRDF or EASE is one way to achieve this. There may be other ways,
but I don't know of any which doesn't heavily rely on HTML's
semantics.
- On the huge web, both cases should be taken in account. RDFa handles
the first (semantics-in-prose) case, but can't properly deal with the
second (semantics-in-structure) one. EASE, on the other hand, can do
an acceptable job on the second case, but can't handle the first at
all. This is why CRDF attempts to deal with both scenarios.

Another problem is intrussion in the host language. In the RDFa case,
the issue directly comes from its X-centrism (not to blame the RDFa
guys, keeping in mind that the format was built on a time when the
"XML is the future" idea was a quite common belief), to the point that
it's a non-issue for XHTML itself: with XHTML1.1 and 2.0
modularization, the many attributes in RDFa can be defined in a
separate module and, voilà, no conflict: you can use the RDFa module
in your DTD, or you can use some other module that makes a different
use of these attribute names. For tag-soup HTML, RDFa attributes are
too many croutons for the already loaded soup: they prevent the reuse
of these names for future uses, and add extra crowding to the already
crowded content model. Still within the intrussion problem, there is
the prefix issue: RDFa requires the host language to define namespace
prefixes: again, something reasonable in a "XML is the future" world,
since XML-based languages already have a built-in mechanism for this
task; but this raises too many problems on non-X HTML. For these
reasons, CRDF attempts to minimize intrussion: it only requires one
attribute, which the host language is free to name however they
please, and takes the prefix definition task on itself, so it doesn't
burden the host. Actually, I find it quite beautiful how the prefix
thing has, on itself, shielded CRDF against mixing it with CSS: CRDF
parsers would ignore non-prefixed properties; while CSS parsers would
puke at and ignore the prefixed ones.

> Repetition in practise seems to be something that page authors can deal
> with. We don't provide a mechanism for setting the src or alt attributes
> of multiple <img> elements which need to load the external image; or
> setting the class attribute of the third cell in every row of a table.
No, but we provide mechanisms to just avoiding multiple <img> elements
for the same image, or repeating the class attribute for the third row
in every table, to follow your examples.
For the table (simpler case), why would you repeat the class once and
again when you can select it via tr>td:nth-child(3)? (My apologies if
I messed up with the selector; I'm not realy used to the :nth-whatever
pseudoclasses, but I hope the idea is clear).
For the multiple img elements, the solution is still a draft, and you
can find it at [2].
However, I want to point out that content repetition and code
repetition is not the same: if you are authoring a page with repeated
content (for whatever reason), you can view that the content shows up
properly on your browser. For metadata code, this is tricker, since
you aren't dealing with something intended for humans to deal with
even if you review the output of a RDFa parser

> So again, while I can see that this proposal would "work", in what way
> is it supposed to be preferable to RDFa?
1) It's less intrussive for non-X HTML languages (less attributes, no
"xmlns issue").
2) It eases the maintenance and reduces the chance of errors on
documents where inline metadata is far less than optimal.
3) It can still handle inline metadata, regardless it's not limited to inlining.

On Thu, May 21, 2009 at 12:10 AM, Tab Atkins Jr. <jackalmage at gmail.com> wrote:
> On Wed, May 20, 2009 at 11:56 AM, Toby Inkster <mail at tobyinkster.co.uk> wrote:
>> [...]
>> 2. It's more verbose - though only by eleven bytes by my reckoning, so
>> this isn't a major issue.
>
> When used inline, it may be.  It's not *intended* to be used inline,
> though - that's just there for the occasional case when you absolutely
> need to do so, just as @style is available but discouraged in favor of
> external CSS.
Please, let me state again that CRDF is intended to be used *both*
inline and via selectors (be it on <script> or similar embedding
elements, or <link>ed from an external file), since both uses have
their own use cases and scenarios. The bit of extra verbosity on the
inline case, when compared to an inline-only format such as RDFa, is
just a small cost of this flexibility.

>> 3. It divorces the CURIE prefix definitions from the use of CURIEs in
>> the markup. This makes it more vulnerable to copy-paste problems. (As I
>> understand <link rel="metadata"> in the proposal, CURIE prefix
>> definitions can even be separated out into an external file. This
>> obscures them greatly and will certainly be a cause of copy-paste
>> issues!)
>
> If you're using inline CRDF, then yeah, the prefix definitions may be
> far from the content.
The point here is that in RDFa, the prefix definitions *may* also be
far from the content that uses them. Both in RDFa and in CRDF,
however, there is quite flexibility to put them quite close to the
content that uses them.
> The prefixes are defined globally for the document, and may appear anywhere.
Ehm. Please check "3.4. Additional considerations" in the doc:
prefixes in CRDF *can* be defined globally, but may also be defined
for a limited scope. Actually, this quite mimics the XMLish ability to
scope "xmlns:" definitions into any arbitrary element (although it's a
bit more limited, because it requires a <script> in CRDF).
> In practice, inline CRDF should be
> rare, and the prefixes should appear at the top of the .crdf file
> where they can be easily seen.
I wouldn't make valorations about how CRDF would be used in practice
until/unless CRDF actually gets used in practice. I have already
stated that inline CRDF is perfectly legitimate; and IMO it has a
wider range of use cases than inline @style.
Let me point out that, the same way it would be silly to put a
"xmlns:" definition in the <body> tag for a prefix that only gets used
once or twice, deep in the document; it would be equally silly to
define prefixes within <head> or externally for a similar usage.
Again, section 3.4 of the CRDF document is deliberately intended to
deal with these cases, and it has been there since the first version
posted on these lists (although I don't know for sure if the section
number was the same). Definitelly, I'll be giving it a more
descriptive name on the next iteration of the document.

On Thu, May 21, 2009 at 1:29 AM, Toby A Inkster <mail at tobyinkster.co.uk> wrote:
> On 20 May 2009, at 23:10, Tab Atkins Jr. wrote:
>> We are going to have to massively disagree on this point.  ^_^  I love
>> CSS syntax.
> So do I, but CRDF as defined is no more like CSS in terms of syntax than C
> or Perl are - they share the curly braces and semicolons, but not much else.
Ehm. What about Selectors? and the @namespace syntax, which is
directly taken from CSS3 Namespaces? Or the classic property:value
syntax? The calc(), attr(), and url() notations maybe?
Sure, there are differences with CSS. There are enough differences
that I put a rationale and justification to them on "2.2. Rationale on
the differences with CSS syntax", maybe you should review it. Keep in
mind that, if you don't like the shorthand syntax, you are also
allowed to explicitly put the "contents" keyword as much as you
please.
Maybe you could complain about the "|" inside properties, but it's
also a given: CSS3 Namespaces defines "|" for separating prefixes;
just that CSS would only use it on selectors, while CRDF uses it on
properties because, basically, CSS properties are not namespaced but
RDF ones are. Would it be better to define yet another separator? Even
if so, the colon is out of question: that would be a nightmare for
parsers, and horrendous to read.
If you complain about the "@|subject" pseudo-property syntax, I can
agree on it being ugly. Do you have any better suggestion? The reasons
to chose that syntax were:
1st: it needs to carry a "|" *or* use some other foolproofing
mechanism (to ensure that things don't become weird if an author makes
the mistake of putting the CSS and the CRDF on the same file).
2nd: to decide what to put before the "|", I thought that
pseudo-properties are quite comparable to at-rules, so the "@" seemed
a good choice. Other options considered where to put nothing (so we'd
have "|subject"), or using a reserved prefix (having something like
"crdf|subject"). I don't really like the reserved prefix idea, but if
the "@" is that much ugly, it may be taken as a "lesser evil". The
idea of no namespace at all might work. Originally, I discarded it
because I wasn't sure how would it interact with default namespaces;
but afterwards default namespaces were completely avoided, so I could
fall back to it if people find it less ugly. Feedback on either of
these alternatives, or any other suggestion, would be quite welcome.
What's clear is that there is a need to define, at least, subjects for
the triples, and their types.

>> It is rarely, if ever, necessary to set multiple <img> elements to the
>> same @src or @alt.
>
> I'm thinking of things like a table which has a check-mark column with a
> green tick image repeated all the way down, or a traffic-light indicator
> column with red, green and perhaps amber images indicating different
> statuses. I quite often see such things in web applications.
Let's take the traffic-light example, just because I find it clearer:
doesn't [2] deal with that? For example:
<body>
<head>
<style type="text/css">
td.red { content: url(http://example.org/images/red.png) }
td.green { content: url(http://example.org/images/green.png) }
</style>
</head>
<body>
<table><tr><th>state</th><th>whatever</th>
<tr><td class="red">Wrong</td><td>This has the red image</td></tr>
<tr><td class="green">Good</td><td>And this has the green image</td></tr>
</table>
</body>
</html>
In this case, you would need the "red" and "green" classes; but no
matter what you do, you need some mechanism to distinguish between the
"red" and "green" rows. Note also that if the images fail to load (for
example, images disabled or network issues), the browser would
fallback and show the contents of the element, which works mostly like
@alt would, but also allowing additional markup. And if you want
tooltips, just swap the @class by @title, and change the selectors to
td[title=red] and the same for green ;-).
So, in summary; repetition in general is perceived as a problem, and
this is why efforts are being made on addressing it.

Regards,
Eduard Pascual

PS: BTW, replying to three emails at once is *not* sane

References:
[1] [http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-May/019635.html]
[2] [http://www.w3.org/TR/css3-content/#replacedContent]