[whatwg] A Selector-based metadata proposal (was: Annotating structured data that HTML has no semantics for)

Thu Jul 9 14:11:36 PDT 2009

On Thu, Jul 9, 2009 at 12:06 AM, Ian Hickson<ian at hixie.ch> wrote:
> On Wed, 10 Jun 2009, Eduard Pascual wrote:
>> >
>> > I think this is a level of indirection too far -- when something is a
>> > heading, it should _be_ a heading, it shouldn't be labeled opaquely
>> > with a transformation sheet elsewhere defining that it maps to the
>> > heading semantic.
>>
>> That doesn't make much sense. When something is a heading, it *is* a
>> heading. What do you mean by "should be a heading?".
>
> I mean that a conforming implementation should intrinsically know that the
> content is a heading, without having to do further processing to discover
> this.
>
> For example, with this CSS and HTML:
>
>   h1 { color: blue; }
>
>   <h1> Introduction </h1>
>
> ...the HTML processor knows, regardless of what else is going on, that the
> word "Introduction" is part of a heading. It only knows that the word
> should be blue after applying processing rules for CSS.
Now I think I got your point. However, I don't think it is really an
issue. Let's take a variant of your example:

CSS:
h1 { font-size: large; }

CRDF:
h1 { foo|MainHeading: contents; }

HTML:
<h1> Introduction </h1>

If we took the HTML alone (for example, if the CSS and CRDF are in
external files and fail to download), the browser will find an H1
element and it will know that it is a first-level heading. It will
also render it large by default (maybe depending on context; a voice
browser won't render anything as "large"). Now, if the CSS and CRDF
get processed, the browser will *also* know that it has to render it
large (now it's not just falling back to some default, it knows that
the author wanted the heading to render as large), and that it is
whatever the "foo" (or the namespace mapped by the "foo" prefix, to be
more specific) namespace defines as a "MainHeading", which will
probably be something quite similar to the browser's own concept of
"first-level heading".

The point here is: the CSS is stating that the <h1> should display
large, even though the browser would display it large in most cases.
Similarly, the CRDF is defining the <h1> as a MainHeading, even though
the browser already knows it is a heading. Both the CSS and the CRDF
provide redundant information. Of course, someone could attempt to
describe semantics through CRDF that conflict with HTML's, but then
one could also make headings smaller, hide <strong>s and enlarge
<small>s with CSS.

No matter what CRDF says, a compliant HTML browser will always know
that <h1> is a heading (and similarly, will know what other HTML
elements mean). But if what CRDF says is consistent with what the HTML
says (the main point of metadata is stating things that are true;
false data is almost useless), then RDF tools that are completely
unaware of HTML itself can still know that something is a heading. In
the same way, when CSS is consistent with HTML's semantics (for example
making headings large, <strong>s bold, or <em>s italicized), a user
viewing the page can perceive that something is a heading, important,
or emphasized, respectively.
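
To make the "RDF tools unaware of HTML" point a bit more concrete, here
is a rough Python sketch of a tool that knows nothing about what <h1>
means and only applies a rule table. The RULES dictionary is a
hypothetical stand-in for the CRDF rule above (it is not real CRDF
syntax or any real CRDF API), and the (property, text) pairs are a
simplification of actual RDF statements:

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the CRDF rule `h1 { foo|MainHeading: contents; }`:
# maps an element name to the property its contents should be reported as.
RULES = {"h1": "foo:MainHeading"}


class RuleExtractor(HTMLParser):
    """Collects (property, text) pairs for elements matched by the rules."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.open_tag = None   # matched element currently being read, if any
        self.buffer = []       # text pieces seen inside that element
        self.statements = []   # collected (property, text) pairs

    def handle_starttag(self, tag, attrs):
        # The tool has no built-in notion of "heading"; it only checks the rules.
        if self.open_tag is None and tag in self.rules:
            self.open_tag = tag
            self.buffer = []

    def handle_data(self, data):
        if self.open_tag is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.open_tag:
            text = "".join(self.buffer).strip()
            self.statements.append((self.rules[tag], text))
            self.open_tag = None


def extract(html, rules=RULES):
    """Return the (property, text) pairs that the rules produce for `html`."""
    parser = RuleExtractor(rules)
    parser.feed(html)
    parser.close()
    return parser.statements


print(extract("<h1> Introduction </h1>"))
```

With an empty rule table the same tool reports nothing, which matches
the case where the CRDF fails to download: the browser still knows
about the heading, but an HTML-unaware RDF tool does not.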

> I think by and large the same should hold for more elaborate semantics.
>
>
> (I didn't really agree with your other responses regarding my criticisms
> of your proposal either, but I don't have anything except my opinions to
> go on as far as those go, so I can't argue my case usefully there.)
Most of those responses were based on what is brewing for the next
version of the document, rather than the version actually available,
so I don't think it's worth going further on those points until the
update is ready and up.

>> > I think CRDF has a bright future in doing the kind of thing GRDDL does,
>>
>> I'm not sure about what GRDDL does: I just took a look through the spec,
>> and it seems to me that it's just an overcomplication of what XSLT can
>> already do; so I'm not sure if I should take that statement as a good or
>> a bad thing.
>
> A good thing.
>
> GRDDL is a way to take an HTML page and infer RDF information from that
> page despite the page, e.g. by "implementing" Microformats using XSLT. So
> for example, GRDDL can be used to extract hCard data from an HTML page and
> turn it into RDF data.
Ok. Making metadata available from documents that were not authored
with metadata in mind, and without altering the document itself (at
most adding a <link> to the header), is one of the use cases CRDF aims
to handle; so it's good news to hear from someone that it's on the
right track to achieve it ^^-

>> > It's an interesting way of converting, say, Microformats to RDF.
>>
>> The ability to convert Microformats to RDF was intended (although not
>> fully achieved: some "bad" content would be treated differently between
>> CRDF and Microformats); and in the same way CRDF also provides the
>> ability to define de-centralized Microformats.org-like vocabularies (I'm
>> not sure if referring to these as "microformats" would still be
>> appropriate).
>
> I think this is a particularly useful feature; I would encourage you to
> continue to develop this idea as a separate language, and see if there is
> a market for it.
The reasoning was quite simple, something like this: "if the only bad
thing about Microformats is centralization, then something that allows
decentralized microformats should be a good thing".
Currently all Microformats can be implemented on CRDF (assuming there
is a suitable RDF vocabulary to map them to), in a way that is 100%
compatible for "good" content (and using just .class selectors and the
descendant combinator). The problem is only with some forms of bad
content (specifically, when a "singular" property is stated multiple
times). Unfortunately, I have not found any decent way to select just
the first appearance of a class, and I'm afraid it might need some
form of the :matches() pseudo-class to be achievable. I'll keep
working on that anyway; maybe I can figure out something.
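
To illustrate the "singular property" problem, here is a small Python
sketch. It emulates what a ".vcard .fn"-style rule (class selectors
plus the descendant combinator) would match, and contrasts that with
keeping only the first match. The class names, the sample markup and
the helper names are just assumptions for the example, the "first
occurrence wins" behaviour is my assumption about what the result
should be, and the code assumes properly nested markup with a single
root element:

```python
from html.parser import HTMLParser

# Elements without an end tag; skipped so they do not unbalance the stack.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class ClassCollector(HTMLParser):
    """Collects the text of descendants of `root_class` that carry `prop_class`."""

    def __init__(self, root_class, prop_class):
        super().__init__()
        self.root_class = root_class
        self.prop_class = prop_class
        self.stack = []            # class sets of the currently open elements
        self.capture_depth = None  # stack depth of the element being captured
        self.buffer = []
        self.values = []           # captured texts, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = set((dict(attrs).get("class") or "").split())
        # Descendant combinator: some open ancestor must carry root_class.
        inside_root = any(self.root_class in c for c in self.stack)
        self.stack.append(classes)
        if self.capture_depth is None and inside_root and self.prop_class in classes:
            self.capture_depth = len(self.stack)
            self.buffer = []

    def handle_data(self, data):
        if self.capture_depth is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not self.stack:
            return
        if self.capture_depth == len(self.stack):
            self.values.append("".join(self.buffer).strip())
            self.capture_depth = None
        self.stack.pop()


def all_values(html, root_class, prop_class):
    """Every match, which is what plain class selectors would yield."""
    parser = ClassCollector(root_class, prop_class)
    parser.feed(html)
    parser.close()
    return parser.values


def first_value(html, root_class, prop_class):
    """Only the first match, or None when nothing matches."""
    values = all_values(html, root_class, prop_class)
    return values[0] if values else None


bad = ('<div class="vcard"><span class="fn">Alice</span> '
       '<span class="fn">Bob</span></div>')
print(all_values(bad, "vcard", "fn"))
print(first_value(bad, "vcard", "fn"))
```

For "good" content (a single "fn") both functions agree; they only
differ for the bad content above, which is exactly the case I have not
found a decent way to express with selectors alone.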

Greetings,
Eduard Pascual