[whatwg] A Selector-based metadata proposal (was: Annotating structured data that HTML has no semantics for)
ian at hixie.ch
Tue Jun 9 16:29:15 PDT 2009
On Thu, 14 May 2009, Eduard Pascual wrote:
> I have put online a document that describes my idea/proposal for a
> selector-based solution to metadata. The document can be found at
> http://herenvardo.googlepages.com/CRDF.pdf Feel free to copy and/or link
> the file wherever you deem appropriate.
> Needless to say, feedback and constructive criticism to the proposal is
> always welcome. (Note: if discussion about this proposal should take
> place somewhere else, please let me know.)
This proposal is very similar to RDF EASE. While I sympathise with the
goal of making semantic extraction easier, I feel this approach has
several fundamental problems which make it inappropriate for the specific
use cases that were brought up and which resulted in the microdata proposal:
* It separates (by design) the semantics from the data with those
semantics. I think this is a level of indirection too far -- when
something is a heading, it should _be_ a heading, it shouldn't be
labeled opaquely with a transformation sheet elsewhere defining that it
maps to the heading semantic.
* It is even more brittle in the face of copy-and-paste and regular
maintenance than, say, namespace prefixes. It is very easy to forget to
copy the semantic transformation rules. It is very easy to edit the
document such that the selectors no longer match what they used to
match. It's not at all obvious from looking at the page that there are
* It relies on selectors to do something subtle. Authors have a great
deal of trouble understanding selectors -- if you watch a typical Web
author writing CSS, he will either use just class selectors, or he
will write selectors by trial and error until he gets the style he
wants. This isn't fatal for CSS because you can see the results right
there; for something as subtle as semantic data mining, it is extremely
likely that authors will make mistakes that turn their data into
garbage, which would make the feature impractical for large-scale use.
I say this despite really wanting Selectors to succeed (disclosure: I'm
one of the editors of the Selectors specification and spent years working
on its test suite).
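
To make the brittleness point in the list above concrete, here is a rough
sketch. It is not CRDF syntax: the rule sheet format is made up for
illustration, and it uses the limited XPath subset in Python's
xml.etree.ElementTree as a stand-in for Selectors. The position-based rules
work on the original table, then silently return the wrong data once someone
adds a leading column during ordinary maintenance.

```python
import xml.etree.ElementTree as ET

# Hypothetical out-of-band rule sheet: property name -> path into the table.
# These are ElementTree XPath-subset paths standing in for selectors.
RULES = {"name": "./tr/td[1]", "color": "./tr/td[2]"}

def extract(table_xml, rules):
    """Apply each rule to the table and collect the matched cell text."""
    table = ET.fromstring(table_xml)
    return {
        prop: [td.text.strip() for td in table.findall(path)]
        for prop, path in rules.items()
    }

ORIGINAL = (
    "<table>"
    "<tr><td> Hedral </td><td> Black </td></tr>"
    "<tr><td> Pillar </td><td> White </td></tr>"
    "</table>"
)

# The same table after an editor adds a leading "number" column.
# Nothing in the markup hints that the rule sheet needs updating too.
EDITED = (
    "<table>"
    "<tr><td> 1 </td><td> Hedral </td><td> Black </td></tr>"
    "<tr><td> 2 </td><td> Pillar </td><td> White </td></tr>"
    "</table>"
)

before = extract(ORIGINAL, RULES)
after = extract(EDITED, RULES)
```

Here `before` maps "name" to the cat names and "color" to the colors, while
`after` maps "name" to the row numbers and "color" to the cat names, with no
error raised anywhere.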
I think CRDF has a bright future in doing the kind of thing GRDDL does,
and in extracting data from pages that were written by authors who did not
want to provide semantic data (i.e. screen scraping). It's an interesting
way of converting, say, Microformats to RDF.
Having said that, I do agree that the repetition that microdata requires in
common scenarios with blocks of repeated data is unfortunate. It is worse
than the repetition one has just from the basic HTML markup.

   <td> Hedral <td> Black
   <td> Pillar <td> White

   <td itemprop=name> Hedral <td itemprop=color> Black
   <td itemprop=name> Pillar <td itemprop=color> White

   <td itemprop=com.example.name> Hedral <td itemprop=com.example.color> Black
   <td itemprop=com.example.name> Pillar <td itemprop=com.example.color> White

...which is far more verbose than ideal.
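
For comparison, the consumer side of the per-cell markup above can stay
quite small. The following is a minimal sketch using Python's standard
html.parser. It is not the microdata extraction algorithm (it ignores item
scoping entirely), and the class and function names are invented for
illustration. It only pairs each cell's itemprop with the cell's text.

```python
from html.parser import HTMLParser

class ItempropCollector(HTMLParser):
    """Collect (itemprop, text) pairs from table cells. Illustrative only."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._name = None    # itemprop of the cell currently being read
        self._chunks = []    # text seen so far inside that cell

    def flush(self):
        # Close off the current cell, if it carried an itemprop.
        if self._name is not None:
            self.pairs.append((self._name, "".join(self._chunks).strip()))
        self._name, self._chunks = None, []

    def handle_starttag(self, tag, attrs):
        # End tags for cells are optional in HTML, so a new cell or row
        # implicitly ends the previous cell.
        if tag in ("td", "th", "tr"):
            self.flush()
            if tag != "tr":
                self._name = dict(attrs).get("itemprop")

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self.flush()

    def handle_data(self, data):
        if self._name is not None:
            self._chunks.append(data)

def extract(markup):
    parser = ItempropCollector()
    parser.feed(markup)
    parser.close()   # delivers any trailing text still buffered
    parser.flush()   # the last cell has no following tag to end it
    return parser.pairs

rows = (
    "<td itemprop=name> Hedral <td itemprop=color> Black\n"
    "<td itemprop=name> Pillar <td itemprop=color> White"
)
pairs = extract(rows)
```

The cost of this simplicity is exactly the repetition shown above: every
cell has to name its own property.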
I considered special casing tables (using <col itemprop> to set
itemprop="" for all cells in a column) but it would require quite a lot of
complexity in processors since they'd additionally have to implement the
table model, and having seen the quality of some of the implementations of
metadata extractors used on Web content, I fear that that will be far too
much complexity. (I fear even subject="" might already be too much.) The
simpler we make it the more reliable it will be.
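
As a rough illustration of what the <col itemprop> idea would ask of
processors, here is a sketch of the simplest possible version. Note that
<col itemprop> is only the option considered and rejected above, not part of
any spec, and the names below are invented. The sketch tracks a column index
per row so that cells can inherit a property from their column. It
deliberately ignores colspan, rowspan and <col span>; handling those
correctly is the table-model complexity in question.

```python
from html.parser import HTMLParser

class ColItempropCollector(HTMLParser):
    """Let cells inherit itemprop from a (hypothetical) <col itemprop>."""

    def __init__(self):
        super().__init__()
        self.col_props = []  # itemprop declared on each <col>, in order
        self.pairs = []
        self._col = 0        # column index of the next cell in this row
        self._name = None
        self._chunks = []

    def flush(self):
        if self._name is not None:
            self.pairs.append((self._name, "".join(self._chunks).strip()))
        self._name, self._chunks = None, []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "col":
            self.col_props.append(attrs.get("itemprop"))
        elif tag == "tr":
            self.flush()
            self._col = 0
        elif tag in ("td", "th"):
            self.flush()
            # Naive column lookup: assumes one cell per column, no spans.
            inherited = None
            if self._col < len(self.col_props):
                inherited = self.col_props[self._col]
            self._name = attrs.get("itemprop", inherited)
            self._col += 1

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self.flush()

    def handle_data(self, data):
        if self._name is not None:
            self._chunks.append(data)

def extract_with_cols(markup):
    parser = ColItempropCollector()
    parser.feed(markup)
    parser.close()
    parser.flush()
    return parser.pairs

table = (
    "<table><col itemprop=name><col itemprop=color>\n"
    "<tr><td> Hedral <td> Black\n"
    "<tr><td> Pillar <td> White\n"
    "</table>"
)
col_pairs = extract_with_cols(table)
```

Even this toy version needs per-row state that the per-cell approach does
not, and a single colspan or rowspan in the table would make the naive
column lookup wrong.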
It also wouldn't solve the problem with other patterns, e.g. <dl> (which
approaches like CRDF's handle fine).
I don't have a good answer for the repetition problem.
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'