[whatwg] Extensible microdata attributes

Tue Jun 21 11:21:54 PDT 2011

On 6/14/2011 2:32 AM, Tab Atkins Jr. wrote:
> On Mon, Jun 13, 2011 at 2:29 AM, Brett Zamir <brettz9 at yahoo.com> wrote:
>> Thanks, that's helpful. Still would be nice to have item-* though...
> Well, your idea for custom item-* attributes is just a way to more
> concisely embed triples of non-visible data.  You already have a
> mechanism for embedding non-visible triples (<meta> or <link>), so the
> new method needs some decent benefits to justify the duplication of
> functionality.
HTML could have been created without attributes too, but if a feature is 
going to be used frequently enough, concision is a big selling point (as 
is non-redundant styleability).
> Additionally, while we recognize that non-visible data is sometimes
> necessary to embed, we'd like to discourage its use as much as
> possible (in general, non-visible data rots much faster).  One way to
> do that is to make the syntax slightly cumbersome or ugly - when you
> really need it, you can use it, but your aesthetic sense will keep it
> from being the first tool you reach for.  So, making it easier or
> prettier to embed non-visible triples is actually something we'd like
> to avoid if we can.

People who are going to go to the trouble of adding semantics which do 
nothing for visual rendering are probably going to have some idea of 
what they are doing. And if there is an adequately convenient method, 
they will have the chance to learn from experience about the right balance.

And is my idea really encouraging "hidden" meta-data?

Even in my own example of using water damage:

<span itemprop="damage" item-agent="water">
     So blurry....
</span>

...this is allowing some extensibility (by allowing an indefinite number 
of attributes), but conceptually it is not so different from:

<span itemprop="water-damage">
     So blurry....
</span>

...which no one is calling "hidden".

My suggestion is actually /helping/ avoid hidden meta tags not directly 
associated with an element encapsulating visible text.
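To make the point concrete, here is a rough sketch (in Python, using only the standard library's html.parser) of how a consumer might read the proposed attributes. Note that item-* is only the extension being suggested in this thread, not part of the Microdata spec, and the class name ItemAttrCollector and the record layout are made up for illustration. The sketch only looks at elements that wrap text, so every piece of metadata it reports stays attached to visible content:

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they are kept off the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ItemAttrCollector(HTMLParser):
    """Collect itemprop elements plus any (hypothetical) item-* attributes."""

    def __init__(self):
        super().__init__()
        self.results = []  # finished records, in order of closing tag
        self._stack = []   # one entry per open element; None if no itemprop

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        record = None
        if "itemprop" in attrs:
            record = {
                "itemprop": attrs["itemprop"],
                # item-agent="water" becomes {"agent": "water"}
                "extras": {name[len("item-"):]: value
                           for name, value in attrs.items()
                           if name.startswith("item-")},
                "text": "",
            }
        self._stack.append(record)

    def handle_data(self, data):
        # Text belongs to every enclosing itemprop element.
        for record in self._stack:
            if record is not None:
                record["text"] += data

    def handle_endtag(self, tag):
        if tag in VOID or not self._stack:
            return
        record = self._stack.pop()
        if record is not None:
            record["text"] = record["text"].strip()
            self.results.append(record)


collector = ItemAttrCollector()
collector.feed('<span itemprop="damage" item-agent="water">\n'
               '     So blurry....\n'
               '</span>')
for record in collector.results:
    print(record)
```

This assumes reasonably well-balanced markup; a real consumer would need proper error recovery. The only difference from the itemprop="water-damage" form is that the "extras" dictionary is non-empty.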

>>> Note, though, that Microdata or RDFa may not be quite appropriate for
>>> this kind of thing.  You're not marking up data triples for later
>>> extraction as independent data - you're doing in-band annotations of
>>> the document itself.  As such, a different mechanism may be more
>>> appropriate, such as your original design of using a custom markup
>>> language in XML, or using custom attributes in HTML.  There's no
>>> particular reason for these sorts of things to be readable by
>>> arbitrary robots; it's sufficient to design for ones that know exactly
>>> what they're reading and looking for.
>> With the likes of Google offering Microdata-aware searches, I think it makes
>> a whole lot of sense to allow rich documents such as TEI ones to enter as
>> regular document citizens of the web, whereby the limited resources of such
>> specialized semantic communities can leverage the general purpose and
>> better-supported services such as Google's Microdata tool, while also having
>> their documents editable within the likes of WYSIWYG HTML text editors, and
>> stored on sites such as discussion forums or wikis where only HTML may be
>> allowed and supported.
>>
>> I think such a focus would also enable the TEI community to benefit from
>> reusing search-engine-recognized schemas where available, as well as helping
>> the web community build new schemas for the unique needs of encoding
>> academic texts.
> I haven't yet looked into TEI's metadata scheme, but is the TEI
> metadata actually something that needs to be known to search engines?
> The one example you've presented in your emails, annotating that some
> parts of a transcription were water-damaged (and thus presumably
> possibly inaccurate?), isn't something useful for search engines, but
> only for humans looking at the document as a whole.
It could be useful to a search engine. If I remembered that some text 
was water-damaged, I could specify that I only wanted to look for 
water-damaged text (with the TEI itemtype).

But I used the water damage example to show something very minute and 
concrete. I could have given examples of more frequent use cases, such 
as finding a particular component of a structured bibliography or 
finding all quotations attributed to a particular author.

Search engines could of course be employed not only for searching the 
whole web, but for searching a particular site.

> If most of the other metadata is similar, then the only reason to use
> Microdata is to potentially make it easier to read/embed data via
> Microdata-aware WYSIWYG editors (are there any?).  Or, possibly, to
> use Microdata-extraction tools.
My point about editors was that relative to TEI XML, TEI in HTML could 
be put into editors. Relative to other approaches like using data-*, it 
would not be a particular advantage, outside of the fact that data-* is 
meant only to be used by the specific site, not for republishing by 
others. For example, if a publisher of a TEI Bible encoded a ton of 
semantics, using data-* to do so would let the document be previewable 
in a text editor or shared on a wiki, but it would not be using a 
recognized mechanism for semantics.
>   Is it useful to, for example, extract
> all the water-damaged text from a document, minus the context in which
> it appeared?
It could be. Scholars might be interested in many different aspects of a 
document:

* Finding all of the unique closings of a letter writer.
* Using the semantics as hooks for transformations, such as finding all 
of the letters whose openers begin after a certain date.
* Finding quotations attributed to a particular person.

Many other possibilities exist using the rich semantic detail of TEI (as 
one can see by browsing 
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/REF-ELEMENTS.html 
and http://www.tei-c.org/release/doc/tei-p5-doc/en/html/REF-ATTS.html). 
XQuery meets this need rather well for XML and is familiar to that 
community (and is just starting to become available to JavaScript via 
XQIB), though jQuery or querySelectorAll could meet the need very well 
in HTML.
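As a rough illustration of the kind of query meant here, the following sketch finds the text of all quotations attributed to one person. It is written in Python with the standard library's html.parser rather than jQuery or XQuery, purely so it can be run outside a browser. The markup is an assumption: item-who is a hypothetical instance of the proposed item-* attributes, "quotation" is a made-up property name, and the names select_text and AttrQuery are invented for this example.

```python
from html.parser import HTMLParser

# Void elements have no end tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class AttrQuery(HTMLParser):
    """Collect the text of elements whose attributes include a required set."""

    def __init__(self, required):
        super().__init__()
        self.required = required
        self.matches = []
        self._depth = 0    # nesting depth inside the current match, 0 if none
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self._depth:
            self._depth += 1  # a child element of the current match
        elif all(dict(attrs).get(k) == v for k, v in self.required.items()):
            self._depth = 1
            self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace so wrapped source lines read cleanly.
            self.matches.append(" ".join("".join(self._buffer).split()))


def select_text(html, required):
    """Return the text of each outermost element matching `required`."""
    query = AttrQuery(required)
    query.feed(html)
    query.close()
    return query.matches


# Hypothetical markup; the property name and item-who are assumptions.
SAMPLE = """
<p>
  <q itemprop="quotation" item-who="Smith">First quoted passage.</q>
  <q itemprop="quotation" item-who="Jones">Second quoted passage.</q>
  <q itemprop="quotation" item-who="Smith">A third <em>passage</em> with emphasis.</q>
</p>
"""

print(select_text(SAMPLE, {"itemprop": "quotation", "item-who": "Smith"}))
```

In a browser, roughly the same selection would be an attribute selector along the lines of q[itemprop=quotation][item-who=Smith], which is part of why keeping the semantics in attributes on the element itself is attractive.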

Of course, search engines might not be offering users the ability to 
make open-ended XQueries overnight, but some targeting across the web 
would still be very powerful.

> Otherwise, one might as well just use data-* attributes to mark up
> triples directly on the subjects.  That would give you most of the
> benefits with much less verbosity and more direct linkages between
> data and metadata.  It would also be somewhat easier to style with
> CSS:
>
> <span data-tei-damage="water">
>     Some water damaged words
> </span>
>
> span[data-tei-damage=water] {
>   ...
> }
Yes, thanks for that, but I'd really like to avoid the additional 
redundancy here of needing to add both CSS and <meta/> tags (which would 
be necessary in order to have the extra information be recognized as 
universally semantic, rather than application-specific, markup--for the 
sake of the benefits of search engine discoverability, for one).

Best wishes,
Brett