[whatwg] Extensible microdata attributes

Mon Jun 13 11:32:35 PDT 2011

On Mon, Jun 13, 2011 at 2:29 AM, Brett Zamir <brettz9 at yahoo.com> wrote:
> Thanks, that's helpful. Still would be nice to have item-* though...

Well, your idea for custom item-* attributes is just a way to more
concisely embed triples of non-visible data.  You already have a
mechanism for embedding non-visible triples (<meta> or <link>), so the
new method needs some decent benefits to justify the duplication of
functionality.

Additionally, while we recognize that non-visible data is sometimes
necessary to embed, we'd like to discourage its use as much as
possible (in general, non-visible data rots much faster).  One way to
do that is to make the syntax slightly cumbersome or ugly; when you
really need it, you can use it, but your aesthetic sense will keep it
from being the first tool you reach for.  So, making it easier or
prettier to embed non-visible triples is actually something we'd like
to avoid if we can.

>> Note, though, that Microdata or RDFa may not be quite appropriate for
>> this kind of thing.  You're not marking up data triples for later
>> extraction as independent data - you're doing in-band annotations of
>> the document itself.  As such, a different mechanism may be more
>> appropriate, such as your original design of using a custom markup
>> language in XML, or using custom attributes in HTML.  There's no
>> particular reason for these sorts of things to be readable by
>> arbitrary robots; it's sufficient to design for ones that know exactly
>> what they're reading and looking for.
>
> With the likes of Google offering Microdata-aware searches, I think it makes
> a whole lot of sense to allow rich documents such as TEI ones to enter as
> regular document citizens of the web, whereby the limited resources of such
> specialized semantic communities can leverage the general purpose and
> better-supported services such as Google's Microdata tool, while also having
> their documents editable within the likes of WYSIWYG HTML text editors, and
> stored on sites such as discussion forums or wikis where only HTML may be
> allowed and supported.
>
> I think such a focus would also enable the TEI community to benefit from
> reusing search-engine-recognized schemas where available, as well as helping
> the web community build new schemas for the unique needs of encoding
> academic texts.

I haven't yet looked into TEI's metadata scheme, but is the TEI
metadata actually something that needs to be known to search engines?
The one example you've presented in your emails, annotating that some
parts of a transcription were water-damaged (and thus possibly
inaccurate?), isn't something useful for search engines, but only
for humans looking at the document as a whole.

If most of the other metadata is similar, then the main reason to use
Microdata is to potentially make it easier to read or embed data via
Microdata-aware WYSIWYG editors (are there any?).  Or, possibly, to
use Microdata-extraction tools.  Is it useful to, for example, extract
all the water-damaged text from a document, minus the context in which
it appeared?

Otherwise, one might as well just use data-* attributes to mark up
triples directly on the subjects.  That would give you most of the
benefits with much less verbosity and more direct linkages between
data and metadata.  It would also be somewhat easier to style with
CSS:

<span data-tei-damage="water">
   Some water damaged words
</span>

span[data-tei-damage=water] {
 ...
}
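
A special-purpose reader for this kind of markup is also small.  As a
rough sketch only (Python standard library; the data-tei-damage
attribute is the one from the example above, while the names
DamageExtractor and extract_damaged are made up for illustration and
are not part of any Microdata or TEI tool), pulling out the annotated
text could look like this:

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they are not tracked as "open".
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class DamageExtractor(HTMLParser):
    """Collect the text inside elements that carry data-tei-damage."""

    def __init__(self):
        super().__init__()
        self.open = []    # stack of (tag, damage kind or None, text parts)
        self.found = []   # (kind, text) pairs, in the order elements close

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        kind = dict(attrs).get("data-tei-damage")
        self.open.append((tag, kind, []))

    def handle_endtag(self, tag):
        # Ignore stray end tags that match nothing currently open.
        if tag not in [t for t, _, _ in self.open]:
            return
        # Pop up to and including the matching element; anything popped
        # along the way was left unclosed and is treated as closed here.
        while self.open:
            t, kind, parts = self.open.pop()
            if kind is not None:
                text = " ".join("".join(parts).split())  # collapse whitespace
                self.found.append((kind, text))
            if t == tag:
                break

    def handle_data(self, data):
        # Text belongs to every annotated element that is currently open.
        for _, kind, parts in self.open:
            if kind is not None:
                parts.append(data)


def extract_damaged(html):
    """Return a list of (damage kind, text) pairs found in the markup."""
    parser = DamageExtractor()
    parser.feed(html)
    parser.close()
    return parser.found


sample = """<p>Clear text <span data-tei-damage="water">
   Some water damaged words
</span> and more clear text.</p>"""

print(extract_damaged(sample))
# prints [('water', 'Some water damaged words')]
```

Note that the result is the damaged text stripped of its surrounding
context, which is the case I'm asking about above.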

~TJ