[whatwg] Extensible microdata attributes

Sat Jun 11 04:20:36 PDT 2011

On 4/27/2011 9:06 PM, Benjamin Hawkes-Lewis wrote:
> On Wed, Apr 27, 2011 at 3:54 AM, Brett Zamir<brettz9 at yahoo.com>  wrote:
>> Thanks for the references. While this may be relevant for the likes of blogs
>> and other documents whose requirements for semantic density are limited
>> enough to allow such reshaping for practical effect and whose content is
>> reshapeable by the content creator (as opposed to republishing of already
>> completed books), for more semantically dense content, such as the types of
>> classical documents marked up by TEI, it is simply not possible to expose
>> text for each bit of semantic information or to generate new text to meet
>> that need. And of course, even with microformats/microdata as it is now, the
>> semantic content itself is not necessarily exposed just because text is
>> visible on the page.
>>
>> The issue of discoverability is I think more related to how it will be
>> consumed or may be consumed. And even if some pieces of information are less
>> discoverable, it does not mean that they have no value. For such rich
>> documents, a lot of attention is being paid to these texts since they are
>> themselves considered important enough to be worth the time.
>>
>> If the Declaration of Independence of the United States were marked up with
>> hidden information about prior emendations, their likely reasons, etc., or
>> about suspected authors of particular passages, or the United Nations
>> Declaration of Human Rights were marked up to indicate which countries have
>> expressed reservations (qualifications) about which rights, while a browsing
>> application or query tool ought to be able (optionally) to expose this hidden
>> information, there is no automatic need for the markup to be polluted with
>> extra (hidden) (and especially URI-based or other non-textual) tags when an
>> attribute would suffice.
>>
>> For things that are truly important, there may be a great deal of care put
>> into building up many layers which are meant to be peeled away, and it is
>> worth allowing some of that information (particularly the non-textual
>> information, e.g., the conditions of authorship, publisher, etc.),
>> especially that which the original publication did not expose, to be still
>> selectively revealed to queries and deeper browsing.
>>
>> If a site like Wikisource (the online library sister project of Wikipedia's)
>> would be able to offer such officially sanctioned semantic attributes,
>> classic texts could become enhanced in this way over time, with the wiki
>> exposing the hidden semantic information, which indeed may not be as
>> important as the visible text, but with queries by interested users, any
>> problems in encoding could be discovered just as well.
> Your email challenges the principle of visible data on four different grounds:
>
>     1. You note even proponents of visible data do not always show their data.
> But the microformats community only endorse hidden metadata for annotating
> human-friendly visible data (e.g. "mercredi prochain") with a machine-readable
> equivalent (e.g. an ISO 8601 formatted date). They do not endorse hidden
> metadata without visible equivalents against which it can be cross-checked.
>
>     2. You imply editorial effort can offset the error-proneness of hidden
> metadata. But the same extraordinary editorial effort would yield even greater
> accuracy if it went towards creating visible data rather than hidden metadata.
>
>     3. You claim tool-assisted queries by end-users against the hidden metadata
> will reveal errors at the same rate as visible data. But this is doubtful, in
> so far as many queries will obfuscate context whereas simply reading through the
> text encourages serendipitous error discovery. For example, I could issue a
> query asking what proportion of the Declaration of Independence is suspected to
> be authored by John Adams. A percentage answer would not reveal the odd
> misattributed passage. By contrast, if I'm a scholar of the Declaration and am
> reading through the text and I happen to see a suspiciously Jeffersonian
> passage visibly attributed to John Adams, I'm much more likely to notice the
> error.
Of course a visible attribution is helpful, but one cannot possibly 
visibly represent all information one might wish to add, especially if 
one does not wish to clutter the view hopelessly. Meta-data can be 
available to searching, and if search engines don't wish to take 
advantage of it, at least individual document queries can do so.
>     4. You assert that it is not viable to make multiple layers of rich data
> visible in a single view. I'd make the counterargument that on the web, unlike
> in print, it is economical to dynamically construct different views and filters
> of a document and its various visible data streams on the client, on the
> server, or on some combination of the two. The HTML5
> specification itself is a great example of this. The source text is kept in a
> repository that stores changes to the text, along with date and rationale.
> Multiple views of this source text are then generated serverside: the source
> text is carved up into multiple draft specs for W3C and a single mammoth
> specification for WHATWG. The HTML spec is provided in a browser-crashing
> single document view and in a multipage view. On top of this, there is
> clientside filtering in the form of an in-page control that can produce a web
> author view by hiding technical text aimed at browser vendors.
>
Sometimes projects simply wish to make the meta-data available and let 
consumers determine how to display it. If someone has a good idea about 
how to manage the display (or editing) of meta-data, all power to them, 
but this does not mean that the original document creator should be 
forced to create every possible use when their interest and 
responsibility may simply be properly defining the semantics in use.

In any case, the specification has allowed in-body <meta/> as you point 
out, so hidden meta-data is thankfully available to authors.
> If you're keen on using the TEI vocabulary to meet the Wikisource use case,
> there's no particular reason why you couldn't convert Wiki markup to TEI source
> text, serve TEI directly over the web, and also generate various HTML views of
> visible rich data from the TEI (for example, with XSLT). The Perseus project
> uses TEI and HTML in combination a bit like that:
>
> http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3atext%3a1999.01.0199
Thanks, but I'm not a fan of custom solutions, since, similar to the 
"many eyes" view you are espousing for exposing meta-data visually, I 
believe such solutions leave different semantic communities out of the 
benefits of utilizing and contributing to general purpose solutions. For 
example, I'd like TEI to be serialized such that it can take advantage 
of tools exclusive to HTML such as WYSIWYG editors, wikis which 
whitelist only certain elements and attributes, etc., and have the TEI 
community engaged in enhancing the same Microdata schemas (such as those 
detailed on http://schema.org) available to all on the web. No reason 
for the same "apple" to be expressed in a hundred different vocabularies.
> But let's say you were determined to serve up a single HTML document with lots
> of hidden metadata. None of microformats, microdata, and RDFa were designed to
> do this. But both microdata and RDFa allow you to do so in a conforming manner
> using the @content attribute. In WHATWG HTML, this is restricted to the "meta"
> element, but the "meta" element is now allowed amidst body text so it can apply
> to individual sections of the document, rather than just the whole document.
> In W3C HTML+RDFa, the @content attribute is allowed on any element.
>
> In other words, where your examples currently abuse the skinning layer
> ("display: none") to preserve logical text flow, they should actually be using
> meta@content instead; there is no need for "ugly hacks" even if the markup
> becomes more verbose than you might like.
I had not been aware of <meta/> being available in-body, thank you.

However, my item-* proposal, besides being more succinct in the case of 
attribute content, allows for targeted styling of elements which <meta/> 
currently would not.

For example, take a water-damaged text (see the TEI element 
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-damage.html ), 
which in TEI could be expressed as:

<damage agent="water" xmlns="http://www.tei-c.org/ns/1.0/">Some water 
damaged words</damage>

This might currently be represented in Microdata as:

<span itemprop="damage" itemscope="" 
itemtype="http://www.tei-c.org/ns/1.0/">
<meta itemprop="agent" content="water"/>
     Some water damaged words
</span>

But there is no "parent combinator" selector such that the following 
(also cumbersome) selector would work:

span[itemprop=damage] <  meta[itemprop=agent][content=water] {
     text-shadow: 2px 2px 16px #2b2b2b;
}

While admittedly, I perhaps should be directing my request to the CSS 
group, I still think it highlights the unnecessary burden of forcing the 
use of child elements when attributes are more reasonable.
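
Lacking such a selector, one workaround is to pre-process the markup 
and copy the child's value onto the parent as a class, which CSS can 
then target. The following is only a rough sketch, assuming the 
fragment is well-formed XML (as the example above is); the function 
name mark_damaged and the class name "damage-water" are made up for 
illustration and are not part of Microdata or any specification:

```python
import xml.etree.ElementTree as ET


def mark_damaged(fragment, agent="water", css_class="damage-water"):
    """Add css_class to each span[itemprop=damage] that has a direct
    <meta itemprop="agent" content=agent> child (hypothetical helper)."""
    root = ET.fromstring(fragment)
    # iter() also visits the root element itself if it is a span
    for span in root.iter("span"):
        if span.get("itemprop") != "damage":
            continue
        for child in span:  # direct children only
            if (child.tag == "meta"
                    and child.get("itemprop") == "agent"
                    and child.get("content") == agent):
                existing = span.get("class", "")
                span.set("class", (existing + " " + css_class).strip())
                break
    return ET.tostring(root, encoding="unicode")


fragment = (
    '<span itemprop="damage" itemscope="" '
    'itemtype="http://www.tei-c.org/ns/1.0/">'
    '<meta itemprop="agent" content="water"/>'
    'Some water damaged words</span>'
)
print(mark_damaged(fragment))
```

A rule such as span.damage-water { ... } would then apply, but at the 
cost of an extra processing step that an attribute would avoid.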

In my item-* proposal, it would be nicely expressed as:

<span itemprop="damage" item-agent="water" itemscope="" 
itemtype="http://www.tei-c.org/ns/1.0/">
     Some water damaged words
</span>

which works fairly well in CSS too:

span[itemprop=damage][item-agent=water] {
     text-shadow: 2px 2px 16px #2b2b2b;
}

This offers a conveniently condensed syntax, while also ensuring 
discoverability of the prefixed Microdata attributes.

Especially as more attributes are needed (kept simple for this example), 
it becomes easier to handle (and cleaner), even if it admittedly adds a 
little work to crawlers to detect this different approach.
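
To give a sense of how little extra work that might be, here is a 
minimal sketch of a consumer that gathers the proposed item-* 
attributes from elements carrying itemscope. It uses only Python's 
standard html.parser; the "item-" prefix is the one from my proposal 
(not anything in the current Microdata spec), and the class and 
function names are hypothetical:

```python
from html.parser import HTMLParser

PREFIX = "item-"  # proposed prefix; not part of current Microdata


class ItemAttrCollector(HTMLParser):
    """Collect item-* attributes from elements that carry itemscope."""

    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "itemscope" not in attr_map:
            return
        # Strip the prefix so item-agent="water" becomes agent -> water
        props = {
            name[len(PREFIX):]: value
            for name, value in attrs
            if name.startswith(PREFIX) and len(name) > len(PREFIX)
        }
        self.items.append({
            "itemprop": attr_map.get("itemprop"),
            "itemtype": attr_map.get("itemtype"),
            "properties": props,
        })


def collect_item_attrs(html):
    parser = ItemAttrCollector()
    parser.feed(html)
    parser.close()
    return parser.items


sample = (
    '<span itemprop="damage" item-agent="water" itemscope="" '
    'itemtype="http://www.tei-c.org/ns/1.0/">'
    'Some water damaged words</span>'
)
print(collect_item_attrs(sample))
```

A real crawler would of course also need to merge these with 
properties expressed through child elements; the sketch ignores that.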

> Note HTML also has other extension points that are available, including dumping
> data in script elements,
Not a standard approach and not likely to work in a restricted 
whitelisting environment.
> dumping data in class attributes,
Suffers, as schema.org implies at http://schema.org/docs/faq.html#14 , 
from a lack of extensibility/namespacing.
> and mixing XHTML and
> other XML vocabularies in a compound document.
>
Suffers from a lack of support in the HTML serialization and from a lack 
of a uniform means of discoverability.
> Beware that even where a conforming hidden metadata mechanism is provided,
> consumers of such documents may well distrust hidden metadata that is not a
> machine-readable equivalent to visible data. For example, Google say:
>
> "In general, Google won't display content that is not visible to the user. In
> other words, don't show content to users in one way, and use hidden text to
> mark up information separately for search engines and web applications. You
> should mark up the text that actually appears to your users when they visit
> your web pages."
>
> http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=146898
>
I'm skeptical that this would exclude (or need to exclude) 
namespace-aware Microdata searches since the user is clearly seeking 
this information explicitly.

Best wishes,
Brett