[whatwg] Considering a lang- attribute prefix for machine translation and intelligibility

Wed May 2 16:53:25 PDT 2012

On Thu, May 3, 2012 at 2:59 AM, Charles Pritchard <chuck at jumis.com> wrote:
> There has been some discussion on the w3c/whatwg mailing lists about how far
> we can mark up content with linguistic tags, such as marking word and/or
> sentence boundaries.
>
> In my authoring of web apps, I often write a short manual into a hidden div,
> so that the vocabulary of my application can be processed by translation
> services such as Google Translate. Having content in the DOM seems the most
> appropriate way to handle translation.
>
> I'd like the group to consider the costs/benefits/alternatives to a "lang-"
> attribute.
> Such as <span lang-role="sentence">This is a sentence.</span>
>
> The data- and aria- attributes have worked out well. We may want to make
> room for one more.
>
> Such a structure could be used to mark up typical subject/object/verb and
> clause sections; it could also be used to mark up poetic texts as well as
> defined meanings of content.
>
> http://www.omegawiki.org/Expression:orange
> This is an <span lang-meaning="DefinedMeaning:orange_(5821)">orange</span>.
> Now this, this is <span
> lang-meaning="DefinedMeaning:orange_(5822)">orange</span>.
>
> In most cases there's no need to define sentence boundary, meaning or
> otherwise. But, it'd sure be nice to have the ability to do so in a standard
> manner.
>
> I'd recommend role, meaning and prosody/pronunciation as the primary
> targets. Character markup may be something to consider as it's come up in
> SVG (rotate) and in CSS before. Doing a span for each character is not
> practical, so we'd want a shorthand much as SVG has shorthand for rotate.
>
> -Charles

Hi Charles,

In one of my companies, we've successfully used <span>, @class and
@data-xxx attributes to support linguistic markup. See
http://www.eopas.org/transcripts/70 for an example (you will need to
tick a checkbox agreeing to a research license before you can view it).

Here's a markup excerpt:

<div class="051-004_w morphemes tier">
<span>
<table class="word">
<tbody><tr>
<td colspan="1">
<span class="concordance" data-addr="/p4/w1" data-language-code="erk"
data-search="Maarik" data-type="word">
Maarik
</span>
</td></tr><tr>
<td class="morpheme">
<span class="concordance" data-addr="/p4/w1/m1"
data-language-code="erk" data-search="maarik" data-type="morpheme">
maarik
</span>
</td>
</tr>
<tr>
<td class="gloss">mister</td>
</tr>
</tbody></table>
</span>
</div>

It supports multiple levels of linguistic semantic markup:
* phrase
* word
* morpheme
* gloss
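
To give a feel for how a consumer might read these annotations back out,
here is a minimal sketch in Python using only the standard library's
html.parser. It is not the code our site runs; the names EXCERPT,
ConcordanceCollector and collect_concordance are made up for this
illustration, and the sketch assumes annotated spans are never nested
inside one another.

```python
from html.parser import HTMLParser

# Trimmed copy of the excerpt above: only the two annotated spans.
EXCERPT = """
<td colspan="1">
<span class="concordance" data-addr="/p4/w1" data-language-code="erk"
data-search="Maarik" data-type="word">
Maarik
</span>
</td>
<td class="morpheme">
<span class="concordance" data-addr="/p4/w1/m1"
data-language-code="erk" data-search="maarik" data-type="morpheme">
maarik
</span>
</td>
"""


class ConcordanceCollector(HTMLParser):
    """Collects <span class="concordance"> elements and their data-* attributes.

    Hypothetical helper for illustration; assumes concordance spans
    do not contain other spans.
    """

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "concordance" in classes:
            # Keep only the data-* attributes, with the "data-" prefix removed.
            self._current = {
                name[len("data-"):]: value
                for name, value in attrs.items()
                if name.startswith("data-")
            }
            self._current["text"] = ""

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.items.append(self._current)
            self._current = None


def collect_concordance(html):
    """Return one dict per annotated span found in the given markup."""
    parser = ConcordanceCollector()
    parser.feed(html)
    parser.close()
    return parser.items


for item in collect_concordance(EXCERPT):
    print(item["type"], item["addr"], item["text"])
```

Each returned dict carries the level (type), the address within the
transcript (addr), the language code and the search key, which is
roughly the information a translation or concordance tool would need.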

If you wanted to standardise which levels of linguistic data get marked
up, and how, you'd first have to get the linguistic researchers to agree
on the required feature set. Then you could standardise e.g.
data-lang-xxx attributes, or even make up new linguistic-xxx attributes.
http://www.whatwg.org/specs/web-apps/current-work/#extensibility
describes how to do that.
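
As a rough sketch of the data-lang-xxx option, the Python helper below
renders your two examples with data- prefixed attributes instead of a
new lang- prefix. The attribute names data-lang-role and
data-lang-meaning are purely hypothetical (nobody has agreed on them),
and lang_span is a name invented for this illustration.

```python
from html import escape


def lang_span(text, **fields):
    """Wrap text in a <span> carrying hypothetical data-lang-* attributes.

    Keyword names become attribute suffixes, e.g. role="sentence"
    becomes data-lang-role="sentence". Values and text are HTML-escaped.
    """
    attrs = "".join(
        f' data-lang-{name.replace("_", "-")}="{escape(str(value), quote=True)}"'
        for name, value in fields.items()
    )
    return f"<span{attrs}>{escape(text)}</span>"


print(lang_span("This is a sentence.", role="sentence"))
print(lang_span("orange", meaning="DefinedMeaning:orange_(5821)"))
```

Because these are ordinary data- attributes, such markup is already
valid HTML today, so a community convention could be tried out before
anyone proposes a dedicated prefix.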

Hope this helps.

Cheers,
Silvia.