[whatwg] Considering a lang- attribute prefix for machine translation and intelligibility

Benjamin Hawkes-Lewis bhawkeslewis at googlemail.com
Wed May 2 14:29:10 PDT 2012

On Wed, May 2, 2012 at 8:01 PM, Charles Pritchard <chuck at jumis.com> wrote:
>>> 1. New features won't fix Google Translate bugs with existing
>>> features, and it's more efficient for Google to fix Translate than for
>>> the community to design, specify, and implement new features.
> New features do allow services to coalesce around standards. That's what the
> standards are here for.

Existing mechanisms for embedding custom data are (being) standardized
and can make use of standardized vocabularies.

> HTML5 just added a translate attribute.

That doesn't describe a drawback with using the existing mechanisms.

> Span does not in and of itself signify any semantic meaning. Doesn't that mean that Google Translate is operating correctly?


Moving text in or out of an element that "mean[s] something on its
own" (as the spec puts it) has potential to break things. But that's
also true, if less so, for an element that "doesn't mean anything on
its own". There might be code (clientside JS, CSS selectors, XPointer
URIs, automation scripts, whatever) that depends on that text being
inside or outside that element at that position in the DOM.
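
As an illustration of that kind of dependency, here is a minimal sketch
in Python. The markup, the class name "term", and the translated output
are all made up for the example; no real translation service is being
quoted. The script expects a word to sit inside a particular span, and
finds nothing once a translation merges that word into a compound
outside the element.

```python
from html.parser import HTMLParser


class TermCollector(HTMLParser):
    """Collect text found inside <span class="term"> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # > 0 while we are inside a matching span
        self.terms = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Already inside a matching span: track nested elements.
            self._depth += 1
        elif tag == "span" and ("class", "term") in attrs:
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.terms.append(data)


def term_text(markup):
    """Return the text a script would read out of the 'term' span."""
    parser = TermCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.terms)


# Hypothetical source markup.
original = '<p>Buy <span class="term">apple</span> juice.</p>'
# Hypothetical translation: "apple juice" became the compound
# "Apfelsaft", so the span could not be preserved.
translated = '<p>Kaufen Sie Apfelsaft.</p>'

print(repr(term_text(original)))    # 'apple'
print(repr(term_text(translated)))  # ''
```

The same failure would apply to a CSS selector or XPointer URI aimed at
that span; the Python parser only stands in for whatever code depends
on the structure.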

That's not to say that Google Translate is operating incorrectly.
Translation inevitably changes the DOM. Text node contents change of
course. Because different languages may express the same ideas in
different orders, DOM nodes may need to be reordered. Because
different languages have different practices around compounding or
implying ideas with different numbers of words, what might be a
separate word in a separate element in one language might need to be
merged into another word outside the element, or vice versa. It's not
obvious that there is a correct behavior here, and I struggle to see
how the markup examples you proposed would help. (Perhaps you could
elaborate?) Researching and recommending authoring practices that make
translation less likely to break code might be a more immediately
fruitful line of enquiry, and might help inform the ultimate creation
of a vocabulary fit for purpose.

But more importantly, assuming such a vocabulary could be created,
this is not a reason why it could not be embedded using the existing
mechanisms. The HTML specification is not the only source of
standardized vocabulary on the web.

>>> 2, 3, and 4: Given an appropriate vocabulary, existing mechanisms can
>>> encode unambiguous meanings, information about how text should be
>>> spoken, and phrase and sentence boundaries. Unicode describes
>>> character boundaries.
> Boris brought up that the concept of letter could use some attention:
> http://lists.w3.org/Archives/Public/www-style/2011Nov/0055.html

It's not clear to me that Boris has raised something not addressed by
Unicode, but in any case an appropriate vocabulary could be used for
letters too.

> Yes, we have existing XML mechanisms for text should be spoken.
> What existing mechanism do we have for disambiguation?

Any vocabulary you want to use with microdata, microformats, RDFa, etc.

If the vocabulary doesn't exist yet, create it and publish it as a spec.
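
A rough sketch of what that could look like with microdata, again in
Python. The vocabulary URL (https://example.org/vocab/WordSense) and the
property names "surface" and "sense" are invented for illustration and
do not refer to any published vocabulary. The reader below is
deliberately simplistic and is not a conforming microdata parser.

```python
from html.parser import HTMLParser


class MicrodataProps(HTMLParser):
    """Very small reader for a single, flat microdata item."""

    def __init__(self):
        super().__init__()
        self.itemtype = None
        self.props = {}
        self._pending = None  # itemprop name waiting for its text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.itemtype = attrs["itemtype"]
        name = attrs.get("itemprop")
        if name is None:
            return
        if tag == "meta":
            # <meta itemprop> carries its value in the content attribute.
            self.props[name] = attrs.get("content", "")
        else:
            self._pending = name

    def handle_data(self, data):
        if self._pending is not None:
            self.props[self._pending] = data
            self._pending = None

    def handle_endtag(self, tag):
        self._pending = None


def read_item(markup):
    """Return (itemtype, properties) for the item in the markup."""
    parser = MicrodataProps()
    parser.feed(markup)
    parser.close()
    return parser.itemtype, parser.props


# Hypothetical markup disambiguating the word "bank".
markup = (
    '<p>She sat on the '
    '<span itemscope itemtype="https://example.org/vocab/WordSense">'
    '<span itemprop="surface">bank</span>'
    '<meta itemprop="sense" content="riverbank">'
    '</span>.</p>'
)

itemtype, props = read_item(markup)
print(itemtype)
print(props)
```

A consuming service would need to know the vocabulary, but nothing in
HTML itself has to change for the data to be carried.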

>>> 5. Tab isn't talking about "data-" here, but about all the various
>>> mechanisms available to provide custom data for services to consume
>>> (e.g. microdata, microformats, RDFa).
> Tab asked directly why data- does not work

He had two questions:

1. If you're only using the data yourself, why not data-?

2. If you want other people to use the data, why not the other
mechanisms for custom data embedding?

Your 5 points appeared to be in answer to his second question, because
you placed them as a list in response to it.

But never mind.

> Yes, we have a lot of microformats, it's true. And RDFa.
> They don't seem to be taking flight for these issues

I suspect that's because these are new mechanisms, and because markup
is a doomed solution to these problems at web scale.

Anyhow, given you're one of the few people asking to be able to encode
these details in markup, offering lack of usage as a reason for not
being able to use these mechanisms is circular.

> and language translation seems like a high level issue appropriate for HTML.

That's not a reason why you could not use the existing mechanisms.

Aside: just because a problem is important does not mean that
introducing more markup features is an approach that will scale to
solve the problem across the web. More work on NLP would probably be a
better investment in this case.

> Again, look at the translate and lang attributes; those are baked into HTML.

That's not a reason why you could not use the existing mechanisms.

> I am approaching the "lang-" proposal as language agnostic, much as "aria-"
> is language agnostic.
> This seems to be where we are currently:
> <img lang="es" translate="no" alt="No" />
> With alt having ARIA counterparts.
> I'm suggesting a "lang-" with counterparts to translate, language code, and
> a vastly enhanced vocabulary, much as ARIA vastly enhanced the UI
> vocabulary. I think it could help in the long run.

That's just you choosing to use something _other_ than the existing
mechanisms; it's not a reason why you could not use them.

I'm baffled why you think defining an RDF vocabulary then requiring
host languages to closely couple their specs to your spec with a set
of arbitrary and confusing syntactical and behavioural requirements is
preferable to just defining a vocabulary and letting host languages
embed it however they like. I would certainly caution against further
integrations with HTML along the ARIA model, having seen the pain it's

I'd suggest instead that the small number of authors interested in
this markup get together and use and develop vocabularies that can be
embedded in HTML or XML using microdata or RDFa. You will probably
make lots of mistakes and learn a lot along the way. If at the end of
the day, you've got robust vocabularies that solve problems for more
authors and see non-microscopic levels of adoption, then they could
be pulled into the mainstream language just as class="nav" got pulled
in as <nav> and class="datetime" got pulled in as <time>.

Proposing that we conjure such a vocabulary out of the air to solve a
wide set of mostly unanalysed problems in the absence of documented
workarounds and then reify that vocabulary in a load of specific
features seems to me to put the cart way before the horse.

Benjamin Hawkes-Lewis
