[whatwg] Considering a lang- attribute prefix for machine translation and intelligibility

Benjamin Hawkes-Lewis bhawkeslewis at googlemail.com
Wed May 2 11:46:40 PDT 2012

On Wed, May 2, 2012 at 6:59 PM, Charles Pritchard <chuck at jumis.com> wrote:
>> If you do expect that, have you evaluated the existing mechanisms for
>> embedding custom data in the page and found them wanting? If so, how?
> 1. Google Translate gets a little loose with some markup, to the point that
> the translated content may be placed outside the span tag.
> Such as: <div id="one">My potato is <span>hot</span></div>.
> 2. Some words can be ambiguous to the point that even a human reader may not
> know what the meaning is. It'd be great to have a mechanism to disambiguate.
> 3. Speech markup is cool, I like it, but we can have something a little
> lighter or even have some interplay with prosody.
> <span>You say <span>potato</span>, I say <span>potato</span></span>.
> (poteitoe, potahtoe)
> 4. CSS markup has come up a few times for sentence, word and character
> boundaries. Language is not static, it is very much human, and enabling
> humans to mark up their language is what HTML is all about.
> I'll put some effort in later this week to dig up a few threads on the CSS
> requests.
> 5. Services should never touch data-*; I've had to put all my content into
> markup anyway. I've had to add id attributes so I can identify it when it's
> translated by the UA or other service. Since I've done all that work, it'd
> be really nice to have some more options to add in, such as disambiguation,
> part of speech and occasionally, pronunciation and translation suggestions.


I don't get how *any* of these are problems with the "existing
mechanisms for embedding custom data".

1. New features won't fix Google Translate bugs with existing
features, and it's more efficient for Google to fix Translate than for
the community to design, specify, and implement new features.

2, 3, and 4: Given an appropriate vocabulary, existing mechanisms can
encode unambiguous meanings, information about how text should be
spoken, and phrase and sentence boundaries. Unicode describes
character boundaries.

5. Tab isn't talking about "data-*" here, but about all the various
mechanisms available to provide custom data for services to consume
(e.g. microdata, microformats, RDFa).
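
To make points 2-5 concrete, here is a rough sketch of how a service
might read disambiguation and pronunciation hints that an author has
encoded with microdata. Everything specific in it is an assumption made
up for illustration: the vocabulary URL, the property names "surface",
"sense" and "pronunciation", and the helper names ItempropCollector and
extract_props. It only handles one flat item and is not the full
microdata extraction algorithm (no nested items, no itemref, no URL
properties).

```python
from html.parser import HTMLParser

# Hypothetical markup: the vocabulary URL and the property names are
# invented for this example, not taken from any published vocabulary.
MARKUP = (
    '<p itemscope itemtype="http://example.org/vocab/Term">'
    'My <span itemprop="surface">potato</span> is hot'
    '<meta itemprop="sense" content="vegetable">'
    '<meta itemprop="pronunciation" content="poteitoe">'
    '</p>'
)


class ItempropCollector(HTMLParser):
    """Collect itemprop name/value pairs from a single flat item."""

    def __init__(self):
        super().__init__()
        self.props = {}
        self._open = None  # itemprop whose text content is being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        if name is None:
            return
        if tag == "meta":
            # For <meta>, the value lives in the content attribute.
            self.props[name] = attrs.get("content") or ""
        else:
            # For other elements, the value is the element's text.
            self._open = name
            self.props[name] = ""

    def handle_data(self, data):
        if self._open is not None:
            self.props[self._open] += data

    def handle_endtag(self, tag):
        # Flat items only: any end tag closes the property being read.
        self._open = None


def extract_props(markup):
    """Return a dict of itemprop values found in the markup."""
    parser = ItempropCollector()
    parser.feed(markup)
    parser.close()
    return parser.props


if __name__ == "__main__":
    print(extract_props(MARKUP))
```

A translation or speech service could consume the resulting dictionary
without any new attribute prefix; the same information could equally be
carried by microformats or RDFa, given an agreed vocabulary.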

Benjamin Hawkes-Lewis
