[whatwg] Machine translation and related proposals

Mon Jun 11 14:58:49 PDT 2012

On Mon, 26 Mar 2012, Adam Barth wrote:
> On Mon, Mar 26, 2012 at 3:17 PM, Ian Hickson <ian at hixie.ch> wrote:
> > On Mon, 26 Mar 2012, Adam Barth wrote:
> >>
> >> WebKit recently implemented 
> >> http://www.whatwg.org/specs/web-apps/current-work/#attr-translate, 
> >> but that caused us to break orange.fr on mobile:
> >>
> >> https://bugs.webkit.org/show_bug.cgi?id=82246
> >>
> >> The problem is that 
> >> http://www.winktoolkit.org/documentation/symbols/HTMLElement.html#translate 
> >> has the following code:
> >>
> >> if (wink.isUndefined(HTMLElement.prototype.translate))
> >>     HTMLElement.prototype.translate = HTMLElement.prototype.winkTranslate;
> >>
> >> The web site expects HTMLElement.prototype.translate to be Wink's 
> >> translate function rather than the HTML translate attribute.
> >>
> >> Would it make sense to change the name of the translate attribute to 
> >> avoid this conflict?  Should we try to evangelize the Wink Toolkit to 
> >> change their code and everyone who uses Wink to update to the fixed 
> >> version?
> >
> > How widely used is it? (In particular, how widely used is .translate() 
> > rather than .winkTranslate()?)
> 
> The documentation lists only .translate(), not .winkTranslate(), so I 
> would expect most folks using the library to use the former rather than 
> the latter.

On Mon, 26 Mar 2012, Edward O'Connor wrote:
> >
> > It would be unfortunate to have to reserve the use of a name as 
> > generic as "translate" for a particular library.
> 
> Indeed. That said, the name "translate" already means something in the 
> platform -- it's used by CSS transforms and by the <canvas> 2D Context 
> API. Wink's usage of the term matches the existing use of the term on 
> the platform.
> 
> Maybe we should rename the "is this element translatable or not" 
> attribute to, say, "translatable".

On Tue, 27 Mar 2012, jerome.giraud at orange.com wrote:
> 
> We had already planned on finding a replacement for our HTML Element 
> extensions and I think the current discussions will force us to speed 
> things up, which is a good thing :)
> 
> This was a "legacy" feature that we decided we should get rid of a long 
> time ago for these obvious conflict reasons, though we had never 
> imagined that "translate" would be used on HTMLElements in an i18n 
> context (so +1 for Edward O'Connor's comment if I may)
> 
> I already warned the persons in charge of the Orange portal and they 
> will replace the HTMLElement.translate calls. We will warn our users and 
> prepare the necessary changes for our next release.

Since you are on top of this I have not changed the attribute name in the 
spec. Please do let me know if this ends up being a less tractable problem 
than it currently appears.
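
For anyone following along, the failure mode can be reproduced outside a 
browser. What follows is only a minimal sketch, not Wink's actual source: 
FakeElement, isUndefined, installShim and tryTranslate are hypothetical 
stand-ins, and the native attribute is modelled as a plain boolean on the 
prototype purely for illustration. How a real engine exposes the property 
may differ, but the effect on the guard quoted above is the same: once 
the platform defines "translate", the library's function is no longer 
what callers get.

```javascript
// Hypothetical stand-in for wink.isUndefined.
function isUndefined(value) {
  return typeof value === "undefined";
}

// Build a fresh element-like class. When `nativeTranslate` is true, the
// simulated "browser" already exposes a boolean `translate` property.
function makeElementClass(nativeTranslate) {
  class FakeElement {}
  // Stand-in for the library's real implementation.
  FakeElement.prototype.winkTranslate = function (x, y) {
    return "moved to " + x + "," + y;
  };
  if (nativeTranslate) {
    FakeElement.prototype.translate = true;
  }
  return FakeElement;
}

// Same shape as the guard quoted earlier in this thread: only install
// the library function if nothing called `translate` exists yet.
function installShim(ElementClass) {
  if (isUndefined(ElementClass.prototype.translate)) {
    ElementClass.prototype.translate = ElementClass.prototype.winkTranslate;
  }
}

// Report what a caller of el.translate(x, y) would run into.
function tryTranslate(nativeTranslate) {
  const ElementClass = makeElementClass(nativeTranslate);
  installShim(ElementClass);
  const el = new ElementClass();
  return typeof el.translate === "function"
    ? el.translate(10, 20)
    : "translate is a " + typeof el.translate + ", not a function";
}

console.log(tryTranslate(false)); // shim installed, call succeeds
console.log(tryTranslate(true));  // guard skipped, call would throw
```

A library that always calls its own prefixed name (winkTranslate here) 
avoids depending on whether the unprefixed name is free.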

On Wed, 2 May 2012, Charles Pritchard wrote:
>
> There has been some discussion on the w3c/whatwg mailing lists about how 
> far we can mark up content with linguistic tags, such as marking word 
> and/or sentence boundaries.
> 
> In my authoring of web apps, I often write a short manual into a hidden 
> div, so that the vocabulary of my application can be processed by 
> translation services such as Google translate. Having content in the DOM 
> seems the most appropriate way to handle translation.
> 
> I'd like the group to consider the costs/benefits/alternatives to a 
> "lang-" attribute.
> Such as <span lang-role="sentence">This is a sentence.</span>
> 
> The data- and aria- attributes have worked out well. We may want to make 
> room for one more.
> 
> Such a structure could be used to markup typical subject/object/verb and 
> clause sections; it could also be used to markup poetic texts as well as 
> defined meanings of content.
> 
> http://www.omegawiki.org/Expression:orange
> This is an <span lang-meaning="DefinedMeaning:orange_(5821)">orange</span>.
> Now this, this is <span
> lang-meaning="DefinedMeaning:orange_(5822)">orange</span>.
> 
> In most cases there's no need to define sentence boundary, meaning or 
> otherwise. But, it'd sure be nice to have the ability to do so in a 
> standard manner.
> 
> I'd recommend role, meaning and prosody/pronunciation as the primary 
> targets. Character markup may be something to consider as it's come up 
> in SVG (rotate) and in CSS before. Doing a span for each character is 
> not practical, so we'd want a shorthand much as SVG has shorthand for 
> rotate.

On Wed, 2 May 2012, Tab Atkins Jr. wrote:
> 
> Do you expect outside services to do anything useful with this 
> information?  If not, the data-* attributes seem appropriate.
> 
> If you do expect that, have you evaluated the existing mechanisms for 
> embedding custom data in the page and found them wanting? If so, how?

On Wed, 2 May 2012, Charles Pritchard wrote:
> 
> Yes, that's the primary reason. "services such as Google translate".
> 
> 1. Google translate gets a little loose with some markup, to where the 
> translated content may be placed outside the span tag.
> 
> Such as: <div id="one">My potato is <span>hot</span></div>.
> 
> 2. Some words can be ambiguous to the point that even a human reader may 
> not know what the meaning is. It'd be great to have a mechanism to 
> disambiguate.
> 
> 3. Speech markup is cool, I like it, but we can have something a little 
> lighter or even have some interplay with prosody.
> <span>You say <span>potato</span>, I say <span>potato</span></span>.
> (poteitoe, potahtoe)
> 
> 4. CSS markup has come up a few times for sentence, word and character 
> boundaries. Language is not static, it is very much human, and enabling 
> humans to markup their language is what HTML is all about.
> 
> I'll put some effort in later this week to dig up a few threads on the 
> CSS requests.
> 
> 5. Services should never touch data-*; I've had to put all my content 
> into markup anyway. I've had to add id attributes so I can identify it 
> when it's translated by the UA or other service. Since I've done all 
> that work, it'd be really nice to have some more options to add in, such 
> as disambiguation, part of speech and occasionally, pronunciation and 
> translation suggestions.

On Wed, 2 May 2012, Benjamin Hawkes-Lewis wrote:
> 
> I don't get how *any* of these are problems with the "existing 
> mechanisms for embedding custom data".
> 
> 1. New features won't fix Google Translate bugs with existing features, 
> and it's more efficient for Google to fix Translate than for the 
> community to design, specify, and implement new features.
> 
> 2, 3, and 4: Given an appropriate vocabulary, existing mechanisms can 
> encode unambiguous meanings, information about how text should be 
> spoken, and phrase and sentence boundaries. Unicode describes character 
> boundaries.
> 
> 5. Tab isn't talking about "data-" here, but about all the various 
> mechanisms available to provide custom data for services to consume 
> (e.g. microdata, microformats, RDFa).

On Wed, 2 May 2012, Charles Pritchard wrote:
> 
> New features do allow services to coalesce around standards. That's what 
> the standards are here for. HTML5 just added a translate attribute.
> 
> Span does not in and of itself signify any semantic meaning. Doesn't 
> that mean that Google Translate is operating correctly?
> 
> [...]
> 
> Boris brought up that the concept of letter could use some attention: 
> http://lists.w3.org/Archives/Public/www-style/2011Nov/0055.html
> 
> Yes, we have existing XML mechanisms for how text should be spoken.
> 
> What existing mechanism do we have for disambiguation?
> 
> [...]
> 
> Tab asked directly why data- does not work.
> 
> Yes, we have a lot of microformats, it's true. And RDFa.
> 
> They don't seem to be taking flight for these issues, and language 
> translation seems like a high level issue appropriate for HTML. Again, 
> look at the translate and lang attributes; those are baked into HTML.
> 
> I am approaching the "lang-" proposal as language agnostic, much as 
> "aria-" is language agnostic.
> 
> This seems to be where we are currently:
> <img lang="es" translate="no" alt="No" />
> 
> With alt having ARIA counterparts.
> 
> I'm suggesting a "lang-" with counterparts to translate, language code, 
> and a vastly enhanced vocabulary, much as ARIA vastly enhanced the UI 
> vocabulary. I think it could help in the long run.

On Wed, 2 May 2012, Benjamin Hawkes-Lewis wrote:
> [...]
> 
> Moving text in or out of an element that "mean[s] something on its own" 
> (as the spec puts it) has potential to break things. But that's also 
> true, if less so, for an element that "doesn't mean anything on its 
> own". There might be code (clientside JS, CSS selectors, XPointer URIs, 
> automation scripts, whatever) that depends on that text being inside or 
> outside that element at that position in the DOM.
> 
> That's not to say that Google Translate is operating incorrectly. 
> Translation inevitably changes the DOM. Text node contents change of 
> course. Because different languages may express the same ideas in 
> different orders, DOM nodes may need to be reordered. Because different 
> languages have different practices around compounding or implying ideas 
> with different numbers of words, what might be a separate word in a 
> separate element in one language might need to be merged into another 
> word outside the element, or vice versa. It's not obvious that there is 
> a correct behavior here, and I struggle to see how the markup examples 
> you proposed would help. (Perhaps you could elaborate?) Researching and 
> recommending authoring practices that make translation less likely to 
> break code might be a more immediately fruitful line of enquiry, and 
> might help inform the ultimate creation of a vocabulary fit for purpose.
> 
> But more importantly, assuming such a vocabulary could be created, this 
> is not a reason why it could not be embedded using the existing 
> mechanisms. The HTML specification is not the only source of 
> standardized vocabulary on the web.
>
> [...]
> 
> 1. If you're only using the data yourself, why not data-?
> 
> 2. If you want other people to use the data, why not the other 
> mechanisms for custom data embedding?
> 
> Your 5 points appeared to be in answer to his second question, because 
> you placed them as a list in response to it.
>
> [...]
> 
> That's just you choosing to use something _other_ than the existing 
> mechanisms; it's not a reason why you could not use them.
> 
> I'm baffled why you think defining an RDF vocabulary then requiring host 
> languages to closely couple their specs to your spec with a set of 
> arbitrary and confusing syntactical and behavioural requirements is 
> preferable to just defining a vocabulary and letting host languages 
> embed it however they like. I would certainly caution against further 
> integrations with HTML along the ARIA model, having seen the pain it's 
> caused.
> 
> I'd suggest instead that the small number of authors interested in this 
> markup get together and use and develop vocabularies that can be 
> embedded in HTML or XML using microdata or RDFa. You will probably make 
> lots of mistakes and learn a lot along the way. If at the end of the 
> day, you've got robust vocabularies that solve problems for more authors 
> and see non-microscopic levels of adoption, then they could be pulled 
> into the mainstream language just as class="nav" got pulled in as <nav> 
> and class="datetime" got pulled in as <time>.
> 
> Proposing that we conjure such a vocabulary out of the air to solve a 
> wide set of mostly unanalysed problems in the absence of documented 
> workarounds and then reify that vocabulary in a load of specific 
> features seems to me to put the cart way before the horse.

On Thu, 3 May 2012, Silvia Pfeiffer wrote:
> 
> In one of my companies, we've successfully used <span>, @class and 
> @data-xxx attributes to support linguistic markup. See 
> http://www.eopas.org/transcripts/70 for an example (you will need to 
> agree to a research license checkbox to link through).
> 
> Here's a markup excerpt:
> 
> <div class="051-004_w morphemes tier">
> <span>
> <table class="word">
> <tbody><tr>
> <td colspan="1">
> <span class="concordance" data-addr="/p4/w1" data-language-code="erk"
> data-search="Maarik" data-type="word">
> Maarik
> </span>
> </td></tr><tr>
> <td class="morpheme">
> <span class="concordance" data-addr="/p4/w1/m1"
> data-language-code="erk" data-search="maarik" data-type="morpheme">
> maarik
> </span>
> </td>
> </tr>
> <tr>
> <td class="gloss">mister</td>
> </tr>
> </tbody></table>
> </span>
> 
> It supports multiple levels of linguistic semantic markup:
> * phrase
> * word
> * morpheme
> * gloss
> 
> If you wanted to make a standard for what levels should be marked up in 
> which way for linguistic data, you'd first have to get the linguistic 
> researchers to agree on the required feature-set. Then you could 
> standardise e.g. data-lang-xxx attributes - or even make up new 
> linguistic-xxx attributes. 
> http://www.whatwg.org/specs/web-apps/current-work/#extensibility 
> describes how to do that.

Given the existence of solutions that can address this already, I haven't 
added anything to the spec to support it. If it turns out that a lot of 
people do this, then it would make sense to examine whether we should have 
dedicated markup for it.
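
As a rough illustration of what the existing mechanism already gives 
script authors, the sketch below maps data-* attribute names like those 
in the excerpt above to the keys that element.dataset exposes, then 
groups the entries by their data-type value. The helpers datasetKey, 
toDataset and groupByType are hypothetical names, and plain objects stand 
in for DOM elements so that the code runs outside a browser; it is not 
taken from the site mentioned above.

```javascript
// Convert an attribute name such as "data-language-code" to the key
// used by element.dataset ("languageCode"): drop the "data-" prefix and
// turn each hyphen followed by a lowercase letter into that letter,
// uppercased.
function datasetKey(attrName) {
  return attrName
    .slice("data-".length)
    .replace(/-([a-z])/g, (_, letter) => letter.toUpperCase());
}

// Collect only the data-* attributes of one element-like object.
function toDataset(attrs) {
  const dataset = {};
  for (const [name, value] of Object.entries(attrs)) {
    if (name.startsWith("data-")) {
      dataset[datasetKey(name)] = value;
    }
  }
  return dataset;
}

// Group the data-search values by linguistic level (data-type).
function groupByType(elements) {
  const groups = {};
  for (const attrs of elements) {
    const data = toDataset(attrs);
    if (!data.type) continue; // skip elements without a level
    (groups[data.type] = groups[data.type] || []).push(data.search);
  }
  return groups;
}

// Attribute sets copied from the two spans in the excerpt above.
const spans = [
  { class: "concordance", "data-addr": "/p4/w1",
    "data-language-code": "erk", "data-search": "Maarik",
    "data-type": "word" },
  { class: "concordance", "data-addr": "/p4/w1/m1",
    "data-language-code": "erk", "data-search": "maarik",
    "data-type": "morpheme" },
];

console.log(datasetKey("data-language-code")); // "languageCode"
console.log(groupByType(spans));
```

In a browser the same grouping could read element.dataset directly, which 
is part of why dedicated attributes do not seem necessary yet.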

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'