[whatwg] Machine translation and related proposals
Ian Hickson
ian at hixie.ch
Mon Jun 11 14:58:49 PDT 2012
On Mon, 26 Mar 2012, Adam Barth wrote:
> On Mon, Mar 26, 2012 at 3:17 PM, Ian Hickson <ian at hixie.ch> wrote:
> > On Mon, 26 Mar 2012, Adam Barth wrote:
> >>
> >> WebKit recently implemented
> >> http://www.whatwg.org/specs/web-apps/current-work/#attr-translate,
> >> but that caused us to break orange.fr on mobile:
> >>
> >> https://bugs.webkit.org/show_bug.cgi?id=82246
> >>
> >> The problem is that
> >> http://www.winktoolkit.org/documentation/symbols/HTMLElement.html#translate
> >> has the following code:
> >>
> >> if (wink.isUndefined(HTMLElement.prototype.translate))
> >>   HTMLElement.prototype.translate = HTMLElement.prototype.winkTranslate;
> >>
> >> The web site expects HTMLElement.prototype.translate to be Wink's
> >> translate function rather than the HTML translate attribute.
> >>
> >> Would it make sense to change the name of the translate attribute to
> >> avoid this conflict? Should we try to evangelize the Wink Toolkit to
> >> change their code and everyone who uses Wink to update to the fixed
> >> version?
> >
> > How widely used is it? (In particular, how widely used is .translate()
> > rather than .winkTranslate()?)
>
> The documentation lists only .translate(), not .winkTranslate(), so I
> would expect most folks using the library to use the former rather than
> the latter.
On Mon, 26 Mar 2012, Edward O'Connor wrote:
> >
> > It would be unfortunate to have to reserve the use of a name as
> > generic as "translate" for a particular library.
>
> Indeed. That said, the name "translate" already means something in the
> platform; it's used by CSS transforms and by the <canvas> 2D Context
> API. Wink's usage of the term matches the existing use of the term on
> the platform.
>
> Maybe we should rename the "is this element translatable or not"
> attribute to, say, "translatable".
On Tue, 27 Mar 2012, jerome.giraud at orange.com wrote:
>
> We had already planned on finding a replacement for our HTML Element
> extensions and I think the current discussions will force us to speed
> things up, which is a good thing :)
>
> This was a "legacy" feature that we decided we should get rid of a long
> time ago because of these obvious conflicts, though we had never
> imagined that "translate" would be used on HTMLElements in an i18n
> context (so +1 for Edward O'Connor's comment, if I may).
>
> I already warned the persons in charge of the Orange portal and they
> will replace the HTMLElement.translate calls. We will warn our users and
> prepare the necessary changes for our next release.
Since you are on top of this I have not changed the attribute name in the
spec. Please do let me know if this ends up being a less tractable problem
than it currently appears.
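
For readers following along, here is a minimal sketch of why the guard
quoted above stops working once the platform defines "translate". It
assumes that wink.isUndefined behaves like a plain typeof check (not
verified against the Wink source), and FakeElement and NewElement are
hypothetical stand-ins for HTMLElement so that it runs outside a browser.

```javascript
// Hypothetical stand-in for HTMLElement before the attribute existed.
function FakeElement() {}
FakeElement.prototype.winkTranslate = function (x, y) {
  return "moved to " + x + "," + y;
};

// The library's guard: install the method only if the name is unused.
// Assumption: wink.isUndefined is equivalent to this typeof test.
function installTranslate(proto) {
  if (typeof proto.translate === "undefined") {
    proto.translate = proto.winkTranslate;
  }
}

// Old platform: the name is free, so the library method is installed.
installTranslate(FakeElement.prototype);
const before = typeof FakeElement.prototype.translate; // "function"

// New platform: a boolean "translate" property is already defined.
function NewElement() {}
NewElement.prototype.winkTranslate = FakeElement.prototype.winkTranslate;
Object.defineProperty(NewElement.prototype, "translate", {
  get() { return true; },
  set(value) {},
  configurable: true,
});

// The guard now sees a defined value and silently skips the install.
installTranslate(NewElement.prototype);
const after = typeof NewElement.prototype.translate; // "boolean"
```

With the second setup, a page calling element.translate(10, 20) would be
calling a boolean, which throws a TypeError.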
On Wed, 2 May 2012, Charles Pritchard wrote:
>
> There has been some discussion on the w3c/whatwg mailing lists about how
> far we can mark up content with linguistic tags, such as marking word
> and/or sentence boundaries.
>
> In my authoring of web apps, I often write a short manual into a hidden
> div, so that the vocabulary of my application can be processed by
> translation services such as Google translate. Having content in the DOM
> seems the most appropriate way to handle translation.
>
> I'd like the group to consider the costs/benefits/alternatives to a
> "lang-" attribute.
> Such as <span lang-role="sentence">This is a sentence.</span>
>
> The data- and aria- attributes have worked out well. We may want to make
> room for one more.
>
> Such a structure could be used to mark up typical subject/object/verb and
> clause sections; it could also be used to mark up poetic texts as well as
> defined meanings of content.
>
> http://www.omegawiki.org/Expression:orange
> This is an <span lang-meaning="DefinedMeaning:orange_(5821)">orange</span>.
> Now this, this is <span
> lang-meaning="DefinedMeaning:orange_(5822)">orange</span>.
>
> In most cases there's no need to define sentence boundary, meaning or
> otherwise. But, it'd sure be nice to have the ability to do so in a
> standard manner.
>
> I'd recommend role, meaning and prosody/pronunciation as the primary
> targets. Character markup may be something to consider as it's come up
> in SVG (rotate) and in CSS before. Doing a span for each character is
> not practical, so we'd want a shorthand much as SVG has shorthand for
> rotate.
On Wed, 2 May 2012, Tab Atkins Jr. wrote:
>
> Do you expect outside services to do anything useful with this
> information? If not, the data-* attributes seem appropriate.
>
> If you do expect that, have you evaluated the existing mechanisms for
> embedding custom data in the page and found them wanting? If so, how?
On Wed, 2 May 2012, Charles Pritchard wrote:
>
> Yes, that's the primary reason. "services such as Google translate".
>
> 1. Google translate gets a little loose with some markup, to where the
> translated content may be placed outside the span tag.
>
> Such as: <div id="one">My potato is <span>hot</span></div>.
>
> 2. Some words can be ambiguous to the point that even a human reader may
> not know what the meaning is. It'd be great to have a mechanism to
> disambiguate.
>
> 3. Speech markup is cool, I like it, but we can have something a little
> lighter or even have some interplay with prosody.
> <span>You say <span>potato</span>, I say <span>potato</span></span>.
> (poteitoe, potahtoe)
>
> 4. CSS markup has come up a few times for sentence, word and character
> boundaries. Language is not static, it is very much human, and enabling
> humans to mark up their language is what HTML is all about.
>
> I'll put some effort in later this week to dig up a few threads on the
> CSS requests.
>
> 5. Services should never touch data-*; I've had to put all my content
> into markup anyway. I've had to add id attributes so I can identify it
> when it's translated by the UA or other service. Since I've done all
> that work, it'd be really nice to have some more options to add in, such
> as disambiguation, part of speech and occasionally, pronunciation and
> translation suggestions.
On Wed, 2 May 2012, Benjamin Hawkes-Lewis wrote:
>
> I don't get how *any* of these are problems with the "existing
> mechanisms for embedding custom data".
>
> 1. New features won't fix Google Translate bugs with existing features,
> and it's more efficient for Google to fix Translate than for the
> community to design, specify, and implement new features.
>
> 2, 3, and 4: Given an appropriate vocabulary, existing mechanisms can
> encode unambiguous meanings, information about how text should be
> spoken, and phrase and sentence boundaries. Unicode describes character
> boundaries.
>
> 5. Tab isn't talking about "data-" here, but about all the various
> mechanisms available to provide custom data for services to consume
> (e.g. microdata, microformats, RDFa).
On Wed, 2 May 2012, Charles Pritchard wrote:
>
> New features do allow services to coalesce around standards. That's what
> the standards are here for. HTML5 just added a translate attribute.
>
> Span does not in and of itself signify any semantic meaning. Doesn't
> that mean that Google Translate is operating correctly?
>
> [...]
>
> Boris brought up that the concept of letter could use some attention:
> http://lists.w3.org/Archives/Public/www-style/2011Nov/0055.html
>
> Yes, we have existing XML mechanisms for how text should be spoken.
>
> What existing mechanism do we have for disambiguation?
>
> [...]
>
> Tab asked directly why data- does not work.
>
> Yes, we have a lot of microformats, it's true. And RDFa.
>
> They don't seem to be taking flight for these issues, and language
> translation seems like a high level issue appropriate for HTML. Again,
> look at the translate and lang attributes; those are baked into HTML.
>
> I am approaching the "lang-" proposal as language agnostic, much as
> "aria-" is language agnostic.
>
> This seems to be where we are currently:
> <img lang="es" translate="no" alt="No" />
>
> With alt having ARIA counterparts.
>
> I'm suggesting a "lang-" with counterparts to translate, language code,
> and a vastly enhanced vocabulary, much as ARIA vastly enhanced the UI
> vocabulary. I think it could help in the long run.
On Wed, 2 May 2012, Benjamin Hawkes-Lewis wrote:
> [...]
>
> Moving text in or out of an element that "mean[s] something on its own"
> (as the spec puts it) has potential to break things. But that's also
> true, if less so, for an element that "doesn't mean anything on its
> own". There might be code (clientside JS, CSS selectors, XPointer URIs,
> automation scripts, whatever) that depends on that text being inside or
> outside that element at that position in the DOM.
>
> That's not to say that Google Translate is operating incorrectly.
> Translation inevitably changes the DOM. Text node contents change of
> course. Because different languages may express the same ideas in
> different orders, DOM nodes may need to be reordered. Because different
> languages have different practices around compounding or implying ideas
> with different numbers of words, what might be a separate word in a
> separate element in one language might need to be merged into another
> word outside the element, or vice versa. It's not obvious that there is
> a correct behavior here, and I struggle to see how the markup examples
> you proposed would help. (Perhaps you could elaborate?) Researching and
> recommending authoring practices that make translation less likely to
> break code might be a more immediately fruitful line of enquiry, and
> might help inform the ultimate creation of a vocabulary fit for purpose.
>
> But more importantly, assuming such a vocabulary could be created, this
> is not a reason why it could not be embedded using the existing
> mechanisms. The HTML specification is not the only source of
> standardized vocabulary on the web.
>
> [...]
>
> 1. If you're only using the data yourself, why not data-?
>
> 2. If you want other people to use the data, why not the other
> mechanisms for custom data embedding?
>
> Your 5 points appeared to be in answer to his second question, because
> you placed them as a list in response to it.
>
> [...]
>
> That's just you choosing to use something _other_ than the existing
> mechanisms; it's not a reason why you could not use them.
>
> I'm baffled why you think defining an RDF vocabulary then requiring host
> languages to closely couple their specs to your spec with a set of
> arbitrary and confusing syntactical and behavioural requirements is
> preferable to just defining a vocabulary and letting host languages
> embed it however they like. I would certainly caution against further
> integrations with HTML along the ARIA model, having seen the pain it's
> caused.
>
> I'd suggest instead that the small number of authors interested in this
> markup get together and use and develop vocabularies that can be
> embedded in HTML or XML using microdata or RDFa. You will probably make
> lots of mistakes and learn a lot along the way. If at the end of the
> day, you've got robust vocabularies that solve problems for more authors
> and see non-microscopic levels of adoption, then they could be pulled
> into the mainstream language just as class="nav" got pulled in as <nav>
> and class="datetime" got pulled in as <time>.
>
> Proposing that we conjure such a vocabulary out of the air to solve a
> wide set of mostly unanalysed problems in the absence of documented
> workarounds and then reify that vocabulary in a load of specific
> features seems to me to put the cart way before the horse.
On Thu, 3 May 2012, Silvia Pfeiffer wrote:
>
> In one of my companies, we've successfully used <span>, @class and
> @data-xxx attributes to support linguistic markup. See
> http://www.eopas.org/transcripts/70 for an example (you will need to
> agree to a research license checkbox to link through).
>
> Here's a markup excerpt:
>
> <div class="051-004_w morphemes tier">
> <span>
> <table class="word">
> <tbody><tr>
> <td colspan="1">
> <span class="concordance" data-addr="/p4/w1" data-language-code="erk"
> data-search="Maarik" data-type="word">
> Maarik
> </span>
> </td></tr><tr>
> <td class="morpheme">
> <span class="concordance" data-addr="/p4/w1/m1"
> data-language-code="erk" data-search="maarik" data-type="morpheme">
> maarik
> </span>
> </td>
> </tr>
> <tr>
> <td class="gloss">mister</td>
> </tr>
> </tbody></table>
> </span>
>
> It supports multiple levels of linguistic semantic markup:
> * phrase
> * word
> * morpheme
> * gloss
>
> If you wanted to make a standard for what levels should be marked up in
> which way for linguistic data, you'd first have to get the linguistic
> researchers to agree on the required feature-set. Then you could
> standardise e.g. data-lang-xxx attributes - or even make up new
> linguistic-xxx attributes.
> http://www.whatwg.org/specs/web-apps/current-work/#extensibility
> describes how to do that.
Given the existence of solutions that can address this already, I haven't
added anything to the spec to support it. If it turns out that a lot of
people do this, then it would make sense to examine whether we should have
dedicated markup for it.
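
As a rough illustration of the data-* approach in the excerpt above, a
script could read those attributes through the standard dataset name
mapping, where data-language-code becomes languageCode. The sketch below
reimplements that mapping over plain objects so that it runs without a
DOM. The attribute values are copied from the excerpt; toDatasetKey and
byType are hypothetical names used only for illustration.

```javascript
// Map an attribute name such as "data-language-code" to its dataset key.
function toDatasetKey(attrName) {
  return attrName
    .replace(/^data-/, "")
    .replace(/-([a-z])/g, (_, letter) => letter.toUpperCase());
}

// Attributes copied from the two "concordance" spans in the excerpt.
const spans = [
  {
    "data-addr": "/p4/w1",
    "data-language-code": "erk",
    "data-search": "Maarik",
    "data-type": "word",
  },
  {
    "data-addr": "/p4/w1/m1",
    "data-language-code": "erk",
    "data-search": "maarik",
    "data-type": "morpheme",
  },
];

// Build dataset-like objects and index them by linguistic level.
const byType = {};
for (const attrs of spans) {
  const dataset = {};
  for (const [name, value] of Object.entries(attrs)) {
    dataset[toDatasetKey(name)] = value;
  }
  byType[dataset.type] = dataset;
}
```

In a browser the same keys would be available directly as
element.dataset.languageCode and so on, with no helper needed.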
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'