[whatwg] Web API for speech recognition and synthesis

Wed Dec 16 07:17:01 PST 2009

(resending to include the whatwg list, sorry for multiple postings)
Hi Olli,
Thank you for bringing this interesting thread to the Multimodal
Interaction Working Group's attention.
The working group is in fact very active. Although it is chartered as 
W3C Member-only, we do have a public mailing list, www-multimodal at w3.org, 
available for public discussions. 

In general, we would be very interested in hearing about the kinds of use 
cases for speech recognition and TTS in a browser context that you would 
like to handle. The Multimodal Architecture is our primary draft spec 
that addresses using speech in web pages (although it also addresses 
other modes of input, such as handwriting). A new Working Draft has just 
been published and we would be very interested 
in getting feedback on it. In my opinion, it's probably focused more on 
distributed architectures than on the use cases you might be interested 
in, but we would like our specs to be comprehensive enough to be able to 
address both server-based and client-based speech processing. 

We would also be interested in general discussions of questions about
multimodality. 

Here are some pointers that may be useful.
MMI page: http://www.w3.org/2002/mmi/
MMI Architecture spec: http://www.w3.org/TR/2009/WD-mmi-arch-20091201/

best regards,

Debbie Dahl, MMI Working Group Chair

> -----Original Message-----
> From: Olli Pettay [mailto:Olli.Pettay at helsinki.fi] 
> Sent: Friday, December 11, 2009 4:14 PM
> To: Bjorn Bringert
> Cc: Olli at pettay.fi; Dave Burke; João Eiras; whatwg; David 
> Singleton; Gudmundur Hafsteinsson; westonruter at gmail.com; 
> www-multimodal at w3.org; Deborah Dahl
> Subject: Re: [whatwg] Web API for speech recognition and synthesis
> 
> On 12/11/09 6:05 AM, Bjorn Bringert wrote:
> > Thanks for the discussion - cool to see more interest today also
> > (http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html)
> >
> > I've hacked up a proof-of-concept JavaScript API for speech
> > recognition and synthesis. It adds a navigator.speech object with
> > these functions:
> >
> > void listen(ListenCallback callback, ListenOptions options);
> > void speak(DOMString text, SpeakCallback callback, SpeakOptions options);
> 
> 
> So if I read the examples correctly, you're not using grammars anywhere.
> I wonder how well that works in real-world cases. Of course, if
> the speech recognizer can handle everything well without grammars, the
> result validation could be done in JS after the result is returned by the
> recognizer. But I think having support for grammars simplifies coding
> and can make speech dialogs somewhat more manageable.
> 
> W3C has already standardized things like
> http://www.w3.org/TR/speech-grammar/ and
> http://www.w3.org/TR/semantic-interpretation/
> and the latter one works quite nicely with JS.
> 
> Again, I think this kind of discussion should happen in W3C multimodal
> WG. Though, I'm not sure how actively or how openly that working group
> works atm.
> 
> -Olli
> 
> 
> >
> > The implementation uses an NPAPI plugin for the Android browser that
> > wraps the existing Android speech APIs. The code is available at
> > http://code.google.com/p/speech-api-browser-plugin/
> >
> > There are some simple demo apps in
> > http://code.google.com/p/speech-api-browser-plugin/source/browse/trunk/android-plugin/demos/
> > including:
> >
> > - English to Spanish speech-to-speech translation
> > - Google search by speaking a query
> > - The obligatory pizza ordering system
> > - A phone number dialer
> >
> > Comments appreciated!
> >
> > /Bjorn
> >
> > On Fri, Dec 4, 2009 at 2:51 PM, Olli Pettay <Olli.Pettay at helsinki.fi> wrote:
> >> Indeed the API should be something significantly simpler than X+V.
> >> Microsoft has (had?) support for SALT. That API is pretty simple and
> >> provides speech recognition and TTS.
> >> The API could probably be even simpler than SALT.
> >> IIRC, there was an extension for Firefox to support SALT (well, there
> >> was also an extension to support X+V).
> >>
> >> If the platform/OS provides ASR and TTS, adding a JS API for it should
> >> be pretty simple. X+V tries to handle some logic using VoiceXML FIA, but
> >> I think it would be more web-like to give a pure JS API (similar to SALT).
> >> Integrating visual and voice input could be done in scripts. I'd assume
> >> there would be some script libraries to handle multimodal input
> >> integration, especially if there will be touch and gesture events
> >> too. (Classic multimodal map applications will become possible on the web.)
> >>
> >> But all of this is something that should possibly be designed in or with
> >> the W3C multimodal working group. I know their current architecture is way
> >> more complex, but X+V, SALT and even Multimodal-CSS have been discussed in
> >> that working group.
> >>
> >>
> >> -Olli
> >>
> >>
> >>
> >> On 12/3/09 2:50 AM, Dave Burke wrote:
> >>>
> >>> We're envisaging a simpler programmatic API that looks familiar to the
> >>> modern Web developer but one which avoids the legacy of dialog system
> >>> languages.
> >>>
> >>> Dave
> >>>
> >>> On Wed, Dec 2, 2009 at 7:25 PM, João Eiras <joaoe at opera.com> wrote:
> >>>
> >>>     On Wed, 02 Dec 2009 12:32:07 +0100, Bjorn Bringert
> >>>     <bringert at google.com> wrote:
> >>>
> >>>         We've been watching our colleagues build native apps that use
> >>>         speech recognition and speech synthesis, and would like to have
> >>>         JavaScript APIs that let us do the same in web apps. We are
> >>>         thinking about creating a lightweight and
> >>>         implementation-independent API that lets web apps use speech
> >>>         services. Is anyone else interested in that?
> >>>
> >>>         Bjorn Bringert, David Singleton, Gummi Hafsteinsson
> >>>
> >>>
> >>>     This exists already, but only Opera supports it, although there are
> >>>     problems with the library we use for speech recognition.
> >>>
> >>>     http://www.w3.org/TR/xhtml+voice/
> >>>
> >>>     http://dev.opera.com/articles/view/add-voice-interactivity-to-your-site/
> >>>
> >>>     Would be nice to revive that specification and get vendor buy-in.
> >>>
> >>>
> >>>
> >>>     --
> >>>
> >>>     João Eiras
> >>>     Core Developer, Opera Software ASA, http://www.opera.com/
> >>>
> >>>
> >>
> >>
> >
> >
> >
> 
>
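
To make the quoted discussion a little more concrete, here is a rough sketch
of how a page might call the two quoted functions, listen() and speak(), and
then validate the recognized text in script without a grammar, along the
lines Olli describes. Only the two function signatures come from the thread.
Everything else is an assumption made for illustration: the mock object that
stands in for navigator.speech, the shape of the callback arguments
(result.utterance), the "language" option, and the helper names
createMockSpeech, matchPhrase and orderPizza. It is not the plugin's actual
API.

```javascript
// Hypothetical stand-in for the proposed navigator.speech object, so the
// sketch runs without the plugin. The callback argument shapes and the
// "language" option are assumptions, not taken from the proposal.
function createMockSpeech(cannedUtterance) {
  const spoken = [];
  return {
    spoken,
    // Mirrors: void listen(ListenCallback callback, ListenOptions options);
    listen(callback, options) {
      callback({ utterance: cannedUtterance, language: options.language });
    },
    // Mirrors: void speak(DOMString text, SpeakCallback callback,
    //                     SpeakOptions options);
    speak(text, callback, options) {
      spoken.push(text); // record the text instead of synthesizing audio
      callback();
    },
  };
}

// Grammar-less validation done in script: normalize the recognized text and
// accept it only if it matches one of the phrases the page expects.
function matchPhrase(utterance, allowedPhrases) {
  const normalized = utterance.trim().toLowerCase().replace(/\s+/g, " ");
  return allowedPhrases.includes(normalized) ? normalized : null;
}

// One dialog turn: listen, validate the result, then speak a confirmation
// or an error prompt.
function orderPizza(speech, sizes, onDone) {
  speech.listen((result) => {
    const size = matchPhrase(result.utterance, sizes);
    const reply = size === null
      ? "Sorry, I did not understand the size."
      : `One ${size} pizza, coming up.`;
    speech.speak(reply, () => onDone(size), { language: "en-US" });
  }, { language: "en-US" });
}

// The mock invokes its callbacks synchronously, so "chosen" is set before
// the final log line runs. A real recognizer would be asynchronous.
const speech = createMockSpeech("  Large ");
let chosen;
orderPizza(speech, ["small", "medium", "large"], (size) => { chosen = size; });
console.log(chosen, speech.spoken);
```

Exact-match validation like this is only workable for very small
vocabularies. For anything larger, the grammar formats Olli points to
(http://www.w3.org/TR/speech-grammar/) would probably keep the script
simpler.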