[whatwg] Web API for speech recognition and synthesis

Thu Dec 3 04:06:05 PST 2009

On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking <jonas at sicking.cc> wrote:
> On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert <bringert at google.com> wrote:
>> I agree that being able to capture and upload audio to a server would
>> be useful for a lot of applications, and it could be used to do speech
>> recognition. However, for a web app developer who just wants to
>> develop an application that uses speech input and/or output, it
>> doesn't seem very convenient, since it requires server-side
>> infrastructure that is very costly to develop and run. A
>> speech-specific API in the browser gives browser implementors the
>> option to use on-device speech services provided by the OS, or
>> server-side speech synthesis/recognition.
>
> Again, it would help a lot if you could provide use cases and
> requirements. This helps both with designing an API, as well as
> evaluating if the use cases are common enough that a dedicated API is
> the best solution.
>
> / Jonas

I'm mostly thinking about speech web apps for mobile devices. I think
that's where speech makes the most sense as an input and output method,
because of the poor keyboards, small screens, and frequent situations
where the user's hands or eyes are busy (e.g. while driving).
Accessibility is the other big reason for using speech.

Some ideas for use cases:

- Search by speaking a query
- Speech-to-speech translation
- Voice dialing (could open a tel: URI to actually make the call)
- Dialog systems (e.g. the canonical pizza ordering system)
- Lightweight JavaScript browser extensions (e.g. Greasemonkey /
  Chrome extensions) for using speech with any web site, e.g. for
  accessibility.
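
To make the voice dialing case a bit more concrete, here is a rough
sketch of the step between recognition and placing the call. It assumes
the recognizer has already returned the number as a string of digits;
toTelUri is a made-up helper name, not part of any proposed API, and the
sketch does not try to handle local numbers properly (RFC 3966 wants a
phone-context for those).

```typescript
// Hypothetical helper: turn a recognized phone number into a tel: URI.
// Assumes the recognizer already produced digits, e.g. "+44 20 7946 0018".
function toTelUri(recognized: string): string | null {
  const trimmed = recognized.trim();
  // Keep only the digits; spaces, dashes and stray words are dropped.
  const digits = trimmed.replace(/\D/g, "");
  if (digits.length === 0) {
    return null; // nothing that looks like a phone number was recognized
  }
  // Preserve a leading "+" so international numbers stay international.
  const prefix = trimmed.startsWith("+") ? "+" : "";
  return `tel:${prefix}${digits}`;
}
```

A web app could then navigate to the returned URI to hand the call over
to the device's dialer.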

Requirements:

- Web app developer side:
   - Allows both speech recognition and synthesis.
   - Easy-to-use API. Makes simple things easy and advanced things possible.
   - Doesn't require the web app developer to develop / run their own
     speech recognition / synthesis servers.
   - (Natural) language-neutral API.
   - Allows developer-defined application specific grammars / language models.
   - Allows multilingual applications.
   - Allows easy localization of speech apps.

- Implementor side:
   - Easy enough to implement that it can get wide adoption in browsers.
   - Allows implementor to use either client-side or server-side
     recognition and synthesis.
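
As a very rough illustration of how these requirements might fit
together, here is a sketch in TypeScript. Every name in it
(SpeechBackend, RecognitionOptions, constrainToGrammar, CannedBackend,
orderPizza) is invented for this example and is not a proposal for the
actual API shape. The grammar is simplified to a plain list of phrases,
and the backend is a stand-in that returns a canned transcript. The
point is only that the app code talks to one interface, whether the
implementation behind it is on-device or server-side.

```typescript
// All names below are hypothetical; this is not an existing browser API.

type LanguageTag = string; // e.g. "en-GB", "sv-SE"

interface RecognitionOptions {
  language: LanguageTag;
  grammar?: string[]; // developer-defined phrases the app expects
}

// What the app sees. The browser decides whether this is backed by an
// on-device engine or a server-side service.
interface SpeechBackend {
  recognize(audio: Float32Array, options: RecognitionOptions): Promise<string | null>;
  synthesize(text: string, language: LanguageTag): Promise<Float32Array>;
}

// Restrict a raw transcript to the application's grammar, if one is given.
function constrainToGrammar(heard: string, grammar?: string[]): string | null {
  const normalized = heard.trim().toLowerCase();
  if (normalized.length === 0) return null;
  if (grammar === undefined) return normalized;
  const match = grammar.find((phrase) => phrase.toLowerCase() === normalized);
  return match ?? null;
}

// Stand-in backend that always "hears" the same transcript.
class CannedBackend implements SpeechBackend {
  private readonly transcript: string;

  constructor(transcript: string) {
    this.transcript = transcript;
  }

  async recognize(_audio: Float32Array, options: RecognitionOptions): Promise<string | null> {
    return constrainToGrammar(this.transcript, options.grammar);
  }

  async synthesize(text: string, _language: LanguageTag): Promise<Float32Array> {
    // One silent sample per character, just to return something audio-shaped.
    return new Float32Array(text.length);
  }
}

// Application code depends only on the interface, not on where recognition runs.
async function orderPizza(backend: SpeechBackend, audio: Float32Array): Promise<string> {
  const size = await backend.recognize(audio, {
    language: "en-GB",
    grammar: ["Small", "Medium", "Large"],
  });
  return size === null ? "Sorry, I didn't catch that." : `One ${size} pizza coming up.`;
}
```

Localizing the app would then mostly mean swapping the language tag and
the grammar phrases, with no change to the code that calls the backend.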

-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902