[whatwg] Web API for speech recognition and synthesis

Thu Dec 3 08:48:21 PST 2009

On Thu, Dec 3, 2009 at 4:06 AM, Bjorn Bringert <bringert at google.com> wrote:
> On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking <jonas at sicking.cc> wrote:
>> On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert <bringert at google.com> wrote:
>>> I agree that being able to capture and upload audio to a server would
>>> be useful for a lot of applications, and it could be used to do speech
>>> recognition. However, for a web app developer who just wants to
>>> develop an application that uses speech input and/or output, it
>>> doesn't seem very convenient, since it requires server-side
>>> infrastructure that is very costly to develop and run. A
>>> speech-specific API in the browser gives browser implementors the
>>> option to use on-device speech services provided by the OS, or
>>> server-side speech synthesis/recognition.
>>
>> Again, it would help a lot if you could provide use cases and
>> requirements. This helps both with designing an API and with
>> evaluating whether the use cases are common enough that a dedicated
>> API is the best solution.
>>
>> / Jonas
>
> I'm mostly thinking about speech web apps for mobile devices. I think
> that's where speech makes the most sense as an input and output method,
> because of the poor keyboards, small screens, and frequent hands/eyes
> busy situations (e.g. while driving). Accessibility is the other big
> reason for using speech.
>
> Some ideas for use cases:
>
> - Search by speaking a query
> - Speech-to-speech translation
> - Voice Dialing (could open a tel: URI to actually make the call)

<input type=search>, <input type=text> and <input type=tel> seem like
the correct solution for these. Nothing prevents UAs from allowing
speech rather than keyboard input into these (and I believe that most
do if you have AT tools installed).

> - Dialog systems (e.g. the canonical pizza ordering system)

I saw some pretty cool XHTML+Voice demos a few years ago that did
this. They didn't use speech-to-text scripting APIs though.

> - Lightweight JavaScript browser extensions (e.g. Greasemonkey /
> Chrome extensions) for using speech with any web site, e.g., for
> accessibility.

These would seem like APIs not exposed to webpages, but rather to
extensions. So WHATWG would be the wrong place to standardize them.
I'm also not convinced that this needs speech-to-text scripting APIs;
it seems to need only support for speech, rather than the keyboard,
as a text input method.

/ Jonas