[whatwg] Web API for speech recognition and synthesis

Thu Dec 3 08:21:20 PST 2009

On Dec 3, 2009, at 4:06 AM, Bjorn Bringert wrote:

> On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking <jonas at sicking.cc> wrote:
>> On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert <bringert at google.com> wrote:
>>> I agree that being able to capture and upload audio to a server would
>>> be useful for a lot of applications, and it could be used to do speech
>>> recognition. However, for a web app developer who just wants to
>>> develop an application that uses speech input and/or output, it
>>> doesn't seem very convenient, since it requires server-side
>>> infrastructure that is very costly to develop and run. A
>>> speech-specific API in the browser gives browser implementors the
>>> option to use on-device speech services provided by the OS, or
>>> server-side speech synthesis/recognition.
>> 
>> Again, it would help a lot if you could provide use cases and
>> requirements. This helps both with designing an API, as well as
>> evaluating if the use cases are common enough that a dedicated API is
>> the best solution.
>> 
>> / Jonas
> 
> I'm mostly thinking about speech web apps for mobile devices. I think
> that's where speech makes most sense as an input and output method,
> because of the poor keyboards, small screens, and frequent hands/eyes
> busy situations (e.g. while driving). Accessibility is the other big
> reason for using speech.
Accessibility is already handled through ARIA and the host platform's accessibility features.

> 
> Some ideas for use cases:
> 
> - Search by speaking a query
> - Speech-to-speech translation
> - Voice Dialing (could open a tel: URI to actually make the call)
> - Dialog systems (e.g. the canonical pizza ordering system)
> - Lightweight JavaScript browser extensions (e.g. Greasemonkey /
> Chrome extensions) for using speech with any web site, e.g., for
> accessibility.

I am unsure why the site should be directly responsible for things like audio-based accessibility. What do you believe a site should do itself, rather than leaving it to the accessibility services provided by the host OS?

> 
> Requirements:
> 
> - Web app developer side:
>   - Allows both speech recognition and synthesis.
ARIA (in conjunction with the OS accessibility services) already provides accessibility-focused text-to-speech (I am unsure about the recognition side).
> 
>   - Doesn't require web app developer to develop / run his own speech
> recognition / synthesis servers.
This would seem to amount to "use the OS services".
> 
> - Implementor side:
>   - Easy enough to implement that it can get wide adoption in browsers.
These services are not simple -- any implementation would seem to be a significant amount of work, especially if you want to a) actually be good at it and b) interact with the host OS's native accessibility features.

>   - Allows implementor to use either client-side or server-side
> recognition and synthesis.
I honestly have no idea what you mean by this.

--Oliver
