[whatwg] Web API for speech recognition and synthesis

Ian McGraw imcgraw at mit.edu
Sun Dec 13 10:46:46 PST 2009

I'm new to this list, but as a speech-scientist and web developer, I wanted
to add my 2 cents.  Personally, I believe the future of speech recognition
is in the cloud.

Here are two services which provide JavaScript APIs for speech recognition
(and TTS) today:


Both of these are research systems, and as such they are really just
That said, Wami's JSONP-like implementation allows Quizlet.com to use speech
recognition today on a relatively large scale, with just a few lines of
JavaScript code:


Since there are a lot of Google folks on this list, I recommend you talk to
Alex Gruenstein (in your speech group) who was one of the lead developers of
WAMI while at MIT.

The major limitation we found when building the system was that we had to
develop a new audio controller for every client (Java for the desktop,
custom browsers for iPhone and Android).  It would have been much simpler if
browsers came with standard microphone capture and audio streaming


On Sun, Dec 13, 2009 at 12:07 PM, Weston Ruter <westonruter at gmail.com> wrote:

> I blogged yesterday about this topic (including a text-to-speech demo using
> HTML5 Audio and Google Translate's TTS service); the more relevant part for
> this thread: <http://weston.ruter.net/projects/google-tts/>
>> I am really excited at the prospect of text-to-speech being made available
>> on the Web! It's just too bad that fetching MP3s from a remote web service
>> is the only standard way of doing so currently; modern operating systems
>> all have TTS capabilities, so it's a shame that web apps can't utilize
>> them via client-side scripting. I posted to the WHATWG mailing list about
>> such a Text-To-Speech (TTS) Web API for JavaScript, and I was directed to
>> a recent thread about a Web API for speech recognition and synthesis.
>> Perhaps there is some momentum building here? Having TTS available in the
>> browser would boost accessibility for the seeing-impaired and improve
>> usability for people on-the-go. TTS is just another technology that has
>> traditionally been relegated to desktop applications, but as the open Web
>> advances as the preferred platform for application development, it is an
>> essential service to make available (as with Geolocation API, Device API,
>> etc.). And besides, I want to build TTS applications and my motto is: "If
>> it can't be done on the open web, it's not worth doing at all"!
> http://weston.ruter.net/projects/google-tts/
> Weston
> On Fri, Dec 11, 2009 at 1:35 PM, Weston Ruter <westonruter at gmail.com> wrote:
>> I was just alerted about this thread from my post "Text-To-Speech (TTS)
>> Web API for JavaScript" at <
>> http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html>.
>> Amazing how shared ideas like these seem to arise independently at the same
>> time.
>> I have a use-case and an additional requirement, that the time indices be
>> made available for when each word is spoken in the TTS-generated audio:
>>> I've been working on a web app which reads text in a web page,
>>> highlighting each word as it is read. For this to be possible, a
>>> Text-To-Speech API is needed which is able to:
>>> (1) generate the speech audio from some text, and
>>> (2) include the time indices for when each of the words in the text is
>>> spoken.
>> I foresee that a TTS API should integrate closely with the HTML5 Audio
>> API. For example, invoking a call to the API could return a "TTS" object
>> which has an instance of Audio, whose interface could be used to navigate
>> through the TTS output. For example:
>> var tts = new TextToSpeech("Hello, World!");
>> tts.audio.addEventListener("canplaythrough", function(e){
>>     // tts.indices == [{startTime:0, endTime:500, text:"Hello"},
>>     //                 {startTime:500, endTime:1000, text:"World"}]
>> }, false);
>> tts.read(); //invokes tts.audio.play
>> What would be even cooler is if the parameter passed to the TextToSpeech
>> constructor could be an Element or TextNode, and the indices would then
>> include a DOM Range in addition to the "text" property. A flag could also be
>> set which would result in each of these DOM ranges being selected when it is
>> read. For example:
>> var tts = new TextToSpeech(document.querySelector("article"));
>> tts.selectRangesOnRead = true;
>> tts.audio.addEventListener("canplaythrough", function(e){
>>     /*
>>     tts.indices == [
>>         {startTime:0, endTime:500, text:"Hello", range:Range},
>>         {startTime:500, endTime:1000, text:"World", range:Range}
>>     ]
>>     */
>> }, false);
>> tts.read();
>> In addition to the events fired by the Audio API, more events could be
>> fired when reading TTS, such as a "readrange" event whose event object would
>> include the index (startTime, endTime, text, range) for the range currently
>> being spoken. Such functionality would make the ability to "read along" with
>> the text trivial.
>> What do you think?
>> Weston
>> On Thu, Dec 3, 2009 at 4:06 AM, Bjorn Bringert <bringert at google.com> wrote:
>>> On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking <jonas at sicking.cc> wrote:
>>> > On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert <bringert at google.com> wrote:
>>> >> I agree that being able to capture and upload audio to a server would
>>> >> be useful for a lot of applications, and it could be used to do speech
>>> >> recognition. However, for a web app developer who just wants to
>>> >> develop an application that uses speech input and/or output, it
>>> >> doesn't seem very convenient, since it requires server-side
>>> >> infrastructure that is very costly to develop and run. A
>>> >> speech-specific API in the browser gives browser implementors the
>>> >> option to use on-device speech services provided by the OS, or
>>> >> server-side speech synthesis/recognition.
>>> >
>>> > Again, it would help a lot if you could provide use cases and
>>> > requirements. This helps both with designing an API, as well as
>>> > evaluating if the use cases are common enough that a dedicated API is
>>> > the best solution.
>>> >
>>> > / Jonas
>>> I'm mostly thinking about speech web apps for mobile devices. I think
>>> that's where speech makes most sense as an input and output method,
>>> because of the poor keyboards, small screens, and frequent hands/eyes
>>> busy situations (e.g. while driving). Accessibility is the other big
>>> reason for using speech.
>>> Some ideas for use cases:
>>> - Search by speaking a query
>>> - Speech-to-speech translation
>>> - Voice Dialing (could open a tel: URI to actually make the call)
>>> - Dialog systems (e.g. the canonical pizza ordering system)
>>> - Lightweight JavaScript browser extensions (e.g. Greasemonkey /
>>> Chrome extensions) for using speech with any web site, e.g., for
>>> accessibility.
>>> Requirements:
>>> - Web app developer side:
>>>   - Allows both speech recognition and synthesis.
>>>   - Easy to use API. Makes simple things easy and advanced things
>>> possible.
>>>   - Doesn't require web app developer to develop / run his own speech
>>> recognition / synthesis servers.
>>>   - (Natural) language-neutral API.
>>>   - Allows developer-defined application specific grammars / language
>>> models.
>>>   - Allows multilingual applications.
>>>   - Allows easy localization of speech apps.
>>> - Implementor side:
>>>   - Easy enough to implement that it can get wide adoption in browsers.
>>>   - Allows implementor to use either client-side or server-side
>>> recognition and synthesis.
>>> --
>>> Bjorn Bringert
>>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>>> Palace Road, London, SW1W 9TQ
>>> Registered in England Number: 3977902
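
To make the time-index idea in Weston's quoted message a little more
concrete, here is a minimal sketch of the "read along" lookup it implies:
given a list shaped like the quoted tts.indices, find the entry being spoken
at a given playback time. Nothing here is part of any proposed or shipped
API. The names findSpokenIndex and createReadTracker are hypothetical, and
the sketch assumes the indices are in milliseconds, sorted by startTime, and
non-overlapping, none of which the quoted proposal actually states.

```javascript
// Hypothetical helper, for illustration only.
// `indices` follows the shape in the quoted proposal, e.g.
// [{startTime: 0, endTime: 500, text: "Hello"}, ...].
// Assumption: times are milliseconds, sorted, and non-overlapping.
function findSpokenIndex(indices, timeMs) {
    var lo = 0;
    var hi = indices.length - 1;
    while (lo <= hi) {
        var mid = (lo + hi) >> 1;
        var entry = indices[mid];
        if (timeMs < entry.startTime) {
            hi = mid - 1;
        } else if (timeMs >= entry.endTime) {
            lo = mid + 1;
        } else {
            return mid; // startTime <= timeMs < endTime
        }
    }
    return -1; // no entry covers this time
}

// Hypothetical tracker: calls onReadRange(entry) each time playback moves
// into a different entry, roughly what a "readrange" event might deliver.
function createReadTracker(indices, onReadRange) {
    var current = -1;
    return function update(timeMs) {
        var i = findSpokenIndex(indices, timeMs);
        if (i !== -1 && i !== current) {
            onReadRange(indices[i]);
        }
        current = i;
    };
}

// In a browser this could be driven by an HTML5 audio element, whose
// currentTime is reported in seconds:
//   var update = createReadTracker(indices, highlightWord);
//   audio.addEventListener("timeupdate", function () {
//       update(audio.currentTime * 1000);
//   }, false);
// (highlightWord is a placeholder for the page's own highlighting code.)
```

A binary search is used only because a long article could produce thousands
of entries; for short utterances a linear scan would do just as well.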