[whatwg] Web API for speech recognition and synthesis

Tue Dec 15 12:25:54 PST 2009

It seems like there is enough interest in speech to start developing
experimental implementations. There appear to be two general
directions that we could take:

- A general microphone API + streaming API + audio tag
  - Pro: Useful for applications other than speech recognition / synthesis.
         E.g. audio chat, sound recording.
  - Pro: Allows JavaScript libraries for third-party network speech services.
         E.g. an AJAX API for Google's speech services. Web app developers
         that don't have their own speech servers could use that.
  - Pro: Consistent recognition / synthesis user experience across
         user agents in the same web app.
  - Con: No support for on-device recognition / synthesis, only
         network services.
  - Con: Varying recognition / synthesis user experience across
         different web apps in a single user agent.
  - Con: Possibly higher overhead because the audio data needs to
         pass through JavaScript.
  - Con: Requires dealing with audio encodings, endpointing, buffer
         sizes, etc. in the microphone API.

- A speech-specific back-end neutral API
  - Pro: Simple API, basically just two methods: listen() and speak().
  - Pro: Can use local recognition / synthesis.
  - Pro: Consistent recognition / synthesis user experience across
         different web apps in a single user agent.
  - Con: Varying recognition / synthesis user experience across user
         agents in the same web app.
  - Con: Only works for speech, not general audio.
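
To make the second option a bit more concrete, here is a rough sketch of
what a back-end neutral listen() / speak() pair might look like. Every
name in it (createSpeechApi, createFakeLocalBackend, recognize,
synthesize, the shape of the result object) is made up for illustration
and is not an existing browser API. The fake back end stands in for
whatever on-device or network service the user agent would choose.

```javascript
// Hypothetical sketch: none of these names come from a real browser API.
// A back end is any object with recognize(options, callback) and
// synthesize(text, options, callback). The user agent, not the web app,
// would decide which back end sits behind listen() and speak().

function createSpeechApi(backend) {
  return {
    // Ask the back end for one utterance and pass the result to onResult.
    listen: function (options, onResult) {
      backend.recognize(options || {}, onResult);
    },
    // Ask the back end to speak the text, then call onDone (if given).
    speak: function (text, options, onDone) {
      backend.synthesize(String(text), options || {}, onDone || function () {});
    }
  };
}

// Stand-in for an on-device engine: it returns a canned transcript and
// records what it was asked to say instead of producing audio.
function createFakeLocalBackend(cannedTranscript) {
  var spoken = [];
  return {
    spoken: spoken,
    recognize: function (options, callback) {
      callback({ transcript: cannedTranscript, lang: options.lang || "en-US" });
    },
    synthesize: function (text, options, callback) {
      spoken.push(text);
      callback();
    }
  };
}

var backend = createFakeLocalBackend("pizza with mushrooms");
var speech = createSpeechApi(backend);

// The web app only sees listen() and speak(); swapping the back end for a
// network service would not change this code.
var heard = null;
speech.listen({ lang: "en-GB" }, function (result) {
  heard = result;
  speech.speak("You said: " + result.transcript);
});
```

The point of the sketch is only that the web app never touches audio
data or encodings. A real design would still need to settle error
handling, grammars / language models, and permission prompts.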

/Bjorn

On Sun, Dec 13, 2009 at 6:46 PM, Ian McGraw <imcgraw at mit.edu> wrote:
> I'm new to this list, but as a speech-scientist and web developer, I wanted
> to add my 2 cents.  Personally, I believe the future of speech recognition
> is in the cloud.
> Here are two services which provide JavaScript APIs for speech recognition
> (and TTS) today:
> http://wami.csail.mit.edu/
> http://www.research.att.com/projects/SpeechMashup/index.html
> Both of these are research systems, and as such they are really just
> proofs of concept.
> That said, Wami's JSONP-like implementation allows Quizlet.com to use speech
> recognition today on a relatively large scale, with just a few lines of
> JavaScript code:
> http://quizlet.com/voicetest/415/?scatter
> Since there are a lot of Google folks on this list, I recommend you talk to
> Alex Gruenstein (in your speech group) who was one of the lead developers of
> WAMI while at MIT.
> The major limitation we found when building the system was that we had to
> develop a new audio controller for every client (Java for the desktop,
> custom browsers for iPhone and Android).  It would have been much simpler if
> browsers came with standard microphone capture and audio streaming
> capabilities.
> -Ian
>
>
> On Sun, Dec 13, 2009 at 12:07 PM, Weston Ruter <westonruter at gmail.com>
> wrote:
>>
>> I blogged yesterday about this topic (including a text-to-speech demo
>> using HTML5 Audio and Google Translate's TTS service); the more relevant
>> part for this thread:
>>
>>> I am really excited at the prospect of text-to-speech being made
>>> available on the Web! It's just too bad that fetching MP3s from a
>>> remote web service is the only standard way of doing so currently;
>>> modern operating systems all have TTS capabilities, so it's a shame
>>> that web apps can't utilize them via client-side scripting. I posted
>>> to the WHATWG mailing list about such a Text-To-Speech (TTS) Web API
>>> for JavaScript, and I was directed to a recent thread about a Web API
>>> for speech recognition and synthesis.
>>>
>>> Perhaps there is some momentum building here? Having TTS available in
>>> the browser would boost accessibility for the seeing-impaired and
>>> improve usability for people on-the-go. TTS is just another technology
>>> that has traditionally been relegated to desktop applications, but as
>>> the open Web advances as the preferred platform for application
>>> development, it is an essential service to make available (as with
>>> Geolocation API, Device API, etc.). And besides, I want to build TTS
>>> applications and my motto is: "If it can't be done on the open web,
>>> it's not worth doing at all"!
>>
>> http://weston.ruter.net/projects/google-tts/
>>
>> Weston
>>
>> On Fri, Dec 11, 2009 at 1:35 PM, Weston Ruter <westonruter at gmail.com>
>> wrote:
>>>
>>> I was just alerted about this thread from my post "Text-To-Speech (TTS)
>>> Web API for JavaScript" at
>>> <http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html>.
>>> Amazing how shared ideas like these seem to arise independently at the same
>>> time.
>>>
>>> I have a use-case and an additional requirement, that the time indices be
>>> made available for when each word is spoken in the TTS-generated audio:
>>>
>>>> I've been working on a web app which reads text in a web page,
>>>> highlighting each word as it is read. For this to be possible, a
>>>> Text-To-Speech API is needed which is able to:
>>>> (1) generate the speech audio from some text, and
>>>> (2) include the time indices for when each of the words in the text is
>>>> spoken.
>>>
>>> I foresee that a TTS API should integrate closely with the HTML5 Audio
>>> API. For example, invoking a call to the API could return a "TTS" object
>>> which has an instance of Audio, whose interface could be used to navigate
>>> through the TTS output. For example:
>>>
>>> var tts = new TextToSpeech("Hello, World!");
>>> tts.audio.addEventListener("canplaythrough", function(e){
>>>     // tts.indices == [
>>>     //     {startTime:0, endTime:500, text:"Hello"},
>>>     //     {startTime:500, endTime:1000, text:"World"}
>>>     // ]
>>> }, false);
>>> tts.read(); //invokes tts.audio.play
>>>
>>> What would be even cooler is if the parameter passed to the TextToSpeech
>>> constructor could be an Element or TextNode, and the indices would then
>>> include a DOM Range in addition to the "text" property. A flag could also be
>>> set which would cause each of these DOM ranges to be selected when it is
>>> read. For example:
>>>
>>> var tts = new TextToSpeech(document.querySelector("article"));
>>> tts.selectRangesOnRead = true;
>>> tts.audio.addEventListener("canplaythrough", function(e){
>>>     /*
>>>     tts.indices == [
>>>         {startTime:0, endTime:500, text:"Hello", range:Range},
>>>         {startTime:500, endTime:1000, text:"World", range:Range}
>>>     ]
>>>     */
>>> }, false);
>>> tts.read();
>>>
>>> In addition to the events fired by the Audio API, more events could be
>>> fired when reading TTS, such as a "readrange" event whose event object would
>>> include the index (startTime, endTime, text, range) for the range currently
>>> being spoken. Such functionality would make the ability to "read along" with
>>> the text trivial.
>>>
>>> What do you think?
>>> Weston
>>>
>>> On Thu, Dec 3, 2009 at 4:06 AM, Bjorn Bringert <bringert at google.com>
>>> wrote:
>>>>
>>>> On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking <jonas at sicking.cc> wrote:
>>>> > On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert <bringert at google.com>
>>>> > wrote:
>>>> >> I agree that being able to capture and upload audio to a server would
>>>> >> be useful for a lot of applications, and it could be used to do
>>>> >> speech recognition. However, for a web app developer who just wants to
>>>> >> develop an application that uses speech input and/or output, it
>>>> >> doesn't seem very convenient, since it requires server-side
>>>> >> infrastructure that is very costly to develop and run. A
>>>> >> speech-specific API in the browser gives browser implementors the
>>>> >> option to use on-device speech services provided by the OS, or
>>>> >> server-side speech synthesis/recognition.
>>>> >
>>>> > Again, it would help a lot if you could provide use cases and
>>>> > requirements. This helps both with designing an API, as well as
>>>> > evaluating if the use cases are common enough that a dedicated API is
>>>> > the best solution.
>>>> >
>>>> > / Jonas
>>>>
>>>> I'm mostly thinking about speech web apps for mobile devices. I think
>>>> that's where speech makes most sense as an input and output method,
>>>> because of the poor keyboards, small screens, and frequent hands/eyes
>>>> busy situations (e.g. while driving). Accessibility is the other big
>>>> reason for using speech.
>>>>
>>>> Some ideas for use cases:
>>>>
>>>> - Search by speaking a query
>>>> - Speech-to-speech translation
>>>> - Voice Dialing (could open a tel: URI to actually make the call)
>>>> - Dialog systems (e.g. the canonical pizza ordering system)
>>>> - Lightweight JavaScript browser extensions (e.g. Greasemonkey /
>>>> Chrome extensions) for using speech with any web site, e.g. for
>>>> accessibility.
>>>>
>>>> Requirements:
>>>>
>>>> - Web app developer side:
>>>>   - Allows both speech recognition and synthesis.
>>>>   - Easy to use API. Makes simple things easy and advanced things
>>>> possible.
>>>>   - Doesn't require web app developer to develop / run his own speech
>>>> recognition / synthesis servers.
>>>>   - (Natural) language-neutral API.
>>>>   - Allows developer-defined application specific grammars / language
>>>> models.
>>>>   - Allows multilingual applications.
>>>>   - Allows easy localization of speech apps.
>>>>
>>>> - Implementor side:
>>>>   - Easy enough to implement that it can get wide adoption in browsers.
>>>>   - Allows implementor to use either client-side or server-side
>>>> recognition and synthesis.
>>>>
>>>> --
>>>> Bjorn Bringert
>>>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>>>> Palace Road, London, SW1W 9TQ
>>>> Registered in England Number: 3977902
>>>
>>
>
>

-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902