[whatwg] Web API for speech recognition and synthesis

Fri Dec 11 13:35:18 PST 2009

I was just alerted about this thread from my post "Text-To-Speech (TTS) Web
API for JavaScript" at <
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html>.
Amazing how shared ideas like these seem to arise independently at the same
time.

I have a use case and an additional requirement, namely that time indices be
made available for when each word is spoken in the TTS-generated audio:

> I've been working on a web app which reads text in a web page, highlighting
> each word as it is read. For this to be possible, a Text-To-Speech API is
> needed which is able to:
> (1) generate the speech audio from some text, and
> (2) include the time indices for when each of the words in the text is
> spoken.

I foresee that a TTS API should integrate closely with the HTML5 Audio API.
For example, invoking a call to the API could return a "TTS" object which
has an instance of Audio, whose interface could be used to navigate through
the TTS output. For example:

var tts = new TextToSpeech("Hello, World!");
tts.audio.addEventListener("canplaythrough", function(e){
    // tts.indices == [{startTime:0, endTime:500, text:"Hello"},
    //                 {startTime:500, endTime:1000, text:"World"}]
}, false);
tts.read(); // invokes tts.audio.play()
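
To illustrate how a page might consume such an indices array, here is a
minimal sketch of a lookup from a playback position to the word being
spoken. The findIndexAt helper is hypothetical and not part of any proposed
API, and I am assuming the times are milliseconds from the start of the
audio (the Audio element's currentTime is reported in seconds, so it would
need converting first):

```javascript
// Hypothetical data in the shape of the proposed tts.indices array.
// Times are assumed to be milliseconds from the start of the audio.
var indices = [
    {startTime: 0, endTime: 500, text: "Hello"},
    {startTime: 500, endTime: 1000, text: "World"}
];

// Returns the entry being spoken at timeMs, or null if none is.
// Each interval is treated as half-open: [startTime, endTime).
function findIndexAt(indices, timeMs) {
    for (var i = 0; i < indices.length; i++) {
        if (timeMs >= indices[i].startTime && timeMs < indices[i].endTime) {
            return indices[i];
        }
    }
    return null;
}
```

A linear scan is enough for a sentence or two; a long article would probably
want a binary search over the sorted start times instead.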

What would be even cooler is if the parameter passed to the TextToSpeech
constructor could be an Element or TextNode, and the indices would then
include a DOM Range in addition to the "text" property. A flag could also be
set which would cause each of these DOM ranges to be selected as it is
read. For example:

var tts = new TextToSpeech(document.querySelector("article"));
tts.selectRangesOnRead = true;
tts.audio.addEventListener("canplaythrough", function(e){
    /*
    tts.indices == [
        {startTime:0, endTime:500, text:"Hello", range:Range},
        {startTime:500, endTime:1000, text:"World", range:Range}
    ]
    */
}, false);
tts.read();

In addition to the events fired by the Audio API, more events could be fired
when reading TTS, such as a "readrange" event whose event object would
include the index (startTime, endTime, text, range) for the range currently
being spoken. Such functionality would make it trivial to "read along" with
the text.
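
As a rough sketch of the behaviour I have in mind for "readrange", the
following emulates it in plain script: a callback fires each time playback
moves into a different index. The createReadRangeTracker name is made up for
illustration, the times are again assumed to be milliseconds, and no real TTS
engine or audio element is involved:

```javascript
// Hypothetical helper: calls onReadRange(index) each time playback moves
// into a different entry of an indices array (times assumed in milliseconds).
function createReadRangeTracker(indices, onReadRange) {
    var current = null;
    // Call the returned function with the playback position, e.g. from a
    // "timeupdate" listener on the audio element.
    return function update(timeMs) {
        var found = null;
        for (var i = 0; i < indices.length; i++) {
            if (timeMs >= indices[i].startTime && timeMs < indices[i].endTime) {
                found = indices[i];
                break;
            }
        }
        // Fire only on entering a new entry, not on every position update.
        if (found !== null && found !== current) {
            onReadRange(found);
        }
        current = found;
    };
}

// Usage with made-up data and simulated playback positions.
var spoken = [];
var update = createReadRangeTracker([
    {startTime: 0, endTime: 500, text: "Hello"},
    {startTime: 500, endTime: 1000, text: "World"}
], function (index) { spoken.push(index.text); });

[0, 200, 499, 500, 900, 1000].forEach(update);
```

A native event would presumably be more precise than polling the playback
position like this, which is part of why I would like the browser to fire it.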

What do you think?
Weston

On Thu, Dec 3, 2009 at 4:06 AM, Bjorn Bringert <bringert at google.com> wrote:

> On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking <jonas at sicking.cc> wrote:
> > On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert <bringert at google.com>
> wrote:
> >> I agree that being able to capture and upload audio to a server would
> >> be useful for a lot of applications, and it could be used to do speech
> >> recognition. However, for a web app developer who just wants to
> >> develop an application that uses speech input and/or output, it
> >> doesn't seem very convenient, since it requires server-side
> >> infrastructure that is very costly to develop and run. A
> >> speech-specific API in the browser gives browser implementors the
> >> option to use on-device speech services provided by the OS, or
> >> server-side speech synthesis/recognition.
> >
> > Again, it would help a lot if you could provide use cases and
> > requirements. This helps both with designing an API, as well as
> > evaluating if the use cases are common enough that a dedicated API is
> > the best solution.
> >
> > / Jonas
>
> I'm mostly thinking about speech web apps for mobile devices. I think
> that's where speech makes most sense as an input and output method,
> because of the poor keyboards, small screens, and frequent hands/eyes
> busy situations (e.g. while driving). Accessibility is the other big
> reason for using speech.
>
> Some ideas for use cases:
>
> - Search by speaking a query
> - Speech-to-speech translation
> - Voice Dialing (could open a tel: URI to actually make the call)
> - Dialog systems (e.g. the canonical pizza ordering system)
> - Lightweight JavaScript browser extensions (e.g. Greasemonkey /
> Chrome extensions) for using speech with any web site, e.g., for
> accessibility.
>
> Requirements:
>
> - Web app developer side:
>   - Allows both speech recognition and synthesis.
>   - Easy to use API. Makes simple things easy and advanced things possible.
>   - Doesn't require web app developer to develop / run his own speech
> recognition / synthesis servers.
>   - (Natural) language-neutral API.
>   - Allows developer-defined application specific grammars / language
> models.
>   - Allows multilingual applications.
>   - Allows easy localization of speech apps.
>
> - Implementor side:
>   - Easy enough to implement that it can get wide adoption in browsers.
>   - Allows implementor to use either client-side or server-side
> recognition and synthesis.
>
> --
> Bjorn Bringert
> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
> Palace Road, London, SW1W 9TQ
> Registered in England Number: 3977902
>