[whatwg] Web API for speech recognition and synthesis

Sun Dec 13 09:07:07 PST 2009

I blogged yesterday about this topic (including a text-to-speech demo using
HTML5 Audio and Google Translate's TTS service); the more relevant part for
this thread: <http://weston.ruter.net/projects/google-tts/>

> I am really excited at the prospect of text-to-speech being made available
> on the Web! It's just too bad that fetching MP3s from a remote web service
> is the only standard way of doing so currently; modern operating systems
> all have TTS capabilities, so it's a shame that web apps can't utilize them
> via client-side scripting. I posted to the WHATWG mailing list about such a
> Text-To-Speech (TTS) Web API for JavaScript, and I was directed to a recent
> thread about a Web API for speech recognition and synthesis.
>
> Perhaps there is some momentum building here? Having TTS available in the
> browser would boost accessibility for the seeing-impaired and improve
> usability for people on-the-go. TTS is just another technology that has
> traditionally been relegated to desktop applications, but as the open Web
> advances as the preferred platform for application development, it is an
> essential service to make available (as with Geolocation API, Device API,
> etc.). And besides, I want to build TTS applications and my motto is: "If
> it can't be done on the open web, it's not worth doing at all"!
>

http://weston.ruter.net/projects/google-tts/

Weston

On Fri, Dec 11, 2009 at 1:35 PM, Weston Ruter <westonruter at gmail.com> wrote:

> I was just alerted about this thread from my post "Text-To-Speech (TTS) Web
> API for JavaScript" at <
> http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html>.
> Amazing how shared ideas like these seem to arise independently at the same
> time.
>
> I have a use-case and an additional requirement, that the time indices be
> made available for when each word is spoken in the TTS-generated audio:
>
>> I've been working on a web app which reads text in a web page, highlighting
>> each word as it is read. For this to be possible, a Text-To-Speech API is
>> needed which is able to:
>> (1) generate the speech audio from some text, and
>> (2) include the time indices for when each of the words in the text is
>> spoken.
>>
>
> I foresee that a TTS API should integrate closely with the HTML5 Audio API.
> Calling the API could, for instance, return a "TTS" object which holds an
> instance of Audio, whose interface could be used to navigate through the
> TTS output. For example:
>
> var tts = new TextToSpeech("Hello, World!");
> tts.audio.addEventListener("canplaythrough", function(e){
>     // tts.indices == [
>     //     {startTime:0, endTime:500, text:"Hello"},
>     //     {startTime:500, endTime:1000, text:"World"}
>     // ]
> }, false);
> tts.read(); // invokes tts.audio.play()
>
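To make the intended use of such time indices concrete, here is a minimal sketch of looking up the word being spoken at a given playback position. The `wordAt` helper is hypothetical and not part of any proposed API; the sketch assumes the indices are sorted, non-overlapping, and expressed in milliseconds, as the sample values above suggest.

```javascript
// Hypothetical data in the shape proposed above; times assumed to be in milliseconds.
const indices = [
  { startTime: 0, endTime: 500, text: "Hello" },
  { startTime: 500, endTime: 1000, text: "World" },
];

// Binary search for the entry whose [startTime, endTime) interval contains timeMs.
// Assumes the entries are sorted by startTime and do not overlap.
// Returns null when no word is being spoken at that time.
function wordAt(entries, timeMs) {
  let lo = 0;
  let hi = entries.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const entry = entries[mid];
    if (timeMs < entry.startTime) {
      hi = mid - 1;
    } else if (timeMs >= entry.endTime) {
      lo = mid + 1;
    } else {
      return entry;
    }
  }
  return null;
}

console.log(wordAt(indices, 250).text); // prints "Hello"
console.log(wordAt(indices, 500).text); // prints "World"
```

In a browser, the lookup time would probably come from the audio element's `currentTime`, which is reported in seconds and would need to be multiplied by 1000 under the millisecond assumption.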
> What would be even cooler is if the parameter passed to the TextToSpeech
> constructor could be an Element or TextNode, and the indices would then
> include a DOM Range in addition to the "text" property. A flag could also be
> set which would cause each of these DOM ranges to be selected as it is
> read. For example:
>
> var tts = new TextToSpeech(document.querySelector("article"));
> tts.selectRangesOnRead = true;
> tts.audio.addEventListener("canplaythrough", function(e){
>     /*
>     tts.indices == [
>         {startTime:0, endTime:500, text:"Hello", range:Range},
>         {startTime:500, endTime:1000, text:"World", range:Range}
>     ]
>     */
> }, false);
> tts.read();
>
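As a rough illustration of where the per-word ranges could come from, the sketch below computes character offsets for each word in a string. The `wordOffsets` name and the word-matching rule (runs of letters, digits, and apostrophes) are assumptions made for this example; a real implementation would need language-aware segmentation.

```javascript
// Split text into words and record each word's character offsets.
// The regular expression is a simplification chosen for this sketch.
function wordOffsets(text) {
  const offsets = [];
  for (const match of text.matchAll(/[\p{L}\p{N}']+/gu)) {
    offsets.push({
      text: match[0],
      start: match.index,
      end: match.index + match[0].length,
    });
  }
  return offsets;
}

console.log(wordOffsets("Hello, World!"));
// prints two entries: "Hello" at offsets 0-5 and "World" at offsets 7-12
```

In a browser, each pair of offsets could then be turned into a DOM Range on a text node with `document.createRange()`, `range.setStart(node, start)`, and `range.setEnd(node, end)`.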
> In addition to the events fired by the Audio API, more events could be
> fired when reading TTS, such as a "readrange" event whose event object would
> include the index (startTime, endTime, text, range) for the range currently
> being spoken. Such functionality would make it trivial to "read along" with
> the text.
>
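Until something like a "readrange" event exists, similar behaviour could be approximated in script. The sketch below is one possible approach under stated assumptions: `createReadRangeNotifier` is a hypothetical helper, times are assumed to be in milliseconds, and the callback stands in for the proposed event.

```javascript
// Returns an update(timeMs) function that calls onReadRange(entry) once
// each time playback enters a new entry of the indices list.
function createReadRangeNotifier(indices, onReadRange) {
  let current = null;
  return function update(timeMs) {
    const entry =
      indices.find((e) => timeMs >= e.startTime && timeMs < e.endTime) || null;
    if (entry !== null && entry !== current) {
      onReadRange(entry);
    }
    current = entry;
  };
}

// Simulated playback positions standing in for real audio progress.
const sample = [
  { startTime: 0, endTime: 500, text: "Hello" },
  { startTime: 500, endTime: 1000, text: "World" },
];
const spoken = [];
const update = createReadRangeNotifier(sample, (entry) => spoken.push(entry.text));
[0, 100, 600, 700, 1200].forEach((t) => update(t));
console.log(spoken); // prints [ 'Hello', 'World' ]
```

In a page, `update` could be driven from the media element's `timeupdate` event with `audio.currentTime * 1000`. That event fires only a few times per second, so short words might be skipped, which is one reason a native event would be preferable.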
> What do you think?
> Weston
>
>
> On Thu, Dec 3, 2009 at 4:06 AM, Bjorn Bringert <bringert at google.com> wrote:
>
>> On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking <jonas at sicking.cc> wrote:
>> > On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert <bringert at google.com>
>> wrote:
>> >> I agree that being able to capture and upload audio to a server would
>> >> be useful for a lot of applications, and it could be used to do speech
>> >> recognition. However, for a web app developer who just wants to
>> >> develop an application that uses speech input and/or output, it
>> >> doesn't seem very convenient, since it requires server-side
>> >> infrastructure that is very costly to develop and run. A
>> >> speech-specific API in the browser gives browser implementors the
>> >> option to use on-device speech services provided by the OS, or
>> >> server-side speech synthesis/recognition.
>> >
>> > Again, it would help a lot if you could provide use cases and
>> > requirements. This helps both with designing an API, as well as
>> > evaluating if the use cases are common enough that a dedicated API is
>> > the best solution.
>> >
>> > / Jonas
>>
>> I'm mostly thinking about speech web apps for mobile devices. I think
>> that's where speech makes most sense as an input and output method,
>> because of the poor keyboards, small screens, and frequent hands/eyes
>> busy situations (e.g. while driving). Accessibility is the other big
>> reason for using speech.
>>
>> Some ideas for use cases:
>>
>> - Search by speaking a query
>> - Speech-to-speech translation
>> - Voice Dialing (could open a tel: URI to actually make the call)
>> - Dialog systems (e.g. the canonical pizza ordering system)
>> - Lightweight JavaScript browser extensions (e.g. Greasemonkey /
>> Chrome extensions) for using speech with any web site, e.g., for
>> accessibility.
>>
>> Requirements:
>>
>> - Web app developer side:
>>   - Allows both speech recognition and synthesis.
>>   - Easy to use API. Makes simple things easy and advanced things
>> possible.
>>   - Doesn't require web app developer to develop / run his own speech
>> recognition / synthesis servers.
>>   - (Natural) language-neutral API.
>>   - Allows developer-defined application specific grammars / language
>> models.
>>   - Allows multilingual applications.
>>   - Allows easy localization of speech apps.
>>
>> - Implementor side:
>>   - Easy enough to implement that it can get wide adoption in browsers.
>>   - Allows implementor to use either client-side or server-side
>> recognition and synthesis.
>>
>> --
>> Bjorn Bringert
>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>> Palace Road, London, SW1W 9TQ
>> Registered in England Number: 3977902
>>
>
>