<div id=":us" class="ii gt"><div>I was just alerted about this thread from my post "Text-To-Speech (TTS) Web API for JavaScript" at <<a href="http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html">http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html</a>>. Amazing how shared ideas like these seem to arise independently at the same time.<br>


<br>I have a use case and an additional requirement, namely that the time index at which each word is spoken in the TTS-generated audio be made available:<br><br><blockquote style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;" class="gmail_quote">


I've been working on a web app which reads text in a web page, highlighting each word as it is read. For this to be possible, a Text-To-Speech API is needed which is able to:<br>(1) generate the speech audio from some text, and<br>


(2) include the time indices for when each of the words in the text is spoken.<br></blockquote><br>I foresee that a TTS API should integrate closely with the HTML5 Audio API. Invoking the API could return a "TTS" object which holds an instance of Audio, whose interface could be used to navigate through the TTS output. For example:<br>


<br><span style="font-family: courier new,monospace;">var tts = new TextToSpeech("Hello, World!");</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">tts.audio.addEventListener("canplaythrough", function(e){</span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;">    //tts.indices == [{startTime:0, endTime:500, text:"Hello"}, {startTime:500, endTime:1000, text:"World"}]</span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;">}, false);</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">tts.read(); // invokes tts.audio.play()</span><br><br>Even cooler would be if the parameter passed to the TextToSpeech constructor could be an Element or Text node, in which case each index would include a DOM Range in addition to the "text" property. A flag could also be set which would cause each of these DOM ranges to be selected as it is read. For example:<br>


<br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">var tts = new TextToSpeech(document.querySelector("article"));<br>tts.selectRangesOnRead = true;</span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;">tts.audio.addEventListener("canplaythrough", function(e){</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">    /*<br>


    tts.indices == [<br>        {startTime:0, endTime:500, text:"Hello", </span><span style="font-family: courier new,monospace;">range:Range</span><span style="font-family: courier new,monospace;">}, <br>        {startTime:500, endTime:1000, text:"World", range:Range}<br>


    ]</span> <br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">    */</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">}, false);</span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;">tts.read();</span><br><br>In addition to the events fired by the Audio API, more events could be fired when reading TTS, such as a "readrange" event whose event object would include the index (startTime, endTime, text, range) for the range currently being spoken. Such functionality would make it trivial to "read along" with the text.<br>
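<br>To illustrate how a "readrange" notification could be derived from the indices, here is a minimal sketch. It assumes the index shape proposed above (startTime and endTime in milliseconds, sorted by startTime). The names findIndexAt and createReadRangeNotifier are hypothetical helpers made up for this example, not part of any existing API:<br>

```javascript
// Hypothetical helpers for illustration only; not part of any browser API.
// Each index is assumed to look like {startTime, endTime, text}, with times
// in milliseconds and the array sorted by startTime.

// Binary search for the index whose [startTime, endTime) contains timeMs.
// Returns null when no word is being spoken at that time.
function findIndexAt(indices, timeMs) {
  let lo = 0;
  let hi = indices.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const entry = indices[mid];
    if (timeMs < entry.startTime) {
      hi = mid - 1;
    } else if (timeMs >= entry.endTime) {
      lo = mid + 1;
    } else {
      return entry;
    }
  }
  return null;
}

// Returns a handler that takes the playback position in seconds and calls
// onReadRange once each time playback enters a different index.
function createReadRangeNotifier(indices, onReadRange) {
  let current = null;
  return function handleTimeUpdate(currentTimeSeconds) {
    const entry = findIndexAt(indices, currentTimeSeconds * 1000);
    if (entry !== current) {
      current = entry;
      if (entry !== null) {
        onReadRange(entry);
      }
    }
  };
}
```

<br>With the proposed (hypothetical) TextToSpeech object, the handler could be driven from the Audio element's existing "timeupdate" event by passing it tts.audio.currentTime, which is reported in seconds. A native "readrange" event would presumably be more precise, since "timeupdate" only fires a few times per second.<br>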


</div><div></div><div><br></div><div>What do you think?<br></div><div>Weston</div>

</div><br><br><div class="gmail_quote">On Thu, Dec 3, 2009 at 4:06 AM, Bjorn Bringert <span dir="ltr"><<a href="mailto:bringert@google.com">bringert@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


<div><div></div><div class="h5">On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking <jonas@sicking.cc> wrote:<br>

> On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert <<a href="mailto:bringert@google.com">bringert@google.com</a>> wrote:<br>

>> I agree that being able to capture and upload audio to a server would<br>

>> be useful for a lot of applications, and it could be used to do speech<br>

>> recognition. However, for a web app developer who just wants to<br>

>> develop an application that uses speech input and/or output, it<br>

>> doesn't seem very convenient, since it requires server-side<br>

>> infrastructure that is very costly to develop and run. A<br>

>> speech-specific API in the browser gives browser implementors the<br>

>> option to use on-device speech services provided by the OS, or<br>

>> server-side speech synthesis/recognition.<br>

><br>

> Again, it would help a lot if you could provide use cases and<br>

> requirements. This helps both with designing an API, as well as<br>

> evaluating if the use cases are common enough that a dedicated API is<br>

> the best solution.<br>

><br>

> / Jonas<br>

<br>

</div></div>I'm mostly thinking about speech web apps for mobile devices. I think<br>

that's where speech makes most sense as an input and output method,<br>

because of the poor keyboards, small screens, and frequent hands/eyes<br>

busy situations (e.g. while driving). Accessibility is the other big<br>

reason for using speech.<br>

<br>

Some ideas for use cases:<br>

<br>

- Search by speaking a query<br>

- Speech-to-speech translation<br>

- Voice Dialing (could open a tel: URI to actually make the call)<br>

- Dialog systems (e.g. the canonical pizza ordering system)<br>

- Lightweight JavaScript browser extensions (e.g. Greasemonkey /<br>

Chrome extensions) for using speech with any web site, e.g., for<br>

accessibility.<br>

<br>

Requirements:<br>

<br>

- Web app developer side:<br>

   - Allows both speech recognition and synthesis.<br>

   - Easy to use API. Makes simple things easy and advanced things possible.<br>

   - Doesn't require web app developer to develop / run his own speech<br>

recognition / synthesis servers.<br>

   - (Natural) language-neutral API.<br>

   - Allows developer-defined application specific grammars / language models.<br>

   - Allows multilingual applications.<br>

   - Allows easy localization of speech apps.<br>

<br>

- Implementor side:<br>

   - Easy enough to implement that it can get wide adoption in browsers.<br>

   - Allows implementor to use either client-side or server-side<br>

recognition and synthesis.<br>

<div><div></div><div class="h5"><br>

--<br>

Bjorn Bringert<br>


</div></div></blockquote></div><br>