[whatwg] Speech input element

Mon May 17 08:55:53 PDT 2010

On Mon, May 17, 2010 at 3:00 PM, Olli Pettay <Olli.Pettay at helsinki.fi> wrote:
> On 5/17/10 4:05 PM, Bjorn Bringert wrote:
>>
>> Back in December there was a discussion about web APIs for speech
>> recognition and synthesis that saw a decent amount of interest
>>
>> (http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-December/thread.html#24281).
>> Based on that discussion, we would like to propose a simple API for
>> speech recognition, using a new <input type="speech"> element. An
>> informal spec of the new API, along with some sample apps and use
>> cases can be found at:
>>
>> http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhx&hl=en.
>>
>> It would be very helpful if you could take a look and share your
>> comments. Our next steps will be to implement the current design, get
>> some feedback from web developers, continue to tweak, and seek
>> standardization as soon as it looks mature enough and/or other vendors
>> become interested in implementing it.
>>
>
> After a quick read I, in general, like the proposal.

It's pretty underspecified still, as you can see. Thanks for pointing
out some missing pieces.

> Few comments though.
>
> - What should happen if for example
>  What happens to the events which are fired during that time?
>  Or should recognition stop?

(Looks like half of the first question is missing, so I'm guessing
here) If you are asking about when the web app loses focus (e.g. the
user switches to a different tab or away from the browser), I think
the recognition should be cancelled. I've added this to the spec.

> - What exactly are grammars builtin:dictation and builtin:search?
>  Especially the latter one is not at all clear to me

They are intended to be implementation-dependent large language
models, for dictation (e.g. e-mail writing) and search queries
respectively. I've tried to clarify them a bit in the spec now. There
should perhaps be more of these (e.g. builtin:address), maybe with
some of them optional and mapping to builtin:dictation if not available.
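
As a rough sketch of that fallback idea (TypeScript; resolveGrammar and
the "supported" set are hypothetical names used only for illustration,
and nothing like them is in the spec), an implementation might pick the
grammar it actually uses along these lines:

```typescript
// Assumption: builtin:dictation is the grammar every implementation has.
const FALLBACK_GRAMMAR = "builtin:dictation";

// Hypothetical helper: decide which grammar the recognizer will use.
function resolveGrammar(requested: string, supported: ReadonlySet<string>): string {
  // Non-builtin grammars (e.g. URLs) and supported builtins are used as-is.
  if (!requested.startsWith("builtin:") || supported.has(requested)) {
    return requested;
  }
  // An optional builtin grammar that is not available maps to dictation.
  return FALLBACK_GRAMMAR;
}

// Example: an implementation that only ships the two grammars discussed here.
const exampleSupported = new Set(["builtin:dictation", "builtin:search"]);
console.log(resolveGrammar("builtin:address", exampleSupported));
console.log(resolveGrammar("builtin:search", exampleSupported));
```

With this shape a page can always ask for the most specific grammar and
still get usable (if less accurate) recognition where it is missing.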

> - When does recognitionState change? Before which events?

Thanks, that was very underspecified. I've added a diagram to clarify it.

> - It is not quite clear how SRGS works with <input type="speech">

The grammar specifies the set of utterances that the speech recognizer
should match against. The grammar may be annotated with SISR, which
will be used to populate the 'interpretation' field in ListenResult.

Since grammars may be protected by cookies etc. that are only available
in the browsing session, I think the user agent will have to fetch the
grammar and then pass it to the speech recognizer, rather than the
recognizer accessing it directly.
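
To make that ordering concrete, here is a minimal sketch in TypeScript.
GrammarFetcher, Recognizer and listenWithGrammar are hypothetical names
invented for this example; the proposal does not define any of them, and
a real user agent would do this internally rather than in script.

```typescript
// Hypothetical interfaces for illustration; neither is part of the proposal.
interface GrammarFetcher {
  // Fetches a URL with the page's credentials (cookies etc.).
  fetchText(url: string): Promise<string>;
}

interface Recognizer {
  // Matches audio against an already-fetched grammar document.
  recognize(grammarText: string, audio: Uint8Array): Promise<string[]>;
}

// The user agent fetches the grammar itself, then hands the text to the
// recognizer, so the recognizer never needs the user's cookies.
async function listenWithGrammar(
  fetcher: GrammarFetcher,
  recognizer: Recognizer,
  grammarUrl: string,
  audio: Uint8Array
): Promise<string[]> {
  const grammarText = await fetcher.fetchText(grammarUrl);
  return recognizer.recognize(grammarText, audio);
}
```

The point of the split is only that the recognizer sees grammar text,
never the URL or the session that was needed to retrieve it.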

I'm not sure if any of that answers your question though.

> - I believe there is no need for
>  DOMImplementation.hasFeature("SpeechInput", "1.0")

The intention was that apps could use this to conditionally enable
features that require speech input support. Is there some other
mechanism that should be used instead?
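
For reference, the kind of use I had in mind looks roughly like the
TypeScript sketch below. FeatureProbe and speechInputAvailable are
hypothetical names for this example; only the "SpeechInput" / "1.0"
feature strings come from the proposal.

```typescript
// Minimal shape of the one DOMImplementation method used here, so the
// sketch can be exercised without a browser.
interface FeatureProbe {
  hasFeature(feature: string, version: string): boolean;
}

// Hypothetical helper: true if the implementation reports speech input.
function speechInputAvailable(impl: FeatureProbe): boolean {
  return impl.hasFeature("SpeechInput", "1.0");
}

// Hypothetical app logic: choose which UI to enable.
// In a page, impl would be document.implementation.
function chooseSearchUi(impl: FeatureProbe): "voice" | "text" {
  return speechInputAvailable(impl) ? "voice" : "text";
}
```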

> And I think we really need to define something for TTS.
> Not every web developer has a server for text -> <audio>.

Yes, I agree. We intend to work on that next, but didn't include it in
this proposal since the two are pretty separate features from the
browser's point of view.

-- 
Bjorn Bringert