[whatwg] Speech input element

Tue May 18 01:27:45 PDT 2010

On Mon, May 17, 2010 at 9:23 PM, Olli Pettay <Olli.Pettay at helsinki.fi> wrote:
> On 5/17/10 6:55 PM, Bjorn Bringert wrote:
>
>> (Looks like half of the first question is missing, so I'm guessing
>> here) If you are asking about when the web app loses focus (e.g. the
>> user switches to a different tab or away from the browser), I think
>> the recognition should be cancelled. I've added this to the spec.
>>
>
> Oh, where did the rest of the question go?
>
> I was going to ask about alert()s.
> What happens if alert() pops up while recognition is on?
> Which events should fire and when?

Hmm, good question. I think that either the recognition should be
cancelled, like when the web app loses focus, or it should continue
just as if there was no alert. Are there any browser implementation
reasons to do one or the other?
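To make the two options concrete, here is a minimal sketch of the cancellation policy as a pure state-transition function. Everything in it is hypothetical and not from the draft spec: the state names, the event names and the AlertPolicy switch are all illustrative. It only encodes the rule already added to the spec (focus loss cancels recognition) plus the two open choices for alert().

```typescript
// Hypothetical model of the recognition lifecycle discussed in this thread.
// None of these names come from the draft spec; they are illustrative only.
type RecognitionState = "idle" | "listening" | "cancelled" | "done";

// The two options floated for alert(): cancel, or carry on as if no alert.
type AlertPolicy = "cancel" | "continue";

// Hypothetical page-level events that may interrupt recognition.
type PageEvent = "focuslost" | "alertshown";

function nextState(
  state: RecognitionState,
  event: PageEvent,
  alertPolicy: AlertPolicy,
): RecognitionState {
  // Interruptions only matter while recognition is in progress.
  if (state !== "listening") return state;
  // Per the updated spec text: the web app losing focus cancels recognition.
  if (event === "focuslost") return "cancelled";
  // Still undecided: treat a modal alert like focus loss, or ignore it.
  return alertPolicy === "cancel" ? "cancelled" : "listening";
}
```

Keeping the alert behaviour behind a single switch makes it easy to compare the two options against whatever events an implementation would have to fire in each case.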

>> The grammar specifies the set of utterances that the speech recognizer
>> should match against. The grammar may be annotated with SISR, which
>> will be used to populate the 'interpretation' field in ListenResult.
>
> I know what grammars are :)

Yeah, sorry about my silly reply there; I just wasn't sure exactly
what you were asking.

> What I meant was that it is not very well specified that the result is actually
> put into .value etc.

Yes, good point. The alternatives would be to use either the
'utterance' or the 'interpretation' value from the most likely
recognition result. If the grammar does not contain semantics, those
are identical, so it doesn't matter in that case. If the developer has
added semantics to the grammar, the interpretation is probably more
interesting than the utterance. So my conclusion is that it would make
most sense to store the interpretation in @value. I've updated the
spec with better definitions of @value and @results.
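The rule above (store the interpretation of the most likely result in @value) can be sketched as follows. The 'utterance' and 'interpretation' field names come from this thread; the numeric 'confidence' field used to rank hypotheses, the string type for the interpretation (SISR can produce structured values), and returning an empty string when there are no results are all assumptions made for illustration.

```typescript
// Hypothetical shape of one recognition hypothesis. 'utterance' and
// 'interpretation' follow the thread; 'confidence' is an assumption.
interface ListenResult {
  utterance: string;      // the words the recognizer heard
  interpretation: string; // SISR output; equals the utterance if the grammar has no semantics
  confidence: number;     // assumed ranking score, higher means more likely
}

// Returns what @value would hold: the interpretation of the most likely result.
function valueFromResults(results: ListenResult[]): string {
  if (results.length === 0) return ""; // assumption: no match leaves @value empty
  let best = results[0];
  for (const r of results) {
    // Strict '>' keeps the earliest result when scores tie.
    if (r.confidence > best.confidence) best = r;
  }
  return best.interpretation;
}
```

For a grammar without semantics this returns the utterance of the top hypothesis, so the choice only makes a difference once the developer adds SISR tags.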

> And I'm still not quite sure what builtin:search actually
> is. What kind of grammar would that be? How is that different from
> builtin:dictation?

To be useful, those should probably be large statistical language
models (e.g. n-gram models) trained on different corpora. So
"builtin:dictation" might be trained on a corpus containing e-mails,
SMS messages and news text, and "builtin:search" might be trained on
query strings from a search engine. I've updated the spec to make
"builtin:search" optional, mapping to "builtin:dictation" if not
implemented. The exact language matched by these models would be
implementation dependent, and implementations may choose to be clever
about them, for example by:

- Dynamically tweaking the models for different web apps based on the
user's previous inputs and the text contained in the web app.

- Adding the names of all contacts from the user's address book to the
dictation model.

- Weighting place names based on geographic proximity (in an
implementation that has access to the user's location).
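The fallback described above for the optional "builtin:search" grammar amounts to a small resolution step before the recognizer loads a grammar. This is only a sketch: the resolveGrammar name and the idea of passing in the set of supported built-ins are hypothetical; only the two URIs and the mapping rule come from the spec change described in this message.

```typescript
// The two built-in grammar URIs mentioned in this thread.
const DICTATION = "builtin:dictation";
const SEARCH = "builtin:search";

// Hypothetical helper: decide which grammar the recognizer actually loads.
// 'supported' is an assumed input listing the built-ins this implementation has.
function resolveGrammar(requested: string, supported: ReadonlySet<string>): string {
  // builtin:search is optional; fall back to builtin:dictation when it is missing.
  if (requested === SEARCH && !supported.has(SEARCH)) {
    return DICTATION;
  }
  // Every other URI (including app-supplied grammars) passes through unchanged.
  return requested;
}
```

Because the fallback happens at resolution time, a web app can always request "builtin:search" and still get reasonable behaviour on implementations that only ship a dictation model.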

-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902