[whatwg] PeerConnection, MediaStream, getUserMedia(), and other feedback

Mon Jul 25 22:30:37 PDT 2011

On Thu, 14 Jul 2011, Shwetank Dixit wrote:
> On Thu, 14 Jul 2011 04:09:40 +0530, Ian Hickson <ian at hixie.ch> wrote:
> > > 
> > > Another question is flash. As far as I have seen, there seems to be 
> > > no option to specify whether the camera needs to use flash or not. 
> > > Is this decision left up to the device? (If someone is making an app 
> > > which is just clicking a picture of the person, then it would be 
> > > nice to have the camera use flash in low light conditions).
> >
> > getUserMedia() returns a video stream, so it wouldn't use a flash.
> 
> Wouldn't it make sense to have a provision for flash separately then? I 
> think a lot of apps would like just a picture instead of video, and in 
> those cases, flash would be required. Maybe a separate provision in the 
> spec which defines whether to use flash, and if so, for how many 
> milliseconds. Is that doable?

In response to getUserMedia()? I don't really understand how that would 
work. Could you elaborate? How do you envisage the API working? Maybe a 
concrete example would help.

I'm particularly concerned about two things: preventing hostile sites from 
abusing a flash feature to troll the user, and preventing well-meaning but 
poorly designed sites from using the flash when the user doesn't want it 
to (e.g. when taking a photograph in an area where a flash isn't desired).

On Thu, 14 Jul 2011, timeless wrote:
>
> I'd expect a web app to have no idea about device camera specifications 
> and thus to not be able to properly specify a flash duration. I don't 
> see how such a thing is valuable.
> 
> If a user is in a movie theater, or a museum, it's quite likely they 
> won't notice a web app is forcing a flash. Let the user control flash 
> through a useragent only or host application only mode. I believe the 
> hazards of exposing flash duration outweigh any benefits. The only 
> application class I know of built using control of camera flash is 
> "flash-light", and that's both a hack and not guaranteed to be workable 
> for all possible flash technologies.

Right.

On Fri, 15 Jul 2011, Shwetank Dixit wrote:
> 
> Just like, just allowing the web app to use the camera as it is will not 
> make sense, and presumably, user agents will implement an authorization 
> by the user before the app gains access to the camera (something like 
> 'This application requests access to the camera. Allow for now/Always 
> Allow/Never Allow/Close' just like you do in geolocation right now) ... 
> just like that, you could do it for flash, where the app only gains 
> access to it if the user allows it. If that is the implementation, I do 
> not think there would be many hazards in allowing flash access.

This is quickly going to get frustrating to the user. In general, we'd 
rather not have any such prompts. For example, for video, well-designed 
browsers are likely not going to have a "yes/no" prompt, instead they'll 
just have a prompt that asks the user what camera they want to use. This 
is far less frustrating to the user.

> Apart from helping capture images/video in low light conditions, there 
> are a few other use cases for flash such as the flash light thing you 
> mentioned, as well as a possible S.O.S type app.
>
> I'm fine if the consensus is that the device/user agent will handle the 
> issue of flash by showing some sort of control where the user can click 
> between 'flash on/off/auto'. That will cover *most* of the use cases, 
> which is recording images/video in low light conditions. If so, then it 
> might be good to specify that somewhere in the spec just to make things 
> a bit clearer?

Ok, done.

On Tue, 19 Jul 2011, Per-Erik Brodin wrote:
> 
> Perhaps now that there is no longer any relation to tracks on the media 
> elements we could also change Track to something else, maybe Component. 
> I have had people complaining to me that Track is not really a good name 
> here.

I'm happy to change the name if there's a better one. I'm not sure 
Component is any better than Track though.

> Good. Could we still keep audio and video in separate lists though? It 
> makes it easier to check the number of audio or video components and you 
> can avoid loops that have to check the kind for each iteration if you 
> only want to operate on one media type.

Well in most (almost all?) cases, there'll be at most one audio track and 
at most one video track, which is why I didn't put them in separate lists. 
What use cases did you have in mind where there would be enough tracks 
that it would be better for them to be separate lists?
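
A minimal sketch of selecting one media type from a single combined list. The SketchTrack shape and the tracksOfKind() helper are hypothetical names made up for illustration; they are not the spec's interfaces.

```typescript
// Hypothetical shapes for illustration only; not the spec's interfaces.
type SketchKind = "audio" | "video";

interface SketchTrack {
  kind: SketchKind;
  label: string;
  enabled: boolean;
}

// Select the tracks of one kind from a single combined list.
function tracksOfKind(tracks: SketchTrack[], kind: SketchKind): SketchTrack[] {
  return tracks.filter((t) => t.kind === kind);
}

// The common case discussed above: one audio track and one video track.
const combinedTracks: SketchTrack[] = [
  { kind: "audio", label: "Microphone", enabled: true },
  { kind: "video", label: "Front camera", enabled: true },
];

const videoTracks = tracksOfKind(combinedTracks, "video");
const audioTracks = tracksOfKind(combinedTracks, "audio");
```

With at most one track of each kind, the per-track kind check amounts to a single filter call over a very short list.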

> I also think that it would be easier to construct new MediaStream 
> objects from individual components rather than temporarily disabling the 
> ones you do not want to copy to the new MediaStream object and then 
> re-enabling them again afterwards.

Re-enabling them afterwards would re-include them in the copies, too.

The main use case here is temporarily disabling a video or audio track in 
a video conference. I don't understand how your proposal would work for 
that. Can you elaborate?

> It is also unclear to me what happens to a LocalMediaStream object that 
> is currently being consumed in that case.

Not sure what you mean. Can you elaborate?

> Why should the label be the same as the parent on the newly constructed
> MediaStream object?

The label identifies the source of the media. It's the same source, so, 
same label.

> If you send two MediaStream objects constructed from the same 
> LocalMediaStream over a PeerConnection there needs to be a way to 
> separate them on the receiving side.

What's the use case for sending the same feed twice?

> I also think it is a bit unfortunate that we now have a 'label' property 
> on the track objects that means something else than the 'label' property 
> on MediaStream, perhaps 'description' would be a more suitable name for 
> the former.

In what sense do they mean different things? I don't understand the 
problem here. Can you elaborate?

> > > We prefer having a StreamRecorder that you have to stop in order get 
> > > the recorded data (like the previous one, but with asynchronous Blob 
> > > retrieval) and we do not understand the use cases for the current 
> > > proposal where recording continues until the recorder is garbage 
> > > collected (or the Stream ends) and you always get the data from the 
> > > beginning of the recording. This also has to be tied to application 
> > > quota in some way.
> > 
> > The current design is just the result of needing to define what 
> > happens when you call getRecordedData() twice in a row. Could you 
> > elaborate on what API you think we should have?
> 
> What I am thinking of is something similar to what was proposed in 
> http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2011-March/030921.html 

That doesn't answer the question of what happens if you call stop() twice.

(Also, having to call a method and hook an event so that you can read an 
attribute seems like a rather round-about way of getting data. Is calling 
a method with a callback not simpler?)
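
A minimal sketch of the callback style, assuming a toy recorder whose data is a string standing in for a Blob. SketchRecorder and append() are hypothetical, and the callback here runs synchronously, which a real implementation would not do.

```typescript
// Hypothetical recorder model; a string stands in for the recorded Blob.
class SketchRecorder {
  private chunks: string[] = [];

  // Simulates media arriving while the recording is in progress.
  append(chunk: string): void {
    this.chunks.push(chunk);
  }

  // Callback style: every call hands over everything recorded since the
  // beginning, so calling it twice in a row is well defined.
  getRecordedData(callback: (data: string) => void): void {
    callback(this.chunks.join(""));
  }
}

const sketchRecorder = new SketchRecorder();
const snapshots: string[] = [];

sketchRecorder.append("aa");
sketchRecorder.getRecordedData((data) => snapshots.push(data));
sketchRecorder.append("bb");
sketchRecorder.getRecordedData((data) => snapshots.push(data));
// snapshots now holds "aa" followed by "aabb".
```

The second call simply returns a longer recording; there is no stopped state to define and no event to hook before the data can be read.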

> Preferably quota should be expressed in media time and that is heavily 
> dependent on the format being used and, regardless of any codecs

Quota doesn't seem particularly important here. It's not like you can 
really do lasting damage. It would just be a DOS attack, like creating a 
Web page with an infinite number of 10000x10000 canvases. We can just let 
the "hardware limitation" clause handle it.

> I still think that the format has to be specified somehow.

I continue to agree and continue to intend to wait for implementation 
experience before specifying this. I can't write the spec for this until I 
know what makes sense to spec.

> Perhaps it would be best to push recording to v2 since this does not 
> seem to be the primary use case for people currently showing the most 
> interest in this part of the spec.

Just because you talk more doesn't mean your use cases are more important. :-)

> > > Instead of blob: we would like to use stream: for the Stream URLs so 
> > > that we very early on in the media engine selection can use the 
> > > protocol scheme to determine how the URL will be handled. Blobs are 
> > > typically handled in the same way as other media playback. The 
> > > definition of stream: could be the same as for blob:.
> > 
> > Why can't the UA know which blob: URLs point to streams and which 
> > point to blobs?
> 
> I was not saying that it would not be possible to keep track of which 
> blob: URLs point to blobs and which point to streams, just that we 
> want to avoid doing that in the early stage of the media engine 
> selection. In my opinion a stream is quite the opposite of a blob 
> (unknown, perhaps infinite length vs. fixed length) so when printing the 
> URLs for debugging purposes it would also be much nicer to have two 
> different protocol schemes. If I remember correctly the discussions 
> leading up to the renaming of createBlobURL to createObjectURL assumed 
> that there would be stream: URLs.

You wouldn't be able to remove that logic, since http: URLs would still 
have the same needs. You can have finite and infinite http: resources, 
just like you can have finite and infinite blob: resources. I don't really 
see the problem here. Indeed, with blob:, it's trivial to find out if the 
resource is finite or not; with http: you might not know until the whole 
finite resource is downloaded.

If there is something I'm missing here please do let me know.

> > Actually, the spec doesn't currently say anything happens when a 
> > stream that is being transmitted just ends, either. I guess I should 
> > spec that...
> > 
> > ...ok, now the spec is clear that an ended stream transmits blackness 
> > and silence. Same with if some tracks are disabled. (Blackness only if 
> > there's a video track; silence only if there's an audio track.)
> 
> OK, I guess that could work. In that case you have to manually remove 
> the stream from being sent over a PeerConnection if it unexpectedly 
> ends.

Right. This is desirable; scripts that have not been written to deal with 
unexpected conditions should fail safely, not suddenly find variables are 
changing values in ways that the authors didn't consider.

> Should playing back an ended stream locally in a video element also 
> produce silence and blackness then?

As currently specified, yes.

> We have been assuming network error which has the nice consequence that 
> you can discard ended streams that are only reachable by their stream: 
> URLs even if the URLs are not revoked since trying to play back such a 
> stream: URL is indistinguishable from trying to play back an invalid 
> stream: URL, in that case.

Well by the point a stream: URL can only provide silence and blackness 
there's really not much to keep around, so it's not like it costs much. 
But having them give a network error would be quite tidy, true.

I could go either way on this. On the one hand, it seems good for the 
behaviour to be consistent on both sides of the PeerConnection. On the 
other hand, there's a conceptual neatness to having ended streams act as a 
network error.

Anyone else have a preference?

> > > PeerConnection is an EventTarget but it still uses a callback for 
> > > the signaling messages and this mixture of events and callbacks is a 
> > > bit awkward in my opinion. If you would like to change the function 
> > > that handles signaling messages after calling the constructor you 
> > > would have to wrap a function call inside the callback to the actual 
> > > signal handling function, instead of just (re-)setting an onsignal 
> > > (or whatever) attribute listener (the event could reuse the 
> > > MessageEvent interface).
> > 
> > When would you change the callback?
> 
> If you would like to send the signaling messages peer-to-peer over the 
> data channel, once it is established.

That seems like a disaster waiting to happen. The UDP data channel is 
unreliable, the signaling channel has to be reliable. Worse, the UDP data 
channel might go down at any second, and then the user agent would try to 
re-establish it using the signaling channel.

> > My concern with making the callback an event handler is that it leads to a
> > set of poor failure modes and complications in the API:
> > 
> >   - It's easy to not register a callback, which makes no sense. There's
>     literally never a use for creating a PeerConnection without a signaling
> >     channel, as far as I can tell, so making it easier to create one
> >     without a callback than with seems like a bad design.
> 
> I think this is the case for many APIs.

Just because we made bad APIs before doesn't mean we should do it again!

> For example, creating an EventSource without registering any listener 
> for incoming events equally does not make sense.

Actually, it does. One operation mode for EventSource is to have events 
with different names, each triggering a different event listener.

> >   - It's easy to register multiple callbacks. This equally makes no sense,
> >     and would likely only ever be the source of bugs.
> 
> Yes it would be possible but I am not sure that it would be any easier 
> to make such a mistake than to make any other mistake that would cause 
> the application to malfunction.

We're not comparing how easy it is to make this mistake compared to other 
mistakes, we're comparing how easy it is to make this mistake compared to 
not being _able_ to make this mistake.

The net number of mistakes will be some fraction of the net number of 
possible mistakes. Reduce the second number and you reduce the first.
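
A minimal sketch of the design point, assuming a toy stand-in for PeerConnection. SketchPeerConnection, emitSignal() and the configuration string are hypothetical; only the shape of the constructor matters.

```typescript
type SignalingCallback = (message: string) => void;

// Hypothetical stand-in for PeerConnection.
class SketchPeerConnection {
  // Exactly one signaling callback is supplied up front. It cannot be
  // forgotten, and a second one cannot be registered by accident.
  constructor(
    readonly configuration: string,
    private readonly signalingCallback: SignalingCallback
  ) {}

  // Stand-in for the user agent deciding it has something to send to the peer.
  emitSignal(message: string): void {
    this.signalingCallback(message);
  }
}

const outbox: string[] = [];
const sketchConnection = new SketchPeerConnection("STUN example.net", (message) => {
  outbox.push(message); // a real page would send this over its signaling channel
});

sketchConnection.emitSignal("SDP offer placeholder");
```

Omitting the second constructor argument is a compile-time error in this sketch, whereas an event-handler attribute could silently be left unset or set more than once.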

> >   - It makes getting the data more complicated. Instead of passing the
> >     callback the string to send, we end up passing an object which has on
> >     it one attribute that contains the string.
> 
> Yeah, if getting 'event.data' is more complicated than just getting 
> 'data'.

It's not like the difference between a "Hello World" document and a video 
conferencing app, but it's still more, needless, complexity.

> To get the incoming stream from a StreamEvent (should probably be 
> renamed to MediaStreamEvent for consistency) the author would have to 
> know how to do this anyway.

I'm not saying authors won't have to figure out how to use events; events 
are all over the platform. I'm just saying that if we can make things 
simpler without breaking use cases, as here, we should.

(I fixed the event interface name.)

> > > There is a potential problem in the exchange of SDPs in that glare 
> > > conditions can occur if both peers add streams simultaneously, in 
> > > which case there will be two different outstanding offers that none 
> > > of the peers are allowed to respond to according to the SDP 
> > > offer-answer model. Instead of using one SDP session for all media 
> > > as the specification suggests, we are handling the offer-answer for 
> > > each stream separately to avoid such conditions.
> > 
> > Why isn't this handled by the ICE role conflict processing rules? It 
> > seems like simultaneous ICE restarts would be trivially resolvable by 
> > just following the rules in the ICE spec. Am I missing something?
> 
> This problem is not related to ICE but rather to the SDP offer-answer 
> model which is separate from the ICE processing. The problem is that SDP 
> offer-answer does not allow you to respond to an offer when you have an 
> outstanding offer for the same set of streams.

As far as I can tell, your interpretation is incorrect. This is entirely 
related to ICE, and ICE, as far as I can tell, defines this exact case in 
its role conflict resolution.

The only time this can happen is if you have both ends do an ICE restart 
at exactly the same time. The offer from each ICE agent will be received 
by the other as if it was the response, and thus there will be a role 
conflict and the ICE role conflict resolution process will kick in. No?
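
A minimal sketch of the tie-breaker comparison, written from my reading of the role conflict rules in RFC 5245 and not checked against an implementation. resolveRoleConflict() and ConflictOutcome are hypothetical names, and plain numbers stand in for the 64-bit tie-breaker values. It illustrates only the role conflict rule itself, not whether that rule covers the offer-answer case being debated here.

```typescript
type IceRole = "controlling" | "controlled";

interface ConflictOutcome {
  newRole: IceRole;           // the role this agent holds afterwards
  replyRoleConflict: boolean; // whether it answers with a 487 (Role Conflict) error
}

// Called when an incoming check shows that the peer claims the same role as
// this agent. Real tie-breakers are 64-bit values; numbers are used for brevity.
function resolveRoleConflict(
  myRole: IceRole,
  myTieBreaker: number,
  peerTieBreaker: number
): ConflictOutcome {
  const mineIsLargerOrEqual = myTieBreaker >= peerTieBreaker;
  if (myRole === "controlling") {
    // Both claim controlling: the larger tie-breaker keeps the role and
    // tells the peer to switch; the smaller one switches itself.
    return mineIsLargerOrEqual
      ? { newRole: "controlling", replyRoleConflict: true }
      : { newRole: "controlled", replyRoleConflict: false };
  }
  // Both claim controlled: the larger tie-breaker takes over as controlling;
  // the smaller one keeps its role and tells the peer to switch.
  return mineIsLargerOrEqual
    ? { newRole: "controlling", replyRoleConflict: false }
    : { newRole: "controlled", replyRoleConflict: true };
}

const largerAgent = resolveRoleConflict("controlling", 10, 5);
const smallerAgent = resolveRoleConflict("controlling", 5, 10);
```

In both branches the agent with the larger tie-breaker ends up controlling, so two agents that start out in the same role converge on complementary roles.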

> The way we are avoiding this is by sending SDP fragments (only the SDP 
> lines related to the affected stream) rather than the full SDP each 
> time.

That seems completely incompatible with ICE requirements.

On Wed, 20 Jul 2011, Tommy Widenflycht (ᛏᚮᛘᛘᚤ) wrote:
> On Mon, Jul 18, 2011 at 20:38, Ian Hickson <ian at hixie.ch> wrote:
> > On Mon, 18 Jul 2011, Tommy Widenflycht (ᛏᚮᛘᛘᚤ) wrote:
> > >
> > > I am very confused regarding the below paragraph from the latest 
> > > spec:
> > >
> > > When a track in a MediaStream parent is disabled, any 
> > > MediaStreamTrack objects corresponding to the tracks in any 
> > > MediaStream objects that were created from parent are disassociated 
> > > from any track, and must not be reused for tracks again. If a 
> > > disabled track in a MediaStream parent is re-enabled, from the 
> > > perspective of any MediaStream objects that were created from parent 
> > > it is a new track and thus new MediaStreamTrack objects must be 
> > > created for the tracks that correspond to the re-enabled track.
> > >
> > > After cloning a LocalMediaStream it looks like this:
> > >
> > > LocalMediaStream -> MediaStream1
> > > Track1(E)           Track1(E)
> > > Track2(E)           Track2(E)
> > > Track3(E)           Track3(E)
> > >
> > > and as I interpret the spec it looks like this if Track1 in the 
> > > LocalMediaStream is disabled:
> > >
> > > LocalMediaStream -> MediaStream1
> > > Track1(D)           Track2(E)
> > > Track2(E)           Track3(E)
> > > Track3(E)
> >
> > Correct so far (though I'd avoid the term "cloning" since it's not 
> > quite what's going on here -- the spec uses "forking", which may be 
> > closer though is still not ideal).
> >
> > > So Track1 disappears from the MediaStream1 object and doesn't come 
> > > back even if Track1 in the LMS object is enabled:
> > >
> > > LocalMediaStream -> MediaStream1
> > > Track1(E)           Track2(E)
> > > Track2(E)           Track3(E)
> > > Track3(E)
> >
> > No, it'll create a new track object:
> >
> >  LocalMediaStream -> MediaStream1
> >  Track1(E)           Track4(E)
> >  Track2(E)           Track2(E)
> >  Track3(E)           Track3(E)
> >
> > This is specified in the sentence that starts "If a disabled track in 
> > a MediaStream parent is re-enabled...".
>
> Thanks for the explanation. To me this sounds overly complicated, why 
> not just make it so that disabling a track will override the track 
> settings for forked MediaStreams?

I don't understand what you mean. How would that be different?
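
A minimal sketch of the currently specified forking behaviour from the quoted exchange above, assuming a toy object model. ParentStream, ForkedStream, ParentTrack, ForkTrack and setEnabled() are hypothetical names, not the spec's interfaces.

```typescript
// Hypothetical model of a forked stream's track objects.
class ForkTrack {}

class ParentTrack {
  enabled = true;
}

class ForkedStream {
  // One slot per parent track; null while that parent track is disabled.
  slots: (ForkTrack | null)[];

  constructor(source: ParentStream) {
    this.slots = source.tracks.map((t) => (t.enabled ? new ForkTrack() : null));
  }

  // The track objects currently present in the forked stream.
  get tracks(): ForkTrack[] {
    return this.slots.filter((t): t is ForkTrack => t !== null);
  }
}

class ParentStream {
  readonly forks: ForkedStream[] = [];

  constructor(readonly tracks: ParentTrack[]) {}

  fork(): ForkedStream {
    const forked = new ForkedStream(this);
    this.forks.push(forked);
    return forked;
  }

  setEnabled(index: number, enabled: boolean): void {
    const track = this.tracks[index];
    if (track === undefined || track.enabled === enabled) return;
    track.enabled = enabled;
    for (const forked of this.forks) {
      // Disabling drops the fork's track object for good; re-enabling
      // creates a brand new object and never reuses the old one.
      forked.slots[index] = enabled ? new ForkTrack() : null;
    }
  }
}

const localStream = new ParentStream([new ParentTrack(), new ParentTrack(), new ParentTrack()]);
const forkedStream = localStream.fork();
const originalFirst = forkedStream.tracks[0];

localStream.setEnabled(0, false);
const countWhileDisabled = forkedStream.tracks.length; // two tracks remain

localStream.setEnabled(0, true);
const replacedFirst = forkedStream.tracks[0]; // a new object, not originalFirst
```

After the re-enable the forked stream has three tracks again, but its first track object is a different one from before, matching the Track4 row in the last diagram.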

> This will definitely simplify implementation, and confuse web developers 
> less imho. MSTracks coming and going doesn't feel right. Especially 
> since there are no callbacks/events that a MS changes.

You don't need a callback, since only the author can do this in the first 
place.

> Also some follow-up questions regarding the new TrackLists:
> 
> What should happen when a track fails? Should the entire stream fail, 
> the MSTrack silently be removed or the MSTrack disassociated with the 
> track (and thus becoming a do-nothing object)?

What do you mean by "fails"?

> What should happen when a stream with two or more video tracks is 
> associated to a <video> tag? Just render the first enabled one?

Same as if you had a regular video file with multiple tracks.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'