[whatwg] Peer-to-peer communication, video conferencing, and related topics (2)

Mon Mar 28 18:00:53 PDT 2011

On Tue, 15 Mar 2011, Lachlan Hunt wrote:
> 
> In chat clients, like Skype, it's common for users to be able to adjust 
> the microphone volume or mute the audio stream, or to enable or disable 
> the video stream, without interupting the call.  However, the 
> GeneratedStream interface only provides a very simple API to pause, 
> resume or stop the entire stream, and not individual tracks within the 
> stream.
> 
> e.g.
> 
> var stream;
> navigator.getUserMedia("audio,video", success);
> 
> function success(s) {
>   stream = s;
>   // ... Code to make P2P connection for video chat
> }
> 
> In this case, stream.pause() will pause both the audio and video 
> streams, whereas the user, for example, may just temporarily want to 
> pause the video stream, leaving the audio enabled.
> 
> While it may be possible for the browser to allow such control entirely 
> from the browser chrome, independently of the page, the page author may 
> wish to provide customised controls for these features.  I believe the 
> API should be adjusted to allow the individual tracks within a stream to 
> be paused or resumed independently of each other, and for there to be 
> some way to adjust or mute the microphone volume.

On Fri, 25 Mar 2011, Per-Erik Brodin wrote:
>
> On 2011-03-22 11:01, Stefan Håkansson LK wrote:
> > On 2011-03-18 05:45, Ian Hickson wrote:
> > > 
> > > All of this except selectively muting audio vs video is currently 
> > > possible in the proposed API.
> > > 
> > > The simplest way to make selective muting possible too would be to 
> > > change how the pause/resume thing works in GeneratedStream, so that 
> > > instead of pause() and resume(), we have individual controls for 
> > > audio and video. Something like:
> > > 
> > >     void muteAudio();
> > >     void resumeAudio();
> > >     readonly attribute boolean audioMuted;
> > >     void muteVideo();
> > >     void resumeViduo();
> > >     readonly attribute boolean videoMuted;
> > > 
> > > Alternatively, we could just have mutable attributes:
> > > 
> > >     attribute boolean audioEnabled;
> > >     attribute boolean videoEnabled;
> > > 
> > > Any opinions on this?
> >
> > We're looking into this and will produce a more elaborate input 
> > related to this.
> 
> Basically we would like to be able to address the Stream components 
> individually and also not limit them to zero or one audio and zero or 
> one video components per Stream. That way we could activate/deactivate 
> them individually and also split out components and combine components 
> from different Stream objects into a new Stream object.
> 
> One good use case is the multi-party video conference where you would 
> like to record the audio from all participants using a StreamRecorder. 
> This would be done by taking the audio component from the local 
> GeneratedStream and combining it with the audio components from the 
> remote streams to form a new Stream object which can then be recorded.
> 
> This could also be a way to handle multiple cameras such as front and 
> back cameras of mobile devices that was mentioned in another thread. 
> When playing a Stream containing several video components, the first 
> active component (if any) would be shown. Active audio components would 
> be mixed.

To address this use case I've taken the audioTracks and videoTracks 
features recently added to HTMLMediaElement and reused them on 
GeneratedStream. They control which of the available sources get used 
when generating the stream.

On Wed, 16 Mar 2011, Robert O'Callahan wrote:
>
> Instead of creating new state signalling and control API for streams, 
> what about the alternative approach of letting <video> and <audio> use 
> sensors as sources, and a way to connect the output of <video> and 
> <audio> to encoders? Then we'd get all the existing state machinery for 
> free. We'd also get sensor input for audio processing (e.g. Mozilla or 
> Chrome's audio APIs), and in-page video preview, and using <canvas> to 
> take snapshots, and more...

I don't really understand how that would work. <video> is an output for a 
video stream, it doesn't generate a video stream. I completely agree that 
we should reuse <video> for playback. I don't really see that <video> gets 
us anything for generation, though.

On Wed, 16 Mar 2011, Lachlan Hunt wrote:
> 
> We can already do in-page video preview with the existing design.
> 
> var v = querySelector("video");
> navigator.getUserMedia("video", function(stream) {
>   v.src = stream;
> });
> 
> From there, taking snapshots with canvas is also possible.

Indeed there's an example of exactly that in the spec (search for the 
example that contains the text "Snapshot Kiosk").

On Wed, 16 Mar 2011, Olli Pettay wrote:
> 
> I think roc did suggest that.
> Perhaps navigator.getUserMedia("audio,video", success, error);
> could return an url to the device in the success callback, and that url
> could be then set to video.src.

Indirectly, it does; you can pass a Stream to URL.getObjectURL(). This 
works the same as Blob objects, in fact.

> Audio (and video) data could be modified before encoding and streaming 
> it using PeerConnection. That way one could for example reduce 
> background noise from the audio stream, or 'crop' the video before 
> sending it. Or if the camera doesn't support grayscale, the web page 
> could convert the colorful video to grayscale in order to save network 
> bandwidth.

I agree that (on the long term) we should support stream filters on 
streams, but I'm not sure I understand <video>'s role in this. Wouldn't it 
be more efficient to have something that takes a Stream on one side and 
outputs a Stream on the other, possibly running some native code or JS in 
the middle? Ideally you could then pass this down to a Worker and have it 
happen off the main thread.

On Thu, 17 Mar 2011, Lachlan Hunt wrote:
> 
> The creation of a URL is unnecessary indirection.  It's easier to avoid 
> creating special URLs entirely, and instead assign the the Stream object 
> directly to video.src.
> 
> e.g.
> 
> navigator.getUserMedia("video", function(stream) {
>   video.src = stream;
> }
> 
> This is then reflected in the src content attribute as 
> "about:streamurl", and is returned upon getting video.src.  This 
> requires that the HTMLMediaElement src property definition needs to be 
> changed from DOMString to any.

As far as this goes, my goal is to reuse whatever machinery we have for 
Blobs. I'm happy to change the way this is specced for Streams, but I do 
think it is important that we be consistent here.

(It seems that reusing URLs here is a lot easier than making everything in 
the platform that accepts a URL also accept an object. I mean, for 
instance, how do you propose to make CSS 'background-image' accept a 
Stream or Blob?)

On Thu, 17 Mar 2011, Olli Pettay wrote:
> 
> Also, if getUserMedia would return just an URL, browser wouldn't need to 
> create any stream object (unless someone then want to stream from 
> <video> to PeerConnection).

Just returning a URL would leave us with no good way to control the 
generated stream, unfortunately.

Also, URLs can leak a lot easier than objects.

On Thu, 17 Mar 2011, Philip JÃ¤genstedt wrote:
> 
> Sure, but instead one would have to mint URLs and keep a mapping between 
> those URLs and the streams that they actually represent. If people copy 
> those URLs around, how long are they supposed to work for?

This is all defined through the use of URL.getObjectURL().

On Thu, 17 Mar 2011, Philip JÃ¤genstedt wrote:
> 
> I wasn't aware of this API, it's in 
> http://www.w3.org/TR/FileAPI/#dfn-createObjectURL for reference.
> 
> That API has an explicit revokeObjectURL to solve the lifetime issue, 
> but there's no such thing for the Stream API.

It's the same API.

On Thu, 17 Mar 2011, Robert O'Callahan wrote:
> 
> In Gecko, we allow seeking within cached segements of streamed video, 
> and we could easily allow that for local devices too --- user-controlled 
> "instant replay".

That's entirely what we want, indeed. However, that seems distinct from 
the stream that we are sending to the other peer.

> So for an HTML video element, the following attributes could all make sense
> for streaming from local devices, IMHO:
> -- videoWidth/videoHeight
> -- width/height (reflected to CSS)
> -- poster (to show a placeholder before camera input becomes available)
> -- controls (in-page controls for mute, start/stop)
> -- src
> -- readyState
> -- currentTime (read and write)
> -- paused
> -- ended (the user turned off the camera)
> -- duration
> -- volume
> -- seeking
> -- seekable
> -- buffered

All of these make sense as a "sink" for a stream, and that's entirely how 
this is specified. You would use <video> to display the local video (a 
GeneratedStream from getUserMedia()) and the remote video (a Stream from a 
PeerConnection). But that doesn't mean it makes sense for the <video> 
element to be the source.

> > But that's not particularly useful for the audio element. It's rare 
> > that the user would want their microphone input to be echoed back to 
> > them via an audio element. In most cases, when a microphone stream is 
> > input into an audio element, the audio element itself would need to be 
> > muted to prevent unwanted and annoying echo or, worse, feedback loops.
> 
> Yes, direct audio output would have to be muted. This could be done 
> automatically when input is coming directly from a local device. 
> (Assuming that using your Web browser as a megaphone is not a valid 
> use-case :-).)

It seems sensible to want to use an Audio object as a sink for a local 
audio source (microphone) which you can then rewind and play back, not 
muted. This seems like exactly the kind of thing authors should be in 
control of.

On Thu, 17 Mar 2011, Lachlan Hunt wrote:
> 
> ----------     -------------------             -----------
> | Camera | --> | GeneratedStream | --+-------> | <video> |
> ----------     -------------------   |         -----------
>                                      |
>                                  ---------     -----------------
>                                  | Codec | --> | Recorded blob |
>                                  ---------     -----------------
>                                      |
>                                      |         ------------------
>                                      +-------> | PeerConnection |
>                                                ------------------

Indeed.

> The stated of the stream, in terms of what gets streamed over P2P or 
> recorded locally, must be controlled at the GeneratedStream and given as 
> input into the codec.  This includes things like controlling the input 
> microphone volume, video height and width, etc.  In particular, the 
> encoded height and width for streaming may differ significantly from the 
> rendered height and width in the local video preview, so this is not 
> something that can be controlled by the video element itself.

Indeed. The remote peer can indeed negotiate some of these settings using 
SDP offer/answer, too.

> > In Gecko, we allow seeking within cached segements of streamed video, 
> > and we could easily allow that for local devices too --- 
> > user-controlled "instant replay".
> 
> We don't buffer any streamed data in our initial device implementation 
> and seeking is not possible.

:-(

On Tue, 15 Mar 2011, jesperg at opera.com wrote:
> 
> I was really looking forward to start playing around with USB MIDI 
> interfaces to control my synth and maybe even do really creative stuff 
> the other way around. Just imagine being able to play on your synth (or 
> any other device with MIDI output) and generate sound or graphics in a 
> <canvas> web application or so!

For manipulation of audio, something more like what the Audio incubator 
group is working on seems more appropriate:

   http://lists.w3.org/Archives/Public/public-xg-audio/

That API could then probably integrate with a MIDI device as a source.

> Or... be able to control other devices using serial connection. Maybe do 
> lirc-alike stuff, using your IR based remote to control Youtube or other 
> HTML5 <video> services, etc.

I'd love to do this (I myself have some RS232-driven hardware). I'm not 
sure it makes sense to use the same API as for video conferencing, though.

My recommendation for people who would like to follow up on 
non-audio/video-related use cases is to follow the steps described in the 
FAQ for handling new use cases:

   http://wiki.whatwg.org/wiki/FAQ#Is_there_a_process_for_adding_new_features_to_a_specification.3F

On Tue, 15 Mar 2011, Rich Tibbett wrote:
> 
> We noticed a number of deficiencies with the way a developer can obtain 
> a GeneratedStream object. Hopefully I can explain those succinctly 
> below.
> 
> A callback-based model fires a single success event. An events-based API 
> allows for ongoing intermediate readyState changes to be fired at web 
> pages following an initial success state change. With an events-based 
> model we would be able to provide ongoing events such as 'disconnected' 
> and, theoretically at least, extend that with events like 'unplugged', 
> 'sleeping', etc.

I'm not sure I follow exactly what you mean here. Could you elaborate on 
which use cases you'd like to address?

> Secondly, getUserMedia is restricted to only handle audio/video streams. 
> In the original proposal there was potential for us to connect and 
> disconnect other device classes, such as USB or RS232 device types.

Indeed; see above for a discussion on this matter.

> Essentially, our proposal is to improve the device bootstrap mechanism
> four-fold:
> 
> 1) Use an events-dispatch model instead of callbacks

It's not clear to me what you mean by callbacks here. The Stream object 
uses the DOM Events model. The only thing that uses callbacks is the 
getUserMedia() object, where success or failure are the only options. This 
is modelled on the geolocation API.

> 2) Allow for future device classes to inherit standard
> connect/disconnect functionality from a standard bootstrap interface
> called 'Device'.

What is the use case or design rationale for this?

> 3) Provide additional generic device state information in the 
> events-dispatch model (a DISCONNECT readyState providing feedback to a 
> web page that the device has been disconnected by the user and/or the 
> connected device has been ripped out of the USB socket).

What is the use case for handling the removal of a microphone differently 
than the user revoking permission for that input device?

> 4) Allow developers to instantiate a particular device class (e.g. 
> UserMedia) with constructor parameters applicable to that device class.

I don't understand what you mean here; could you elaborate?

On Wed, 16 Mar 2011, Lachlan Hunt wrote:
> 
> For me, this event model approach seems more natural and fits with 
> pre-existing design patterns used for other APIs, better than the 
> callback approach does.

As far as I can tell, the only other API that asks for user permission in 
a blocking fashion the way that getUserMedia() does is geolocation, which 
uses a callback model. That was why I used a callback model here.

> The event model has the advantage of being able to scale up to handle 
> more events in the future, such as handling disconnections or the user 
> switching cameras or microphones.

Such events would fire on the GeneratedStream object in the current model. 
getUserMedia() is just a binary check for permission, not an interface to 
the underlying device(s).

> One problem with both models is that they don't easily distinguish 
> between different input devices, which is a problem because both the 
> proposed Device interface and the Stream/GeneratedStream interfaces can 
> potentially represent multiple Devices/Streams (this is the case when 
> "audio,video" is passed as the type).

GeneratedStream now has .audioTracks and .videoTracks to address this.

> This creates a problem when a user, for example, unplugs or revokes 
> permission for one of the devices or streams but not the other, 
> triggering either an error or disconnect event, it's not clear how the 
> script can identify which specific device was disconnected.

Currently there isn't support in the API for only part of the granted 
device permissions being revoked (e.g. revoking just access to the 
microphone). However, if this is something user agents want to support, we 
can definitely add support for it pretty easily by just firing events at 
the GeneratedStream and updating the .videoTracks and .audioTracks lists.

> Finally, the object passed to the error callback/event currently only 
> has a PERMISSION_DENIED error code. It might be worth investigating the 
> need for other codes like PERMISSION_REVOKED, DEVICE_REMOVED, etc. as 
> well, to handle the case where permission was granted, but then the user 
> later changed their mind or unplugged the device.  (It's possible that 
> the proposed ondisconnect event in the event model could be handled as 
> an error event with an appropriate code, though I'm not sure if that's 
> better or worse than separate event.)

Indeed. I made it an object in part for consistency with geolocation and 
in part because that gives us the ability to add more later if we find we 
need more. Currently it's not clear it's really necessary to have more.

On Thu, 24 Mar 2011, Robin Berjon wrote:
> 
> Most notably, some devices might expose ways of controlling them and 
> exposing those on a GeneratedStream seems clunky.

Could you elaborate on "clunky"?

We could rename "GeneratedStream" to "LocalMediaDevice" if that would make 
people feel better about it. It's both really.

> var device;
> navigator.getUserMedia("whatever", function (d) { device = d; });
> 
> Once you have it, there are a couple improvements that can be made over 
> GeneratedStream.
> 
> * It's an EventTarget.

So is GeneratedStream.

> This is primarily for the purpose of listening to devicemotion and 
> deviceorientation events (they currently only target window, but that's 
> not a big deal to change).

Yeah, I think it would make sense to put those events on this object. I 
haven't done it yet, mainly because the DeviceOrientation API doesn't seem 
particularly stable yet.

> This could work with GeneratedStream, but it seems more logical to have 
> events for "I moved the camera" (and possibly others such as "I changed 
> the focal length" or "autofocus acquired at 2.77m") and for "stream 
> paused" on different objects.

Why?

> * It provides an extension point for device control. Say you're 
> streaming from a camera and you want to take a picture. The chances are 
> high that the camera can take a much better picture than the frame you 
> can grab off its view-finding video stream.
> 
> // device is a CameraDevice
> device.captureStill(function (file) {
>   // ... got my picture
> });

What is the use case for this? Is it not handled by <input type=file 
accept=image/*>? If not, why not?

We can definitely add something like the above to GeneratedStream in the 
future, though.

> We might not be there yet and would probably want to wait a little, but 
> there's plenty more that can be added there.
> 
> // silly examples
> device.zoom = 2;
> device.flash = true;
> 
> Again, these could go on GeneratedStream but it seems too conflated. 
> Given that a device exposes a stream, the coding cost is a minimal 
> switch to:
>
> video.src = device.stream;

Why would we want to split the device from the stream? I'm very wary of 
adding more 1:1 object mappings to the platform. They tend to make the API 
very verbose and annoying to use.

> Additionally, I wonder if it wouldn't be useful to make it possible for 
> the getUserMedia callback to return an array of devices in one go. If 
> you're making a 3D movie (or just 3D videoconferencing) you probably 
> want multiple cameras returned at once (alternatively, it could be a 
> single device exposing two streams).

I think we're getting a bit ahead of ourselves here, but there's no reason 
getUserMedia() couldn't be extended in the future to return a 
3DGeneratedStream if passed a "3d" argument, or some such. Or 
alternatively we could define specific "left" and "right" video tracks, or 
some such, exposed on .videoTracks; or we could just expose it as a 3D 
video stream. The latter would have the added bonus of automatically 
working in all the Web apps that had been written for 2D, without them 
having to change at all.

> Likewise if you have a sound setup more advanced than just the one mike. 
> Of course, the user could effect multiple requests and grant access to 
> each device one by one, but UI-wise, it's probably a lot simpler to 
> allow her to do it all at once.

That's already possible in the existing API to some extent, but it's not 
clear to me what the use case is. Video conferencing is something that can 
apply today to many sites. Multitrack recording seems like something that 
people are not really looking for Web apps to solve. Even in the native 
app market it's still a very evolving area.

> Especially considering the following:
> 
>   1. User wants to add a camera, clicks a button that calls getUserMedia()
>   2. Infobar of some kind shows, user picks camera source, checks [always allow]
>   3. User wants to add second camera, clicks the same button: same camera is picked
>   4. Failure

Instead of clicking the same button in the app, it seems the user should 
click a button in the browser chrome to change the permissions.

> Multiple simultaneous inputs isn't science fiction nor is it limited to 
> professional contexts. I could easily want to use both back and front 
> cameras on my phone, one with which to film what's going on around me in 
> a documentary, the other to insert a small view of myself as I comment 
> on what I'm seeing. 3D home videos are probably not that far around the 
> corner (yes, it scares me too). It's likely that laptops will ship with 
> arrays of mikes in order to better figure out where you're talking from 
> (spatially) and eliminate all other sources — accessing would be sweet.
> 
> I don't much care about the syntax, but I guess we could be looking at 
> something like
> 
> navigator.getUserMedia("video multiple", function (devices) {
>   // ... show each different view
> });

This is supported with the .videoTracks feature now, though without 
change notification at the moment.

On Thu, 17 Mar 2011, Stefan Håkansson LK wrote:
> 
> It is not totally clear how the UI would work for granting access to use 
> mics and cams, and furthermore how it would be possible to select 
> several cameras (many terminals have both a front and a rear view 
> camera) and then "tell" the web app which camera is which.

There's no current way to tell the app which camera is selected. I think 
what we might want to do is define some default labels for the video 
tracks, but I'm interested in implementation experience on that front 
before I spec that further.

> The spec says that the user-agent-specific prompt may allow user to 
> select pre-recorded media. In that case, shouldn't it be possible to 
> also create a Stream from a File/Blob object, containing media data?

What's the use case?

> Shouldn't the "ended" event be call simply "end" to match the present 
> tempus of the other events ("pause", "play")?

Yes, but 'ended' is what we are using in <video>, so I stuck with it for 
consistency. (My bad when I was designing that API.)

> The green box describes an attribute called paused which is not present 
> in the Stream idls.

That is gone now, but it was on GeneratedStream.

> The asynchronous StreamRecorder.getRecordedData should be void.

Fixed, thanks.

> Further, the StreamRecorder API doesn't seem to support stopping a 
> recording without stopping the entire Stream.

StreamRecorder doesn't supporting stopping explicitly at all, but the uer 
agent can stop recording whenever the object is GC'ed. In practice we 
couldn't rely on the author saying when to stop anyway, so this is what 
browsers would have to implement regardless.

> a) We interpret the spec as "addStream" triggers a new ICE procedure 
> that sets up a new "channel" (5-tuple) for the stream. Correct?

It invokes the ICE feature that adds a media stream ("9.3.1.2.  New Media 
Stream"), or at lesat that's my intent. It's hard to reference ICE 
sometimes because it doesn't provide very explicit hyperlinkable hooks.

> b) Also related to addStream: it is not clear if the SDP (to be 
> transmitted to the other end at getting the callback) contains all 
> descriptions for all streams set up (minus the removed ones) so far or 
> just the new one. The former would simplify SIP interop (re-invite).

That's an ICE issue. This does whatever ICE says to do. (Since ICE is what 
SIP uses, it seems that means you're bound to be compatible assuming you 
have a PeerConnection/SIP gateway for the signaling channel).

> c) addStream is uni-directional, so in our interpretation the sdp-data 
> transmitted from sender to receiver would indicate "send-only". We guess 
> (as mentioned above) from the description that a new ICE procedure would 
> be deployed to set up a "channel" that is used for RTP (send direction) 
> and RTCP (feedback). In many cases the service calls for symmetric 
> flows, the two web apps would do "addStream" more or less 
> simultaneously. Ideally, the "channel" (5-tuple) should be re-used. I am 
> not sure how this can be accomplished.

The spec as written now does everything with "sendonly" streams. I'm open 
to changing that, but I don't really see how the API would work with 
"sendrecv" media streams, which is why I did it this way.

> d) As you already mention, it is not defined how the application could 
> influence the media format selected. It could be discussed to what level 
> this should be possible. But the very least should be some kind of 
> connection between the rendering (e.g. large area at screen, small area, 
> mono, 5.1) and the selected format.

That's basically up to the browser, currently. I'm open to adding some 
more control here, but I think it's the kind of thing for which 
implementation experience would be really useful, so I haven't added 
anything yet.

> Unclear how to protect the "PeerConnection data UDP media stream" to be 
> used by "send()" messages (sent with "send") and streams. dTLS? SRTP? 
> How to set up and exchange keys?

The spec defines all this already, no?

> Unclear how to protect the new "channel" set up by an ICE procedure at 
> "addStream". dTLS? SRTP? How to set up and exchange keys?

The spec doesn't define what codec (H.263, WebM, whatever) or network 
transport (e.g. RTP) and encryption protocols to use. I'm happy to specify 
particular codec, transport, or encryption mechanisms if there are any 
codec, transport, or encryption mechanisms that everyone is going to 
implement.

> It is stated that the data size can be up to 65467 bytes in "send()". 
> Our network guys tell us that this is unrealistic to get over such big 
> chunks using UDP.

Is that true? I thought they'd just get fragmented at the IP level, but 
would still make it through eventually, am I wrong?

Obviously you want to avoid fragmentation too if possible, but limiting 
all packets to a few bytes seems a bit extreme...

> The StreamEvent has a function called initCloseEvent.

Fixed, thanks.

On Thu, 17 Mar 2011, Glenn Maynard wrote:
>
> PeerConnection defines packet encryption, but it uses AES-128-CTR 
> without actually defining the counter.  It also generates a new AES key 
> for each packet.  A major point of using CTR is to not have to do that; 
> you have a single key and vary the counter.
> 
> The inputs to AES-128-CTR are a key, a counter and a message.  A single 
> key is used for the whole connection[1].

This is UDP, there is no connection. Each packet is independent.

> Each counter value can only be used once.  A nonce isn't created for 
> each packet; only once for the entire connection, as part of the key.

> The mechanism I'd recommend is: [...]

This proposal removes the payload type signature, which seems like an 
unrelated concern and would be a bad change since it removes an extension 
point, so I haven't removed this.

It also introduces a predictable set of bytes in each packet (the 
counter, which can be predicted because it increments monotonically with 
each packet). This fails to achieve the goal of making the packet payload 
completely random to a non-PeerConnection observer.

> The magic PeerConnection "salt" (DB 68 B5 FD 17 0E 15 77 56 AF 7A 3A 1A 
> 57 75 02) seems unnecessary, replaced with the connection nonce, but 
> could still be appended to the connection key if desired.

It's needed to make the data random even when interpreted by the 
implementation of another protocol that happens to have the same 
mechanism. By having a protocol-specific salt, we can ensure that 
different protocols that use this scheme can never be attacked either.

> There should also be a mechanism to support new hashes and ciphers in 
> the future.  There's no need to actually specify other hashes at this 
> point (except perhaps for testing purposes), just forward-compatibility 
> for when AES and/or SHA-1 need to be replaced.

This is already possible over the signaling channel (we can just invent a 
new attribute when we need it).

> This protocol is reinventing the wheel, and I'm sure a cryptography
> expert will find many more issues.  Can anyone more familiar with DTLS
> say whether it fits here?

DTLS is inappropriate here because it does the handshake over the 
connection, which is unnecessary in this case. Also, if I understand it 
correctly, it uses TLS-style certificates which doesn't really make sense 
when you're communicating with another user agent (as opposed to a 
server).

On Thu, 17 Mar 2011, Adam Barth wrote:
> 
> Theoretically, we could just use an initial counter value of zero for 
> each message, but, as you point out, that would require re-keying AES 
> for each message.  Rather than the scheme you propose, it's probably 
> easier to just use the nonce as the initial counter value.  The chance 
> of randomly choosing the same nonce twice is essentially zero.
> 
> Specifically, in 
> <http://www.whatwg.org/specs/web-apps/current-work/#the-data-stream>:
> 
> - 3. Let key be the first 16 bytes of the HMAC-SHA1 of the
> concatenation of the 16 nonce bytes, the 16 data UDP media stream salt
> bytes, and the 16 ice-key bytes. [HMAC] [SHA1]
> + 3. Let key be the first 16 bytes of the HMAC-SHA1 of the
> concatenation of the 16 data UDP media stream salt bytes and the 16
> ice-key bytes. [HMAC] [SHA1]
> 
> - 5. Let masked message be the result of encrypting typed raw message
> using AES-128-CTR keyed with key. [AES128CTR]
> + 5. Let masked message be the result of encrypting typed raw message
> using AES-128-CTR keyed with key and using the 16 nonce bytes as the
> initial counter value. [AES128CTR]

That makes sense. Done.

On Thu, 17 Mar 2011, Glenn Maynard wrote:
> 
> The issue isn't just making sure the sender doesn't reuse a counter 
> (though that's also critical with CTR).  The issue is replay attacks: 
> making sure an attacker can't replay a previously-sent packet later on.
>
> By using an increasing counter, the anti-replay algorithm from DTLS and 
> IPsec ESP can be used.  It's very straightforward; see 
> http://www.ietf.org/rfc/rfc4347 section 4.1.2.5, which explains it 
> better than I can.  This requires an increasing sequence number--this 
> algorithm won't work if the counter is a random value.

On Thu, 17 Mar 2011, Adam Barth wrote:
> 
> Sure.  That's fine.  If you like, we can XOR a monotonically
> increasing value with the nonce to provide the initial counter value.

On Thu, 17 Mar 2011, Glenn Maynard wrote:
>
> Do you mean including both a random 16-byte nonce *and* a (say) 6-byte 
> sequence number in each packet?

We wouldn't be able to do that since the sequence number isn't random in 
this situation.

If we want to prevent replay attacks, we're better off doing it by 
putting a packet identifier inside the packet data itself, IMHO. No need 
to make it part of the masking.

I've added a sequence number inside the data, and made out-of-order 
messages get discarded. I'm open to preserving out-of-order messages with 
some sort of receive window, if someone can make a compelling argument for 
what the window should be (either in terms of time or number of packets or 
both, possibly as a function of some other metric).

On Fri, 18 Mar 2011, Glenn Maynard wrote:
> On Thu, Mar 17, 2011 at 9:28 PM, Adam Barth <w3c at adambarth.com> wrote:
> > 
> > So, the salt and the nonce play different roles.  The salt is to make 
> > sure the message appears random if you haven't read the spec (and so 
> > don't know the salt).  The nonce is to prevent the attacker from 
> > crafting plaintexts that encrypt to a chosen ciphertext, even when the 
> > attacker sees both sides of the connection.  Picking a new nonce for 
> > each message means that the attack cannot choose the bytes sent on the 
> > wire.  The nonce can be communicated in-band, just like the IV for CBC 
> > mode.
> 
> If you can send messages to an arbitrary IP address and port, then this 
> definitely matters: you don't want people to be able to send packets 
> that look like DNS responses to arbitrary ports, for example.  However, 
> here the communication is negotiated over STUN/TURN.  The protocol 
> should have ensured that the port you're talking to is actually 
> expecting to receive data using this protocol, and isn't, say, a DNS 
> server.  You shouldn't be able to send data at all except to a peer that 
> agreed to receive data on the port.
> 
> It's possible that ICE doesn't actually negotiate this securely, since 
> the STUN server itself is untrusted.  Do you (or anyone else) know if 
> STUN negotiation is secure under these circumstances?  Or do you think 
> it doesn't matter?

It's defense-in-depth: it means we can introduce this protocol without 
first guaranteeing that ICE can't be tricked, because even if ICE is 
tricked somehow, you still can do nothing more than send a stream of 
random bytes to your victim.

> I don't mean to harp on this, but an additional 16 bytes of nonce per 
> packet is significant for small payloads, so if it's necessary I'd like 
> to understand why.

It's not _that_ expensive.

On Wed, 23 Mar 2011, Harald Alvestrand wrote:
> 
> The potential attack we can't avoid is that a hostile webapp, possibly 
> with the help of a hostile STUN server, can cause an ICE handshake 
> request to be sent to an UDP IP+port of their choice. The browser can 
> rate-limit such attacks easily, and may implement a port-number 
> blocklist if that seems appropriate (not sending to port 53 seems 
> reasonable).
> 
> That seems like a risk that's not unreasonable to accept, given that 
> we've survived having the same problem for HTTP links since day one of 
> the Web (any web page can dupe a client into launching a TCP session to 
> any IP:port and sending "GET /<ASCII string of their choice>" to it).

On Wed, 23 Mar 2011, Matthew Kaufman wrote:
>
> STUN connectivity check packets are already carefully crafted (with a 
> very long initial magic number) to *not* look like anything else (SNMP 
> queries, DNS queries, etc.) and so sending them at a limited rate to 
> arbitrary addresses should be safe.

That's good to hear.

On Wed, 23 Mar 2011, Glenn Maynard wrote:
> 
> From a *cursory* (an hour or so) examination of the ICE and STUN 
> protocols, it appears that even if the web server, STUN/TURN server(s) 
> and a remote peer are hostile, it should not be possible to convince a 
> user's browser (via its ICE agent) to send packets to an arbitrary IP 
> and port.  It should only be possible to send packets to an IP which has 
> handshaked a port via ICE.

That is my conclusion too, for what it's worth.

> *If* that's accurate, does that remove the masking requirement? 16 bytes 
> per packet is significant overhead to pay if it's not needed.

Why do you consider 16 bytes expensive?

On Thu, 24 Mar 2011, Adam Barth wrote:
> 
> Our experience with WebSockets indicates that masking is still important 
> even when communicating between the browser and an attacker-controlled 
> server.  The problem is that intermediaries attempt to "sniff" the 
> protocol by looking at the bytes on the wire. For example, one could 
> easily imagine an intermediary attempting to do "helpful" things to 
> transiting UDP packets that look like DNS requests or responses.  
> Rather than play whack-a-mole with these possibilities, we're better off 
> building a protocol that's secure by design.

Indeed.

On Thu, 24 Mar 2011, Matthew Kaufman wrote:
> 
> That goal is incompatible with legacy interoperability.

There is no legacy when it comes to UDP data media streams. This is a new 
protocol, no existing servers implement it.

> It is also probably unnecessary in the case where we use real encryption 
> (DTLS / DTLS-SRTP) for the media flows.

It doesn't affect the media flows. The media flows should keep using 
whatever mechanisms already used for encrypting them.

On Thu, 24 Mar 2011, Harald Alvestrand wrote:
>
> We know [that some intemediaries sniff the protocol]. Some of them are 
> doing totally broken things (for instance looking for the bit pattern 
> corresponding to 10.0.0.1 and changing it to a NAT's external address 
> without regard for context - which is the excuse for some of the more 
> baroque constructs of the STUN protocol).
> 
> There is also rumoured to be devices that look for packet streams with 
> regular 20 ms spacing, and block them in an attempt to prevent people 
> from using nonapproved VoIP devices.
> 
> At some point, we have to declare that there is breakage introduced by 
> other people's incompetence where we accept that failure will result a 
> certain percentage of the time until those devices are replaced.

It's probably reasonable to reach that conclusion when the workaround is 
worse than the breakage (e.g. with the 20ms spacing thing -- the 
workaround would introduce artefacts into the communication that might be 
worse than simply not using this channel at all). However, given the ease 
with which we can mask these game data packets, it seems we haven't 
reached that point yet with this particular subfeature.

> I believe that the STUN XOR-ing of addresses (RFC 5389 section 15.2) was 
> an example of going too far (we should have detected the brokenness and 
> signalled it rather than routing around it; we traded a clear "doesn't 
> work because of this kind of bogosity" function for a "will corrupt a 
> known percentage of your traffic" function.... but I digress).

Actually to me that seems like a pretty neat solution and a clear example 
of something where the minor pain of the solution is better than having 
the breakage.

> There's a cost to the complexity we're imposing too.

The cost seems minimal here, but I've been wrong before!

> I would like to get the facts straight and be able to think in terms of 
> cost/benefit, rather than accepting blanket statements of requirement.

Absolutely.

On Thu, 24 Mar 2011, Glenn Maynard wrote:
> 
> It's expensive resilience: 16 bytes of added overhead for every 
> datagram. That's overhead added to every PeerConnection datagram 
> protocol, in order to help hide problems in something catastrophically 
> broken and inherently insecure.

This is only 16 bytes added to the data channel, not to every protocol. 
For example, how media is sent is an issue for the media transport 
protocols (like RTP), and the 16 byte nonce mechanic described for the UDP 
data media stream doesn't apply there.

On Wed, 23 Mar 2011, Harald Alvestrand wrote:
>
> Is there really an advantage to not using SRTP and reusing the RTP 
> format for the data messages?

Could you elaborate on how (S)RTP would be used for this? I'm all in 
favour of defering as much of this to existing protocols as possible, but 
RTP seemed like massive overkill for sending game status packets.

On Wed, 23 Mar 2011, Matthew Kaufman wrote:
> 
> I'd go one further... why not DTLS-SRTP for the media and DTLS with some 
> other header shim for the data messages?

The spec doesn't say what should happen for the media; that's left up to 
the UAs to negotiate via SDP offer/answer (as done by ICE). Regarding DTLS 
around a shim for the data messages, DTLS seems inappropriate for the 
reasons discussed earlier in this reply.

> In particular, there are significant security advantages to end-to-end 
> keying rather than transmitting keys over the signaling channel.

Could you elaborate on these?

On Thu, 17 Mar 2011, Glenn Maynard wrote:
> 
> The particulars of the AES-128-CTR algorithm should be defined--the NIST 
> reference only defines AES itself, not the CTR mode.  It also needs to 
> specify a padding method, eg. PKCS7 or ANSI X.923, to pad to AES's block 
> size of 16 bytes.

On Fri, 18 Mar 2011, Glenn Maynard wrote:
> 
> Actually, I was wrong about padding: it's a CBC thing, CTR doesn't need 
> it. With CTR, the length of the ciphertext determines the length of the 
> plaintext directly.

So just to confirm, there's nothing to add for padding?

On Thu, 17 Mar 2011, Glenn Maynard wrote:
>
> A hash should also be included in each packet, to prevent semi-random 
> tampering with packets on the wire.

On Thu, 17 Mar 2011, Adam Barth wrote:
> 
> <http://www.w3.org/Bugs/Public/show_bug.cgi?id=12316> is the bug on
> file about that.  Rather than MACing the plaintext, as you suggest, we
> should encrypt-then-mac, as recommended by this classic paper
> <http://cseweb.ucsd.edu/~mihir/papers/oem.pdf>.

I've added a hash for integrity checking.

These data packets now consist of:

   IPv4 or IPv6 header (20 or 40 bytes)
   UDP header (8 bytes)
   16 byte hash
   16 byte nonce
   8 byte sequence number
   4 byte payload description for future expansion

On Fri, 18 Mar 2011, Lachlan Hunt wrote:
> > 
> > In getUserMedia() the input is extensible; we could definitely add 
> > "prefer-user-view" or "prefer-environment-view" flags to the method 
> > (with better names, hopefully, but consider that 'rear' and 'front' 
> > are misleading terms -- the front camera on a DSLR faces outward from 
> > the user, the front camera on a mobile phone faces toward the user). 
> > The user still has to OK the use of the device, though, so maybe it 
> > should just be left up to the user to pick the camera? They'll need to 
> > be able to switch it on the fly, too, which again argues to make this 
> > a UA feature.
> 
> We could just add flags to the options string like this:
> 
> "video;view=user, audio" or "video;view=environment, audio"

That seems a bit complicated to parse. Instead I've just gone with having 
a space-separated list of tokens inside the comma-separated list of 
tokens, so the above examples would be "video user, audio" and "video 
environment, audio" respectively.

> It's worth pointing out that The HTML Media Capture draft from the DAP 
> WG uses the terms "camera" and "camcorder" for this purpose, but I find 
> these terms to be very ambiguous and inappropriate, and so we should not 
> use them here.
> 
> http://dev.w3.org/2009/dap/camera/

Pity that they didn't use better names. I agree that those names aren't 
good enough to warrant reuse here.

> > Similarly for exposing the kind of stream: we could add to 
> > GeneratedStream an attribute that reports this kind of thing. What is 
> > the most useful way of exposing this information?
> 
> I'm not entirely clear about what the use cases are for knowing if the 
> camera is either user-view or environment-view.  It seems the more 
> useful information to know is the orientation of the camera.  If the 
> user switches cameras, that could also be handled by firing orientation 
> events.

Agreed.

> > > There are some use cases for which it would be useful to know the 
> > > precise orientation of the camera, such as augmented reality 
> > > applications.  The camera orientation may be independent of the 
> > > device's orientation, and so the existing device orientation API may 
> > > not be sufficient.
> > 
> > It seems like the best way to extend this would be to have the Device 
> > Orientation API apply to GeneratedStream objects, either by just 
> > having the events also fire on GeneratedStream objects, or by having 
> > the API be based on a pull model rather than a push model and exposing 
> > an object on GeneratedStream objects as well as Window objects.
> 
> This could work.  But it would make more sense if there were an object 
> representing the device itself, as in Rich's proposal, and for the 
> events to be fired on that object instead of the stream.

Would renaming GeneratedStream address this? I don't really think it makes 
sense to have two objects that always have a 1:1 mapping. I guess in 
theory each GeneratedStream could have multiple devices attached (a camera 
and a microphone, in the simple case) but that just seems excessively 
complicated...

> > On Mon, 24 Jan 2011, Anne van Kesteren wrote:
> > > 
> > > There is a plan of allowing direct assigning to IDL attributes besides
> > > creating URLs.
> > > 
> > > I.e. being able to do:
> > > 
> > >   audio.src = blob
> > > 
> > > (The src content attribute would then be something like
> > > "about:objecturl".)
> > > 
> > > I am not sure if that API should work differently from creating URLs and
> > > assigning those, but we could consider it.
> > 
> > Could you elaborate on this plan?
> 
> This is basically what Philip and I were discussing in the other thread
> yesterday, where we avoid the unnecessary overhead of creating a magic URL,
> and instead just assign the object directly to the src property. This lets the
> implementation handle all the magic transparently in the background, without
> bothering to expose a URLs string to the author.
> 
> This is what we had implemented in our experimental implementation of the
> <device> element, and now getUserMedia.
> 
> i.e.
> 
> <video></video>
> <script>
> var v = document.querySelector("video");
> navigator.getUserMedia("video", function(stream) {
>   v.src = stream;
>   v.play();
> });
> </script>
> 
> The getter for v.src then returns "about:streamurl".
> 
> My understanding is that we don't really want to have to implement the
> create/revokeObjectURL() methods for this.

I strongly recommend taking this up with the WebApps group. I think it 
would be far better for us to be consistent throughout than for the stream 
stuff to be different, especially over something like this.

> > On Wed, 16 Feb 2011, Anne van Kesteren wrote:
> > > This is just a thought. Instead of acquiring a Stream object 
> > > asynchronously there always is one available showing transparent 
> > > black or some such. E.g. navigator.cameraStream. It also inherits 
> > > from EventTarget. Then on the Stream object you have methods to 
> > > request camera access which triggers some asynchronous UI. Once 
> > > granted an appropriately named event is dispatched on Stream 
> > > indicating you now have access to an actual stream. When the user 
> > > decides it is enough and turns of the camera (or something else 
> > > happens) some other appropriately named event is dispatched on 
> > > Stream again turning it transparent black again.
> > 
> > This is a very interesting idea.
> 
> This suggests that there would be a separate property available for the 
> microphone, and any other input device.  This differs from the existing 
> spec, which allowed a single stream to represent both audio and video.

My assumption is that if we did this we would implement it by just having 
getUserMedia() always return a stream straight away, not by deing quite 
what Anne describes -- what Anne describes would limit the user to 
exposing only one device of each type.

> > On Mon, 14 Mar 2011, Lachlan Hunt wrote:
> > > The API includes both readystatechange event, as well as independent 
> > > events for play, paused and ended.  This redundancy is unnecessary. 
> > > This is also inconsistent with the design of the HTMLMediaElement 
> > > API, which does not include a readystatechange event in favour on 
> > > separate events only.
> > 
> > I've dropped readystatechange.
> > 
> > I expect to drop play and pause events if we move to the model 
> > described above that pauses and resumes audio and video separately.
> 
> It may still be useful to have events for this, if the event object had 
> a property that indicated which type of stream it applied to, or if 
> there were separate objects for both the audio and video streams.

Separate objects seems awkward, especially for, e.g., video conferencing.

On Fri, 18 Mar 2011, Olli Pettay wrote:
> 
> And I was arguing that we could avoid creating the probably somewhat 
> heavy stream object if we could just assign the url, or perhaps some 
> DOMURL object to video/audio.src.

I don't really see why a Stream object would be heavy. It's just a very 
light wrapper around what has to exist in the background anyway, no?

On Tue, 22 Mar 2011, Stefan Håkansson LK wrote:
>
> We've since produced an updated use case doc: 
> <http://www.ietf.org/id/draft-holmberg-rtcweb-ucreqs-01.txt>

Are there any use cases you feel are not handled?

> > > !The web application must be able to    !If the video is going to be displayed !
> > > !define the media format to be used for !in a large window, use higher bit-    !
> > > !the streams sent to a peer.            !rate/resolution. Should media settings!
> > > !                                       !be allowed to be changed during a     !
> > > !                                       !session (at e.g. window resize)?      !
> > 
> > Shouldn't this be automatic and renegotiated dynamically via SDP 
> > offer/answer?
>
> Yes, this should be (re)negotiated via SDP, but what is unclear is how 
> the SDP is populated based on the application's preferences.

Why would the Web application have any say on this? Surely the user agent 
is in a better position to know what to negotiate, since it will be doing 
the encoding and decoding itself.

> > > !Streams being transmitted must be      !Do not starve other traffic (e.g. on  !
> > > !subject to rate control                !ADSL link)                            !
> > 
> > Not sure whether this requires any thing special. Could you elaborate?
>
> What I am after is that the RTP/UDP streams sent from one UA to the 
> other must have some rate adaptation implemented. HTTP uses TCP 
> transport, and TCP reduces the send rate when a packet does not arrive 
> (so that flows share the available throughput in a fair way when there 
> is a bottleneck). For UDP there is no such mechanism, so unless 
> something is added in the RTP implementation it could starve other 
> traffic. I don't think it should be visible in the API though, it is a 
> requirment on the implemenation in the UA.

Ok. This seems like an issue for RTP, not the API, if it is a spec issue 
at all (as opposed to just an implementation detail as you suggest above).

> > > !The web application must be made aware !To be able to inform user and take   !
> > > !of when streams from a peer are no     !action (one of the peers still has   !
> > > !longer received                        !connection with the server)          !
> > > 
> > > !The browser must detect when no streams!                                     !
> > > !are received from a peer               !                                     !
> > 
> > These aren't really yet supported in the API, but I intend for us to 
> > add this kind of thing at the same time sa we add similar metrics to 
> > <video> and <audio>. To do this, though, it would really help to have 
> > a better idea what the requirements are. What information should be 
> > available? "Packets received per second" (and "sent", maybe) seems 
> > like an obvious one, but what other information can we collect?
>
> I think more studies are required to answer this one.

Any advice you may have in the future on this would definitely be welcome.

On Tue, 22 Mar 2011, Harald Alvestrand wrote:
> >
> >   * locally-generated streams can be paused and resumed.
>
> I believe this property should be moved up to the "stream" level (which 
> I prefer to call "StreamSource", because I think we also need an 
> interface named "StreamSink").

This is now on the GeneratedStream object's audioTracks and videoTracks 
objects.

> I also believe that the recording interface should be removed from this 
> part of the specification; there should be no requirement that all 
> streams be recordable.

Recording of streams is needed for some use cases unrelated to video 
conferencing, such as recording messages.

> The streams should be regarded as a control surface, not as a data channel; in
> many cases, the question of "what is the format of the stream at this point"
> is literally unanswerable; it may be represented as hardware states, memory
> buffers, byte streams, or something completely different.

Agreed.

> Recording any of these requires much more specification than just 
> "record here".

Could you elaborate on what else needs specifying?

> >   * the ConnectionPeer interface has been replaced with a PeerConnection
> >     interface that interacts directly with ICE and its dependencies.
>
> I disagree with a number of aspects of this interface. In particular, I 
> believe the relationship between SDP and ICE is fundamentally misstated; 
> it is possible, and often desirable, to use ICE without using SDP; there 
> are other ways of encoding the information we need to pass.

Certainly, but for compatibility with SIP it seems easiest to just use SDP 
as ICE uses it, unmodified. One can then translate the SDP to other forms 
in a gateway if it is necessary to communicate with other ICE stacks that 
use a different format for the SDP data.