[whatwg] PeerConnection feedback

Ian Hickson ian at hixie.ch
Thu Apr 21 15:27:15 PDT 2011

On Mon, 11 Apr 2011, Justin Uberti wrote:
> On Mon, Apr 11, 2011 at 7:09 PM, Ian Hickson <ian at hixie.ch> wrote:
> > >
> > > This has made UDP packets larger than the MTU pretty useless.
> >
> > So I guess the question is do we want to limit the input to a fixed 
> > value that is the lowest used MTU (576 bytes per IPv4), or dynamically 
> > and regularly determine what the lowest possible MTU is?
> >
> > The former has a major advantage: if an application works in one 
> > environment, you know it'll work elsewhere, because the maximum packet 
> > size won't change. This is a serious concern on the Web, where authors 
> > tend to do limited testing and thus often fail to handle rare edge 
> > cases well.
> >
> > The latter has a major disadvantage: the path MTU might change, 
> > meaning we might start dropping data if we don't keep trying to 
> > determine the Path MTU. Also, it's really hard to determine the Path 
> > MTU in practice.
> >
> > For now I've gone with the IPv4 "minimum maximum" of 576 minus 
> > overhead, leaving 504 bytes for user data per packet. It seems small, 
> > but I don't know how much data people normally send along these 
> > low-latency unreliable channels.
> >
> > However, if people want to instead have the minimum be dynamically 
> > determined, I'm open to that too. I think the best way to approach 
> > that would be to have UAs implement it as an experimental extension at 
> > first, and for us to get implementation experience on how well it 
> > works. If anyone is interested in doing that I'm happy to work with 
> > them to work out a way to do this that doesn't interfere with UAs that 
> > don't yet implement that extension.
> In practice, applications assume that the minimum MTU is 1280 (the 
> minimum IPv6 MTU), and limit payloads to about 1200 bytes so that with 
> framing they will fit into a 1280-byte MTU. Going down to 576 would 
> significantly increase the packetization overhead.


Is there any data out there about what works in practice? I've seen very 
conflicting information, ranging from "anything above what IPv4 allows is 
risky" to "Ethernet kills everything above 1500". Wikipedia seems to think 
that while IPv4's lowest MTU is 576, practical path MTUs are only 
"generally" higher, which doesn't seem like a good enough guarantee for 
Web-platform APIs.

I'm happy to change this, but I'd like solid data to base the decision on.
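
To put rough numbers on the packetization overhead mentioned above, here is 
a minimal sketch comparing the two limits under discussion. The 504/576 and 
1200/1280 figures come from this thread; the 6000-byte message size and the 
function name are made up for illustration, and the per-packet overhead is 
simply taken as MTU minus payload.

```javascript
// Limits quoted in this thread; "overhead" here is just MTU minus payload.
const IPV4_MINIMUM = { mtu: 576, payload: 504 };
const IPV6_MINIMUM = { mtu: 1280, payload: 1200 };

// Hypothetical helper: how many packets a message needs under a given limit,
// and how many bytes of per-packet overhead that implies in total.
function packetization(messageBytes, limit) {
  const packets = Math.ceil(messageBytes / limit.payload);
  const overheadBytes = packets * (limit.mtu - limit.payload);
  return { packets, overheadBytes };
}

const small = packetization(6000, IPV4_MINIMUM);
const large = packetization(6000, IPV6_MINIMUM);
console.log(small, large);
```

For this example message the smaller limit needs 12 packets where the larger 
one needs 5.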

On Wed, 13 Apr 2011, Harald Alvestrand wrote:
> The practical MTU of the current Internet is the Ethernet MTU: 1500 
> bytes minus headers. The IPv6 "minimum maximum" of 1280 bytes was chosen 
> to leave some room for headers, tunnels and so on.

That was my guess, yeah. Certainly once IPv6 is more widely deployed I 
think it would obviously make sense to increase the limit.

> My suggestion would be to note that applications need to be aware that 
> due to firewalls and other types of black holes, you might get 
> consistent packet loss for packets larger than a given size, typically 
> 1280 bytes or 1480 bytes, and leave it at that.

Unlike most other platforms, where authors ("programmers") tend to be at 
least somewhat experienced and where errors tend to be blamed on the 
application, on the Web, many of the authors are amateurs, and 
intermittent errors tend to be blamed on the browser. As such, we have to 
design our APIs to be as reliable as possible.

On Mon, 11 Apr 2011, Justin Uberti wrote:
> > On Tue, 29 Mar 2011, Harald Alvestrand wrote:
> >
> > > multiplexing of multiple data streams on the same channel using 
> > > SSRC,
> >
> > I don't follow. What benefit would that have?
> If you are in a conference that has 10 participants, you don't want to 
> have to set up a new transport for each participant. Instead, SSRC 
> provides an excellent way to multiplex multiple media streams over a 
> single RTP session (and network transport).

Could you elaborate on this? I have tried finding information on how SSRC 
works but cannot find anything useful. Can you point to the relevant parts 
of the RFCs that describe this mechanism maybe? I'm having trouble 
understanding how it works even for audio/video streams, let alone whether 
it would actually be appropriate for data.
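
For reference, a minimal sketch of the idea, assuming the fixed 12-byte RTP 
header layout from RFC 3550, in which the SSRC is the 32-bit big-endian 
field at byte offset 8. Packets from every source arrive on one transport 
and the receiver sorts them by that field. The helper names are 
hypothetical, and the sketch ignores the rest of the header (including any 
CSRC list or header extension).

```javascript
const RTP_FIXED_HEADER_BYTES = 12;
const SSRC_OFFSET = 8;

// Hypothetical helper for the demonstration: builds an RTP-shaped packet
// with only the version bits and the SSRC filled in.
function makePacket(ssrc, payload) {
  const packet = new Uint8Array(RTP_FIXED_HEADER_BYTES + payload.length);
  packet[0] = 0x80; // version 2, no padding, no extension, no CSRCs
  new DataView(packet.buffer).setUint32(SSRC_OFFSET, ssrc); // big-endian
  packet.set(payload, RTP_FIXED_HEADER_BYTES);
  return packet;
}

// Read the synchronization source identifier from the fixed header.
function readSsrc(packet) {
  if (packet.length < RTP_FIXED_HEADER_BYTES) {
    throw new RangeError("packet shorter than the fixed RTP header");
  }
  const view = new DataView(packet.buffer, packet.byteOffset, packet.byteLength);
  return view.getUint32(SSRC_OFFSET);
}

// Group the payloads of packets received on one transport by their SSRC.
function demultiplex(packets) {
  const streams = new Map();
  for (const packet of packets) {
    const ssrc = readSsrc(packet);
    if (!streams.has(ssrc)) streams.set(ssrc, []);
    streams.get(ssrc).push(packet.subarray(RTP_FIXED_HEADER_BYTES));
  }
  return streams;
}

const streams = demultiplex([
  makePacket(0x1111, Uint8Array.of(1)),
  makePacket(0x2222, Uint8Array.of(2)),
  makePacket(0x1111, Uint8Array.of(3)),
]);
console.log([...streams.keys()]);
```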

> > On Fri, 8 Apr 2011, Glenn Maynard wrote:
> > >
> > > FWIW, I thought the block-of-text configuration string was peculiar 
> > > and unlike anything else in the platform.  I agree that using a 
> > > configuration object (of some kind) makes more sense.
> >
> > An object wouldn't work very well as it would add additional steps in 
> > the case where someone just wants to transmit the configuration 
> > information to the client as data. Using JSON strings as input as 
> > Harald suggested could work, but seems overly verbose for such 
> > simple data.
> I have a feeling that this configuration information will only start off 
> simple.

The configuration information mechanism is extensible. But generally 
speaking, we should not solve problems we don't yet have.

On Wed, 13 Apr 2011, Harald Alvestrand wrote:
> Since Ian seems to prefer to jumble all threads on a given group of 
> issues together in one message, I'll attempt to use the same format this 
> time.

FWIW, over the years I have received much feedback to the effect that 
people overall prefer it when I coalesce all feedback about a particular 
topic into a single big e-mail reply, which is a big reason why I continue 
to use this style.

> > On Tue, 29 Mar 2011, Harald Alvestrand wrote:
> > > On 03/29/11 03:00, Ian Hickson wrote:
> > > > On Wed, 23 Mar 2011, Harald Alvestrand wrote:
> > > > > > Is there really an advantage to not using SRTP and reusing the RTP
> > > > > > format for the data messages?
> > > > Could you elaborate on how (S)RTP would be used for this? I'm all in
> > > > favour of deferring as much of this to existing protocols as possible,
> > > > but RTP seemed like massive overkill for sending game status packets.
> > >
> > > If "data" was defined as an RTP codec ("application/packets?"), SRTP
> > > could be applied to the packets.
> > > 
> > > It would impose a 12-byte header in front of the packet and the
> > > recommended authentication tag at the end, but would ensure that we
> > > could use exactly the same procedure for key exchange
> >
> > We already use SDP for key exchange for the data stream.
> Yes, with a means of applying encryption that is completely unique to 
> this specification. I'm not fond of novel cryptography designed by 
> non-cryptographers; seen that done before. (I've also seen flaws found 
> in novel cryptography designed by cryptographers....)

If there is an existing mechanism for encryption that is designed to 
handle the particular situation of an attacker-controlled plaintext, 
attacker-controlled victim, attacker-controlled signalling channel, and 
that must be resilient against cross-protocol attacks, I'm more than happy 
to use it, as, like you, I am not fond of new security mechanisms.

Unfortunately, I'm not aware of any such system.

(Note that the system _was_ designed by someone with experience in 
cryptography, and is based on a system that was itself studied for some 
time in the IETF hybi group. It's not entirely novel, the only really 
novel aspects are in how it is applied to a UDP stream.)

> > > multiplexing of multiple data streams on the same channel using 
> > > SSRC,
> >
> > I don't follow. What benefit would that have?
> If, for instance, a FPS wants one stream of events for bullet 
> trajectories and another stream of events for sound-source movements, 
> multiple data streams will allow the implementor to not invent his own 
> multiplexing layer.

If this is a use case that turns out to be actually useful, we can 
trivially add multiplexing to this channel without having to take on all 
of RTP. However, this doesn't seem like a particularly compelling use 
case. Multiplexing systems tend to be pretty application-specific and are 
reasonably easy to create.
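
As a rough illustration of how small an application-specific multiplexing 
layer can be, the sketch below prefixes each message with a one-byte channel 
number. The one-byte tag, the channel numbers, and the function names are 
all assumptions made for the example, not anything in the spec.

```javascript
// Hypothetical channel numbers for the FPS example above.
const BULLETS = 0;
const SOUNDS = 1;

// Prefix a payload with its channel number before sending it.
function tag(channel, payload) {
  if (!Number.isInteger(channel) || channel < 0 || channel > 255) {
    throw new RangeError("channel must fit in one byte");
  }
  const message = new Uint8Array(payload.length + 1);
  message[0] = channel;
  message.set(payload, 1);
  return message;
}

// Split a received message back into its channel number and payload.
function untag(message) {
  if (message.length < 1) {
    throw new RangeError("message has no channel byte");
  }
  return { channel: message[0], payload: message.subarray(1) };
}

const received = untag(tag(SOUNDS, Uint8Array.of(10, 20, 30)));
console.log(received.channel, received.payload);
```

The cost is one byte of the per-packet budget for each message.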

> > > and procedures for identifying the stream in SDP (if we continue to 
> > > use SDP) - I believe SDP implicitly assumes that all the streams it 
> > > describes are RTP streams.
> >
> > That doesn't seem to be the case, but I could be misinterpreting SDP. 
> > Currently, the HTML spec includes instructions on how to identify the 
> > stream in SDP; if those instructions are meaningless due to a 
> > misunderstanding of SDP then we should fix it (and in that case, it 
> > might indeed make a lot of sense to use RTP to carry this data).
> I'm not familiar with any HTTP-in-SDP spec; can you point out the 
> reference?

What is HTTP-in-SDP?

> > > I've been told that defining RTP packetization formats for a codec 
> > > needs to be done carefully, so I don't think this is a full 
> > > specification, but it seems that the overhead of doing so is on the 
> > > same order of magnitude as the currently proposed solution, and the 
> > > security properties then become very similar to the properties for 
> > > media streams.
> >
> > There are very big differences in the security considerations for 
> > media data and the security considerations for the data stream. In 
> > particular, the media data can't be generated by the author in any 
> > meaningful way, whereas the data is entirely under author control. I 
> > don't think it is safe to assume that the security properties that we 
> > have for media streams necessarily work for data streams.
> If we support streaming from recorded files, without transcoding, the 
> difference is a lot smaller, since the attacker can create a handcrafted 
> "audio/video data" file. If we allow simplistic codecs like L16 or 
> mu-law, we can't even tell by file analysis that it's not a valid file.
> Have we ruled out the transmission of recorded data, or mandated 
> transcoding?

It's not supported in the current API. If we add it, it's something we 
will have to examine very carefully.

> I was looking at this from the other end: When I as a script author 
> start a Record() process, I need to have some insight into what the 
> format of the Blob (or whatever it is) is going to be.

Not necessarily. It depends on your use case. If all you're doing is 
recording the user's name to play back later as part of an audiovisual 
"madlib", then all you need to know is that the browser supports the same 
format that it records, for example.

In the short term, there will only be a limited number of formats 
generated, and each browser will probably only support one, so in the 
short term if you really do need a particular format, you can use browser 
sniffing to work out whether the format is good or not.

> It's possible that a reasonable method is generate-and-test:
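>    recorder = stream.record()
>    // assign the function itself; do not call it here
>    recorder.callback = testFormat
>    recorder.getRecordedData()
>    function testFormat(blob) {
>        var mimetype = blob.mimetype()
>        // reject the recording if the application cannot accept this type
>        if (!acceptableMimeType(mimetype)) {
>            report("Can't record, I don't like this format")
>        }
>    }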
> but it doesn't seem optimal to me; if the browser is able to record in 
> OGG and MP3, and the application is willing to accept uploaded MP3 files 
> but not OGG (or vice versa), it seems unreasonable to be unable to 
> record just because the default format for the browser is the "wrong 
> one".

Are any browser vendors intending to support both OGG and MP3?

If so, it would be good to get implementation experience from those 
vendors regarding what inputs they find they need. That's then what we 
would base an extension to the spec on.

> > [the configuration format is extensible]
> The cost of supporting formats is the cost of writing parsers; the JSON 
> string parser already exists, and allows extensibility within the scope 
> of JSON, while the parser for the new string object will have to be 
> written, and changed each time the spec gets extended.

You have to write the code that interprets the JSON just like you have to 
write the code that parses the configuration string. It's not less code 
(indeed it might be more, depending on your JSON library).

Also, parsing the current configuration string is truly trivial. Using 
JSON is not free, however: we would have to define how to do 
error-handling for JSON strings, which would be far more expensive.
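
To give a sense of scale, here is a minimal sketch of such a parser. It 
assumes the configuration string is a server type, a space, and then a host 
with an optional ":port" suffix; that shape is an assumption made for this 
example and should be checked against the spec text. The function name is 
hypothetical, anything unrecognised yields null, and IPv6 literals are not 
handled.

```javascript
// Assumed shape: "<TYPE> <host>[:<port>]", for example "STUN example.net".
const CONFIGURATION = /^(STUNS?|TURNS?) ([^\s:]+)(?::([0-9]+))?$/;

function parseConfiguration(config) {
  const match = CONFIGURATION.exec(config);
  if (match === null) return null;
  // A missing port is reported as the empty string.
  const [, type, host, port = ""] = match;
  return { type, host, port };
}

console.log(parseConfiguration("STUN example.net"));
console.log(parseConfiguration("TURN 203.0.113.2:3478"));
```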

> One of the reasons people have given for why they use XML rather than 
> the RFC-822 key:value (or key:value, value, value) syntax is that the 
> parsers for XML are regular, while the RFC-822 parsers fill up with 
> special-casing all the time; they're willing to pay the (hefty) overhead 
> of XML in order to have a regularized parser.

I agree that RFC822 is not a particularly nice format to parse, though I'm 
not sure where it would fit relative to XML. However, using XML doesn't 
solve the problem. You just move the complexity from the parser to the 
code that analyses the XML parser's output.

> > > - For use with STUN and TURN, we need to support the case where we 
> > > need a STUN server and a TURN server, and they're different.
> >
> > TURN servers are STUN servers, at least according to the relevant 
> > RFCs, as far as I can tell. Can you elaborate on which TURN servers do 
> > not implement STUN, or explain the use cases for having different TURN 
> > and STUN servers? This is an area where I am most definitely not an 
> > expert, so any information here would be quite helpful.
> They use the same protocol, but for two different purposes: STUN servers 
> tell you what your address is, and TURN servers relay data. STUN is so 
> cheap, it's not unreasonable to assume that people will not bother with 
> authentication-for-use; for TURN, limiting access to your own customers 
> is definitely something you expect people to do.
> Google Talk deploys its STUN service at stun.l.google.com, and its 
> TURN-like service (it's not quite TURN compliant) at relay.l.google.com.
> At the moment, they are backed by the same binary, but the DNS lookup 
> for the two names does not return the same result.

The ICE spec seems to disagree with this. For example, RFC 5245, in the 
section "Server Reflexive and Relayed Candidates", says:

# If an agent is gathering both relayed and server reflexive
# candidates, it uses a TURN server.  If it is gathering just server
# reflexive candidates, it uses a STUN server.

At no point in the ICE spec does there seem to be a case where an ICE 
agent would use both a STUN server _and_ a TURN server.

Now if this isn't an accurate reflection of reality, then we can fix that, 
ideally by updating the ICE RFC, or alternatively by having an explicit 
"extension" of the ICE RFC (also known as a willful violation of the RFC). 
But then we'd have to very carefully define what the semantics of having 
both would mean.

> > > - The method of DNS lookup is not specified. In particular, it is 
> > > not specified whether SRV records are looked up or not.
> >
> > This seems to be entirely specified. Please ensure that you are 
> > reading the normative conformance criteria for user agents, and not 
> > the non-normative authoring advice, which is only a brief overview.
> Yes, I spoke a bit hastily. The authoritative text says (unless I missed
> something):
>    * The IP address, host name, or domain name of the server is host.
>    * The port to use is port. If this is the empty string, then only a
>      domain name is configured (and the ICE Agent will use DNS SRV
>      requests to determine the IP address and port).
> This needs a reference to the relevant RFC to be complete (section 9 of 
> RFC 5389 for STUN, RFC 5766 section 6.1 for TURN). It doesn't specify 
> what will happen if there is a domain name, no port, and no SRV records 
> (by referencing the RFCs, this becomes clear - you look up the A/AAAA 
> record and use port 3478/5349 as appropriate).

The requirement is in the ICE RFC, in the section "Server Reflexive and 
Relayed Candidates", third paragraph (second on page 21). The HTML spec 
just fills in the fields the ICE RFC asks for, and defers to ICE for the 
requirements on processing those fields.
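
The fallback order described above can be sketched as a pure function over 
the results of a DNS SRV query. The record shape (name, port, priority) and 
the function name are assumptions made for the example; 3478 is the default 
STUN/TURN port, and weight-based selection among SRV records of equal 
priority is left out.

```javascript
const DEFAULT_PORT = 3478;

// host and port are the configured fields; srvRecords is whatever an SRV
// lookup for the domain returned (an empty array if there were no records).
function candidateEndpoints(host, port, srvRecords) {
  if (port !== "") {
    // An explicit port means no SRV lookup is needed.
    return [{ host, port: Number(port) }];
  }
  if (srvRecords.length > 0) {
    // Lower priority values are tried first.
    return [...srvRecords]
      .sort((a, b) => a.priority - b.priority)
      .map((record) => ({ host: record.name, port: record.port }));
  }
  // No port and no SRV records: use the name itself with the default port,
  // to be resolved through its A/AAAA records.
  return [{ host, port: DEFAULT_PORT }];
}

console.log(candidateEndpoints("example.net", "", []));
```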

> The same section says:
> > The long-term username for the STUN or TURN server is the ASCII 
> > serialization of the entry script's origin; the long-term password is 
> > the empty string.
> I found this exceedingly surprising; this effectively means that you're 
> not protecting your STUN/TURN exchanges.

It's protected using the same-origin policy (only pages from origins 
that are allowed to use the server can use it from a browser; anyone can 
use it from outside a browser, same as, e.g., HTTP). Is there a better 
mechanism we could use?
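
As a minimal sketch of what those credentials look like, assuming the 
`origin` property of the WHATWG URL API (available in browsers and Node.js) 
is an acceptable stand-in for "the ASCII serialization of the entry 
script's origin". The function name and the example URL are hypothetical.

```javascript
// Derive the long-term credentials from the URL of the page's entry script.
function longTermCredentials(entryScriptUrl) {
  return { username: new URL(entryScriptUrl).origin, password: "" };
}

const credentials = longTermCredentials("https://game.example.com/lobby/index.html");
console.log(credentials);
```

A server operator would then compare the username against a list of origins 
allowed to use the server.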

On Wed, 13 Apr 2011, Stefan Håkansson LK wrote:
> >
> > I had assumed that the video would at first be sent with some more or 
> > less arbitrary dimensions (maybe the native ones), and that the 
> > receiving UA would then renegotiate the dimensions once the stream was 
> > being displayed somewhere. Since the page can let the user change the 
> > <video> size dynamically, it seems the UA would likely need to be able 
> > to do that kind of dynamic update anyway.
> Yeah, maybe that's the way to do it. But I think the media should be 
> sent with some sensible default resolution initially. Having a very high 
> resolution could congest the network, and a very low would give bad user 
> experience until the format has been renegotiated.

Any suggestion as to what the default resolution should be?

Currently this is all left up to the UA.

On Wed, 13 Apr 2011, Harald Alvestrand wrote:
> One possible initial resolution is 0x0 (no video sent); if the initial 
> "addStream" callback is called as soon as the ICE negotiation concludes, 
> the video recipient can set up the destination path so that it knows 
> what a sensible resolution is, and can signal that back.
> Of course, this means that after the session negotiation and the ICE 
> negotiation, we have to wait for the resolution negotiation before we 
> have any video worth showing.

I've added some prose to the spec encouraging the initial negotiation to 
use the native resolution, since that seems a bit more likely to be the 
final resolution than 0x0, but I've also mentioned that the resolution is 
expected to be renegotiated to match the final output size.

As we get more implementation experience with this, we can adjust it.
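
A minimal sketch of that renegotiation step, assuming the receiver reports 
the size of the element the stream is displayed in and the sender scales its 
native resolution down to fit while preserving the aspect ratio. The 
function name, the rounding, and the choice never to upscale are assumptions 
made for the example; the spec leaves all of this to the UA.

```javascript
// native and display are { width, height } in pixels.
function renegotiatedSize(native, display) {
  const scale = Math.min(
    1, // never send more than the native resolution
    display.width / native.width,
    display.height / native.height
  );
  return {
    width: Math.max(1, Math.round(native.width * scale)),
    height: Math.max(1, Math.round(native.height * scale)),
  };
}

console.log(renegotiatedSize({ width: 1280, height: 720 }, { width: 320, height: 240 }));
```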

On Sun, 17 Apr 2011, Stefan Håkansson LK wrote:
> I think this is an interesting idea. "Don't transmit until someone 
> consumes". I guess some assessment should be made of how long the extra 
> wait would be - but the ICE "channel" is available so maybe it would not 
> be too long.

I think in general it's unlikely that the stream won't be consumed pretty 
quickly, so sending something, even if it's a bit higher res than needed, 
seems better than sending nothing.

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
