[whatwg] Exposing framerate / statistics of <video> playback and related feedback

Mon Apr 30 16:37:08 PDT 2012

There was a lot of e-mail on this topic, but a stark lack of descriptions 
of actual end-user use cases for these features, as will be clear in the 
responses I give below.

A quick reminder therefore that in adding features to HTML the first thing 
we want to look for is the problem that we are trying to solve. Without 
that, we don't know how to evaluate the proposed solutions! See this FAQ:

   http://wiki.whatwg.org/wiki/FAQ#Is_there_a_process_for_adding_new_features_to_a_specification.3F

On Fri, 28 May 2010, Ian Fette wrote:
>
> Has any thought been given to exposing such metrics as framerate, how 
> many frames are dropped, rebuffering, etc from the <video> tag?

It has come up a lot, but the main question is: what is the use case?

> This is interesting for things not just like benchmarking,

Could you elaborate on this? Benchmarking what, by whom, and why?

> but for a site to determine if it is not working well for clients and 
> should instead e.g. switch down to a lower bitrate video.

If the problem is making sure the user can stream a resource with a 
bitrate such that the user can play the stream in real time as it is 
downloading, then the better solution seems to be a rate-negotiating media 
protocol, not an API to expose the frame rate. The frame rate may have 
nothing at all to do with the bandwidth: for example, if the user has a 
software decoder, the framerate could be low because the CPU is overloaded 
due to the user running a complicated simulation in another process. 
Similarly, the download rate could be slow because the bandwidth was 
throttled by the user because the user is doing other things and is happy 
to wait for the video to download in the background so that the user can 
watch it later.

On Sun, 30 May 2010, Jeroen Wijering wrote:
> 
> For determining whether the user-agent is able to play a video, these 
> are the most interesting properties:
> 
>  readonly attribute unsigned long bandwidth::
>     The current maximum server » client bandwidth, in bits per second.

How is this to be determined? In particular, for example, what should 
happen if the user has the page opened twice, or three times? Is the value 
the same in each tab, or is it reduced accordingly?

>  readonly attribute unsigned long droppedframes::
>     The number of frames dropped by the user agent since playback of this video was initialized.

What use case does this number address?

On Fri, 2 Jul 2010, Jeroen Wijering wrote:
> 
> The most useful ones are:
> 
> *) droppedFrames: it can be used to determine whether the client can play the video without stuttering.

The user can presumably tell if the video can play without stuttering just 
by watching it stutter or not stutter. Why would the page need to know? 
Surely there's nothing the page can do about it -- e.g., the playback 
might just be stuttering because the user agent is intentionally dropping 
every other frame because the video is actually not visible on the screen 
currently, or because the video is being played back at 2x speed.

> *) maxBytesPerSecond: it can be used to determine the bandwidth of the 
> connection.

Only if nobody else is using the connection at the same time. What if two 
pages are both open at the same time and both use this to determine the 
connection speed? They'll start interfering with each other.

On Fri, 7 Jan 2011, Rob Coenen wrote:
> 
> are there any plans on adding frame accuracy and/or SMPTE support to 
> HTML5 video?

Not without a use case. :-)

> As far as I know it's currently impossible to play HTML5 video 
> frame-by-frame, or seek to a SMPTE compliant (frame accurate) time-code. 
> The nearest seek seems to be precise to roughly 1-second (or nearest 
> keyframe perhaps, can't tell).

The API supports seeking to any frame, if you know its precise time 
index.

On Sun, 9 Jan 2011, Rob Coenen wrote:
>
> I have written a simple test using a H264 video with burned-in timecode 
> (every frame is visually marked with the actual SMPTE timecode) Webkit 
> is unable to seek to the correct timecode using 'currentTime', it's 
> always a whole bunch of frames off from the requested position. I reckon 
> it simply seeks to the nearest keyframe?

That's a limitation of the implementation, not of the specification.

On Tue, 11 Jan 2011, Rob Coenen wrote:
>
> just a follow up question in relation to SMPTE / frame accurate 
> playback: As far as I can tell there is nothing specified in the HTML5 
> specs that will allow us to determine the actual frame rate (FPS) of a 
> movie? In order to do proper time-code calculations it's essential to 
> know both the video.duration and video.fps - and all I can find in the 
> specs is video.duration, nothing in video.fps

What is the use case?

On Tue, 12 Jan 2011, Rob Coenen wrote:
> 
> [...] I'd like the 'virtual' FPS of the WebM file exposed to the 
> webbrowser- similar to how my other utilities report a FPS.

Why?

On Wed, 12 Jan 2011, Dirk-Willem van Gulik wrote:
> 
> So that means that SMPTE time code have a meaning - and 
> skipping/scrubbing through a video at one output frame at the time makes 
> perfect sense.

An "advance one frame" or "rewind one frame" API would be reasonable, I 
think; would it address your use case? What is your use case, exactly? 
Film editing or some such?

> Likewise for audio.

What is a "frame" for the purposes of audio?

> And for any creative scenario - being able to do exactly that - pause, 
> jump, cut at an exact time code - is pretty much the number one 
> requirement.

Why do you need an exact time code for this, as opposed to knowing what 
the user was watching when the user requested a pause/jump/cut?

> So being able to ensure that an exact SMPTE timecode is shown - or know 
> which is shown - is the basic need.

Why is the precise time in fractions of a second not sufficient?

On Tue, 11 Jan 2011, Rob Coenen wrote:
>
> [...] in an ideal world I'd want to seek to a time expressed as a SMPTE 
> timecode (think web apps that let users step x frames back, seek y 
> frames forward etc.). In order to convert SMPTE to the floating point 
> value for video.seekTime I need to know the frame rate.

Why not always use the "floating point value"?

On Tue, 11 Jan 2011, Rob Coenen wrote:
>
> David I agree- however that's common practice in any video editing tool, 
> in any digital video camera, etc. It's the defacto industry standard for 
> anyone working with digital video.

Could you elaborate on how you would use this timecode if we exposed it?

On Wed, 12 Jan 2011, Dirk-Willem van Gulik wrote:
>
> Right - but that foregoes a bit how subtle the SMPTE timecode definition 
> is (http://en.wikipedia.org/wiki/SMPTE_time_code is a good start) - and 
> this is exactly why it is defined in such odd a manner (as you do have 
> exactly this tautology problem between, say, NTSC and PAL).
> 
> So yes - you want to express this - knowing full well that once you have 
> less than one frame/second the interpretation is a bit odd. But 
> ultimately it does let you define exactly where a cut/splice/etc is - 
> and how exactly two things are overlaid, etc.

Why isn't a time in seconds sufficient?

Given that SMPTE timecodes essentially only make sense for some very 
specific frame rates, and in particular given the rather esoteric nature 
of legacy drop frame timecodes, it seems more sensible to just convert 
SMPTE timecodes to seconds for internal purposes when dealing with the 
media API, and display them as SMPTE timecodes for users who expect them. 
It's not clear to me what benefit there would be to actually requiring 
that user agents fake the SMPTE timecodes.

On Wed, 12 Jan 2011, Rob Coenen wrote:
>
> I guess that I'm looking at HTML5 from the POV as a video-producer 
> rather than a video-consumer.
> 
> As a producer I'm much more interested in the "legacy" video formats. The 
> way video is being produced is simply on a frame-by-frame basis. I 
> cannot think of any 3D animation tool, video sequencer, video editor, or 
> anything that allows you to work with video- that works with anything 
> but full frames.
> 
> video-consumers who only play back the video in a linear way are probably 
> much more interested in bandwidth saving features such as the WebM 
> non-frame based approach.
> 
> Obviously we don't want to have some API that breaks future video 
> standards, but I cannot see why we can't have both at the same 
> time. It would make the video-producers happy: frame-by-frame accuracy, 
> fixed frame rates and SMPTE timecodes.

If you have fixed frame rates, it's trivial to do the conversion to and 
from SMPTE timecode in JavaScript; you don't need any direct support from 
the media element API.
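
To illustrate, here is a minimal sketch of that conversion. It assumes 
non-drop-frame timecode in HH:MM:SS:FF form and an integer frame rate 
that the page already knows (the media element does not report it). The 
function names are made up for this example.

```javascript
// Non-drop-frame SMPTE timecode <-> seconds for a fixed, integer frame rate.
// Assumption: the caller supplies fps; nothing here reads it from the media.

function timecodeToSeconds(timecode, fps) {
  const match = /^(\d{2}):(\d{2}):(\d{2}):(\d{2})$/.exec(timecode);
  if (!match) {
    throw new Error("expected HH:MM:SS:FF, got " + timecode);
  }
  const [hh, mm, ss, ff] = match.slice(1).map(Number);
  if (mm >= 60 || ss >= 60 || ff >= fps) {
    throw new Error("field out of range in " + timecode);
  }
  // Count whole frames first, then divide once, to limit rounding error.
  const totalFrames = (hh * 3600 + mm * 60 + ss) * fps + ff;
  return totalFrames / fps;
}

function secondsToTimecode(seconds, fps) {
  // The small epsilon keeps values such as 11.999999999 from flooring to 11.
  const totalFrames = Math.floor(seconds * fps + 1e-6);
  const ff = totalFrames % fps;
  const totalSeconds = (totalFrames - ff) / fps;
  const ss = totalSeconds % 60;
  const mm = Math.floor(totalSeconds / 60) % 60;
  const hh = Math.floor(totalSeconds / 3600);
  const pad = (n) => String(n).padStart(2, "0");
  return [hh, mm, ss, ff].map(pad).join(":");
}
```

A page would then seek with something like 
video.currentTime = timecodeToSeconds("00:01:00:12", 25), and display 
secondsToTimecode(video.currentTime, 25) to users who expect timecodes. 
Drop-frame timecode needs extra bookkeeping that this sketch omits.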

On Wed, 12 Jan 2011, Philip Jägenstedt wrote:
> 
> For the record, this is the solution I've been imagining:
> 
> * add HTMLMediaElement.seek(t, [exact]), where exact defaults to false 
> if missing
> 
> * make setting HTMLMediaElement.currentTime be a non-exact seek, as that 
> seems to be the most common case

Since this was also filed as a bug, I'll deal with it there:

   https://www.w3.org/Bugs/Public/show_bug.cgi?id=14851

On Tue, 15 Feb 2011, Kevin Marks wrote:
> 
> Frame stepping is used when you want to mark an accurate in or out 
> point, or catch a still frame. This needs to be accurate, and it is 
> always local.

We can address this using a .advanceOneFrame()/.rewindOneFrame() method. 
Would that work? Does it have to be more generic? How would we handle 
formats that don't have well-defined frames?

> Chapter stepping means 'move me to the next meaningful break point in 
> this media'. There is a very natural structure for this in almost all 
> professional media, and it is definitely worth getting this right. This 
> is a long range jump, but it is likely to be a key frame or start of new 
> file segment.

This is already specced.

> Scrubbing is when you are dragging the bar back and forth to find a 
> particular point. It is intermediate in resolution between the previous 
> two, but it needs to be responsive to work - the lag between moving the 
> bar and showing something. In many cases decoding only key frames in 
> this state makes sense, as this is most responsive, and also likely to 
> catch scene boundaries, which are commonly key frames anyway.

This would be addressed by the proposal in the bug cited above.

> The degenerate case of scrubbing is 'fast-forwarding', where the stream 
> is fetched faster than realtime, but again only keyframes are shown.

This is already specced.

On Tue, 15 Feb 2011, Rob Coenen wrote:
>
> Rather than trying to sum up all use cases I think that the media asset 
> should be fully random accessible and frame accurate to cover any 
> current and future use cases. You should be able to write Javascripts 
> that tell the asset to go to any point in time.

That's not how we design APIs -- if we try to solve all possible problems, 
first of all the API would be unusably complicated, and secondly we'd 
still miss some so we'd actually be no better off.

> That way a web developer (or implementers such as the guys of JWPlayer) 
> can come up with their own mechanisms for stuff such as "chapters" etc. 
> I don't believe that chapters should be part of the HTML5 spec.

They are, actually. :-)

On Wed, 12 Jan 2011, Dirk-Willem van Gulik wrote:
> On 12 Jan 2011, at 00:48, Dirk-Willem van Gulik wrote:
> > 
> > the clock relative to shutter/gating to the end user - as this is what 
> > you need to avoid flicker, interlace issues, half the frame showing 
> > the next scene, etc.
> 
> Apologies - got some private mail asking for examples. So the simplest 
> example I can think of is a second of black video followed by a second 
> of white. At the moment of transition - the creative person designing 
> this wanted it to perfectly 'flash' from black to white.
> 
> If somehow you updated the display halfway through a refresh cycle (and 
> let's assume your update process happens from top to bottom) then for 'one 
> refresh cycle' you'd show a black (old) top half and a white bottom 
> half. You can get similar fun and games with objects moving faster (or 
> around) the speed of your shown-as-size/update cycle. And it gets more 
> complex when your 'screen' does not update in a simple way - but uses 
> interlacing[1].
> 
> Now in practice we generally avoid this by having double buffer, slaving 
> our frame refresh to the video card or in the video card to something 
> else, etc. But it is easy to get wrong.

This is not a problem we need to worry about in the API, since the 
browsers would take care of this as a quality of implementation issue.

On Wed, 12 Jan 2011, Mikko Rantalainen wrote:
> 2011-01-12 00:40 EEST: Rob Coenen:
> > Hi David- that is b/c in an ideal world I'd want to seek to a time 
> > expressed as a SMPTE timecode (think web apps that let users step x 
> > frames back, seek y frames forward etc.). In order to convert SMPTE to 
> > the floating point value for video.seekTime I need to know the frame 
> > rate.
> 
> It seems to me that such an application really requires a method for 
> querying the timestamp for previous and next frames when given a 
> timestamp. If such an application requires FPS value, it can then 
> compute it by itself if such a value is assumed meaningful. (Simply get 
> next frame timestamp from zero timestamp and continue for a couple of 
> frames to compute FPS and check if the FPS seems to be stable.)
> 
> Perhaps there should be a method
> 
> getRelativeFrameTime(timestamp, relation)
> 
> where timestamp is the "current" timestamp and relation is one of 
> previousFrame, nextFrame, previousKeyFrame, nextKeyFrame?
> 
> Use of this method could be allowed only for paused video if needed for 
> simple implementation.

Without knowing more about the use case here it's hard to evaluate this 
proposal.

On Wed, 12 Jan 2011, Jeroen Wijering wrote:
> 
> Alternatively, one could look at a step() function instead of a 
> seek(pos,exact) function. The step function can be used for 
> frame-accurate controls. e.g. step(2) or step(-1). The advantage over a 
> seek(pos,exact) function (and the playback rate controls) is that the 
> viewer really knows the video is X frames offset. This is very useful 
> for both artistic/editing applications and for video analysis 
> applications (think sports, medical or experiments).
> 
> The downside of a step() to either always accurate seeking or a 
> seek(pos,exact) is that it requires two steps in situations like 
> bookmarking or chaptering.
> 
> It seems like the framerate / SMPTE proposals done here are all a means 
> to end up with frame-accurate seeking. With a step() function in place, 
> there's no need for such things. In fact, one could do a step(10) or so 
> and then use the difference in position to calculate framerate.

This still leaves the problem of media formats without a concept of 
"frames" (e.g. a SMIL animation, or a MIDI file).
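
For what it's worth, the "use the difference in position" calculation is 
easy to express once a page has a list of frame positions, however it 
got them. A rough sketch follows; the function name and the tolerance 
parameter are assumptions for illustration, and step() itself remains 
hypothetical. It returns null when the positions are not evenly spaced, 
which is the case for media without a meaningful frame rate.

```javascript
// Estimate a frame rate from successive frame positions (in seconds).
// Returns null if there are too few positions or the spacing is irregular.
function estimateFrameRate(timestamps, tolerance = 0.01) {
  if (timestamps.length < 3) {
    return null;
  }
  const total = timestamps[timestamps.length - 1] - timestamps[0];
  const mean = total / (timestamps.length - 1);
  if (!(mean > 0)) {
    return null;
  }
  // Every interval must be within `tolerance` (a fraction) of the mean.
  for (let i = 1; i < timestamps.length; i++) {
    const interval = timestamps[i] - timestamps[i - 1];
    if (Math.abs(interval - mean) > tolerance * mean) {
      return null;
    }
  }
  return 1 / mean;
}
```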

On Wed, 12 Jan 2011, Rob Coenen wrote:
>
> glad that you are mentioning these artistic/editing/video analysis type 
> of applications. I'd like to add video 
> archiving/logging/annotating/subtitling to the list of potential 
> applications. But also experiments and time-based interaction. Most 
> online ad-agencies have been using Flash to design eg. highly 
> interactive mini-sites where banners, etc. are shown or hidden based up 
> on the exact timing of the video. Also think projects such as 
> http://www.thewildernessdowntown.com/

If you could elaborate on these use cases, specifically detailing what 
problems they expose that are not yet solved, that would be very helpful.

On Wed, 12 Jan 2011, Jeroen Wijering wrote:
> 
> With the step() in place, this would be a simple convenience function. 
> This pseudo-code is not ideal and making some assumptions, but the 
> approach should work:
> 
> function seekToTimecode(timecode) {
>     var seconds = convert_timecode_to_seconds(timecode);
>     videoElement.seek(seconds);
>     var delta = seconds - videoElement.currentTime;
>     while (delta > 0) {
>         videoElement.step(1);
>         delta = seconds - videoElement.currentTime;
>     }
> };

Why not just this?:

   videoElement.currentTime = convert_timecode_to_seconds(timecode);

(Note that "seek()" and "step()" and similar APIs would be asynchronous, 
so you couldn't implement the approach above as written.)
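
The same is true of setting currentTime today: the assignment returns 
immediately, and the existing "seeked" event fires once the new position 
is ready. A minimal sketch of waiting for that, assuming "media" is an 
HTMLMediaElement (seekThen is a name made up for this example):

```javascript
// Start a seek and run `callback` once the element reports it has finished.
// `media` is assumed to expose currentTime, addEventListener and
// removeEventListener, as HTMLMediaElement does.
function seekThen(media, seconds, callback) {
  function onSeeked() {
    // Listen only for this one seek.
    media.removeEventListener("seeked", onSeeked);
    callback(media.currentTime);
  }
  media.addEventListener("seeked", onSeeked);
  media.currentTime = seconds; // starts the seek; completion arrives later
}
```

Any work that depends on the new frame being shown belongs in the 
callback, not on the line after the assignment.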

On Thu, 27 Jan 2011, Steve Lacey wrote:
> 
> The original suggestion for the video element looks good:
> 
> [Video Element]
> 
> // Frames decoded and available for playback.
> unsigned long decodedFrames;
> 
> // Frames dropped during playback for performance reasons.
> unsigned long droppedFrames;
> 
> But for the media element I'd like to propose raw bytes instead of a
> rate as this allows the developer to construct their own rates (if
> needed) based on whatever window they want. It would also be useful to
> separate audio from video. A suggestion might be:
> 
> [Media Element]
> 
> unsigned long audioBytesDecoded;
> unsigned long videoBytesDecoded;
> 
> Though this seems a little strange to have these specifically on the
> media element as they reference particular media types. Another idea
> would be to move these to the video element and also add
> audioBytesDecoded to the audio element.
> 
> Another open question: what are sensible values if the information is
> not available. Zero seems wrong.
> 
> Thoughts?

I'm still unclear as to what the use cases are, so it's hard to evaluate 
this proposal.
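
For reference, "construct their own rates" amounts to sampling a 
cumulative byte counter and differencing over a window. A rough sketch, 
assuming the page has recorded samples from a hypothetical counter such 
as the proposed videoBytesDecoded (which is only a proposal here); the 
function name and the sample shape are likewise made up:

```javascript
// samples: array of { time: seconds, bytes: cumulative count }, oldest first.
// Returns the average rate in bits per second over the last `windowSeconds`,
// or null if there is not enough data.
function averageBitrate(samples, windowSeconds) {
  if (samples.length < 2) {
    return null;
  }
  const last = samples[samples.length - 1];
  // Pick the oldest sample that still lies inside the window.
  let first = last;
  for (const sample of samples) {
    if (last.time - sample.time <= windowSeconds) {
      first = sample;
      break;
    }
  }
  const elapsed = last.time - first.time;
  if (elapsed <= 0) {
    return null;
  }
  return ((last.bytes - first.bytes) * 8) / elapsed;
}
```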

On Thu, 31 Mar 2011, Silvia Pfeiffer wrote:
>
> Please note that I've started a wiki page at 
> http://wiki.whatwg.org/wiki/Video_Metrics to try and collect all ideas 
> around media element statistics. Please add freely!

The use cases on this page aren't really use cases, especially for 
"decodedFrames" and subsequent proposals. They just describe what the 
attributes do, not what problem they solve.

On Thu, 27 Jan 2011, Steve Lacey wrote:
> On Thu, Jan 27, 2011 at 3:53 PM, Chris Pearce <chris at pearce.org.nz> 
> wrote:
> >
> > Out of curiosity, why do you want this feature? What does it give you 
> > that @buffered and @currentTime does not?
> 
> Being able to determine the bitrate that's currently being decoded has 
> been a request from devs (for similar reasons that devs on the FOMS list 
> have I expect). Raw data seems generally useful.

Can you elaborate on these reasons? Raw data is often interesting, but not 
always useful. There's terabytes of data per second that we could be 
exposing that we do not, due to lack of use cases.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'