[whatwg] <track> / WebVTT issues

Philip Jägenstedt philipj at opera.com
Wed Sep 21 02:15:25 PDT 2011

Implementors of <track> / WebVTT from several browser vendors (Opera,  
Mozilla, Google, Apple) met at the Open Video Conference recently. There  
was a session on video accessibility,[1] a bunch of new bugs were filed  
[2] and there was much rejoicing.

There were a few issues that weren't concrete enough to file bugs on, but  
which I think are still worthwhile discussing further:

== Comments ==

If you look at the source of the spec, you'll find comments as a v2  
feature request:

this is a comment, bla bla

I do not think this would be very useful. As a one-line comment at the top  
of the file (for authorship, etc) it is rather verbose and ugly, while for  
commenting out cues you would have to comment out each cue individually.  
It also doesn't work inside cues, where something like <! comment > is  
what would be backwards compatible with the current parser. If comments  
are left for v2, the above is what it'll be, because of compatibility  
constraints. If anyone is less than impressed with that, now would be the  
time to suggest an alternative and have it spec'd.

== Scrolling captions ==

The WebVTT layout algorithm tries to not move cues around once they've  
been displayed and to never obscure other cues. This means that for cues  
that overlap in time, the rendering will often be out of order, with the  
earliest cue at the bottom. This is quite contrary to the (mainly US?)  
style of (live) scrolling captions, where cues are always in order and  
scroll to bring new captions into view. (I am not suggesting any specific  

== Scaling up and down ==

Scaling the font size with the video will not be optimal for either small  
screens (text will be too small) or very large screens (text will be too  
big). Do we change the default rendering in some way, or do we let users  
override the font size? If users can override it, do we care that this may  
break the intended layout of the author?

== Strict vs forgiving parsing ==

The parser is fairly strict in some regards:

  * double id line discards entire cue  
  * must use exactly 2 digits for minutes and seconds
  * minutes and seconds must be <60
  * must use "." as the decimal separator
  * must use exactly 3 decimal digits
  * stray "<" consumes the rest of the cue text

A small percentage of cues (or cue text) will be dropped because of these  
constraints and this is not very likely to be noticed unless the entire  
video+captions are watched. Possible remedies:

  * make the parser more forgiving where it does not conflict with  
  * make browsers complain a lot in the error console
  * point and laugh at those who failed to use a (non-existent) validator

== Chapter end time ==

In most systems chapters are really chapter markers, a point in time. A  
chapter implicitly ends when the next begins. For nested chapters this  
isn't so, as the end time is used to determine nesting. Do we expect that  
UIs for chapter navigation make the end time visible in some fashion (e.g.  
highlighting the chapter on the timeline) or that when a chapter it is  
chosen, it will pause at the end time?

== --> next ==

A suggestion that was brought up when discussing chapters. When one simply  
wants the chapter to end when the next starts, it's a bit of a hassle to  
always include the end time. Some additional complexity in the parser  
could allow for this:

00:00.000 --> next
Chapter 1

01:00.000 --> next

02:00.000 --> next
Last Chapter

Cues would be created with endTime = Infinity, and be modified to the  
startTime of the following cue (in source order) if there is a following  
cue. This would IMO be quite neat, but is the use case strong enough?

[1] http://openvideoconference.org/standards-for-video-accessibility/
[2] http://wiki.whatwg.org/wiki/WebVTT

Philip Jägenstedt
Core Developer
Opera Software

More information about the whatwg mailing list