[whatwg] Reconsidering how we deal with text track cues

Silvia Pfeiffer silviapfeiffer1 at gmail.com
Tue Jun 11 22:11:27 PDT 2013

Hi all,

The model in which we have looked at text tracks (<track> element of
media elements) thus far has some issues that I would like to point
out in this email and I would like to suggest a new way to look at
tracks. This will result in changes to the HTML and WebVTT specs and
has an influence on others specifying text track cue formats, so I am
sharing this information widely.

Current situation
Text tracks provide lists of timed cues for media elements, i.e. each
cue has a start time, an end time, and some content that is to be
interpreted in sync with the media element's timeline.

WebVTT is the file format that we chose to define as a serialisation
for the cues (just like audio files serialize audio samples/frames and
video files serialize video frames).

The way in which we currently parse WebVTT files into JS objects has
us create objects of type WebVTTCue. These objects contain information
about any kind of cue that could be included in a WebVTT file -
captions, subtitles, descriptions, chapters, metadata and whatnot.

The WebVTTCue object looks like this:

enum AutoKeyword { "auto" };
[Constructor(double startTime, double endTime, DOMString text)]
interface WebVTTCue : TextTrackCue {
           attribute DOMString vertical;
           attribute boolean snapToLines;
           attribute (long or AutoKeyword) line;
           attribute long position;
           attribute long size;
           attribute DOMString align;
           attribute DOMString text;
  DocumentFragment getCueAsHTML();
};

There are attributes in the WebVTTCue object that relate only to cues
of kind captions and subtitles (vertical, snapToLines etc). For cues
of other kinds, the only relevant attribute right now is the text
attribute.

This works for now, because cues of kind descriptions and chapters are
only regarded as plain text, and the structure of the content of cues
of kind metadata is not parsed by the browser. So, for cues of kind
descriptions, chapters and metadata, that .text attribute is
sufficient.

The consequence
As we continue to evolve the functionality of text tracks, we will
introduce more complex, structured content into cues and we will
want browsers to parse and interpret it.

For example, I expect that once we have support for speech synthesis
in browsers [1], cues of kind descriptions will be voiced by speech
synthesis, and eventually we want to influence that speech synthesis
with markup (possibly a subpart of SSML [2] or some other simpler
markup that influences prosody).

Since we have set ourselves up for parsing all cue content that comes
out of WebVTT files into WebVTTCue objects, we now have to expand the
WebVTTCue object with attributes for speech synthesis, e.g. I can
imagine cue settings for descriptions to contain a field called
"channelMask" to contain which audio channels a particular cue should
be rendered into with values being center, left, right.

Another example is that eventually somebody may want to introduce
ThumbnailCues that contain data URLs for images and may have a
"transparency" cue setting. Or somebody wants to formalize
MidrollAdCues that contain data URLs for short video ads and may have
a "skippableAfterSecs" cue setting.

All of these new cue settings would end up as new attributes on the
WebVTTCue object. This is a dangerous design path that we have taken.

[1] https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#tts-section
[2] http://www.w3.org/TR/speech-synthesis/#S3.2
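
To make the concern concrete, here is a rough TypeScript sketch of
where a single catch-all cue type leads. The class name CatchAllCue and
the helper irrelevantSettings are hypothetical, and channelMask,
transparency and skippableAfterSecs are the speculative settings
mentioned above, not part of any spec.

```typescript
// Hypothetical sketch: one cue type shared by every kind of track.
// None of these names are taken from the HTML or WebVTT specs.
type CueKind = "captions" | "subtitles" | "descriptions" | "chapters" | "metadata";

class CatchAllCue {
  startTime: number;
  endTime: number;
  text: string;
  // Caption/subtitle layout settings, present on every cue.
  vertical = "";
  snapToLines = true;
  line: number | "auto" = "auto";
  position = 50;
  size = 100;
  align = "middle";
  // Speculative settings that future cue formats would have to add here.
  channelMask?: "center" | "left" | "right"; // descriptions only
  transparency?: number; // thumbnails only
  skippableAfterSecs?: number; // mid-roll ads only

  constructor(startTime: number, endTime: number, text: string) {
    this.startTime = startTime;
    this.endTime = endTime;
    this.text = text;
  }
}

// Nothing in the type stops a chapter cue from carrying caption layout
// or ad settings that are meaningless for chapters.
const chapter = new CatchAllCue(0, 60, "Introduction");
chapter.vertical = "rl";
chapter.skippableAfterSecs = 5;

// Lists the settings that were set on a cue but mean nothing for the
// given track kind.
function irrelevantSettings(cue: CatchAllCue, kind: CueKind): string[] {
  const unused: string[] = [];
  const isCaption = kind === "captions" || kind === "subtitles";
  if (!isCaption && cue.vertical !== "") unused.push("vertical");
  if (kind !== "descriptions" && cue.channelMask !== undefined) unused.push("channelMask");
  // No existing track kind corresponds to ads, so this is always unused.
  if (cue.skippableAfterSecs !== undefined) unused.push("skippableAfterSecs");
  return unused;
}
```

Every new cue format adds more optional fields of this sort, and each
consumer has to know which of them apply to which kind of track.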

Problem analysis
By restricting ourselves to a single WebVTTCue object to represent all
types of cues that come from a WebVTT file, we have ignored that WebVTT
is just a serialisation format for cues, and that it is the cues
themselves that provide the different types of timed content to the
browser. The browser should not have to care about the serialisation
format, but it should care about the different types of content that a
track cue could contain.

For example, it is possible that a WebVTT caption cue (one with all
the markup and cue settings) can be provided to the browser through a
WebM file or through an MPEG file or in fact (gasp!) through a TTML
file. Such a cue should always end up in a WebVTTCue object (will need
a better name) and not in an object that is specific to the
serialisation format.

What we have done with WebVTT is actually two-fold:
1. we have created a file format that serializes arbitrary content
that is time-synchronized with a media element.
2. and we have created a simple caption/subtitle cue format.

That both are called "WebVTT" is the cause of a lot of confusion and
not a good design approach.

The solution
We thus need to distinguish between cue formats in the browser and not
between serialisation formats (we don't distinguish between different
image formats or audio formats in the browser either - we just handle
audio samples or image pixels).

Once a WebVTT file is parsed into a list of cues, the browser should
not have to care any more that the list of cues came from a WebVTT
file or anywhere else. It's a list of cues with a certain type of
content that has a parsing and a rendering algorithm attached.
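
As an illustration of this separation, the following TypeScript sketch
parses the same chapter from two different serialisations into one cue
type. The ChapterCue shape and the functions parseVttChapter,
parseJsonChapter and renderChapter are hypothetical names, the JSON
record is a made-up stand-in for any other container, and the parser
handles only a single "hh:mm:ss.ttt --> hh:mm:ss.ttt" timing line
followed by text, which is far less than a real WebVTT parser does.

```typescript
// Hypothetical cue type; it records nothing about where it came from.
interface ChapterCue {
  kind: "chapters";
  startTime: number; // seconds
  endTime: number; // seconds
  text: string;
}

// Converts "hh:mm:ss.ttt" to seconds (simplified: hours are required).
function toSeconds(timestamp: string): number {
  const parts = timestamp.split(":").map(Number);
  if (parts.length !== 3 || parts.some((p) => Number.isNaN(p))) {
    throw new Error("bad timestamp: " + timestamp);
  }
  return parts[0]! * 3600 + parts[1]! * 60 + parts[2]!;
}

// Serialisation 1: a single WebVTT-style cue block (timing line, then text).
function parseVttChapter(block: string): ChapterCue {
  const lines = block.trim().split("\n");
  const timing = lines[0] ?? "";
  const parts = timing.split(" --> ");
  if (parts.length !== 2) throw new Error("bad timing line: " + timing);
  return {
    kind: "chapters",
    startTime: toSeconds(parts[0]!.trim()),
    endTime: toSeconds(parts[1]!.trim()),
    text: lines.slice(1).join("\n"),
  };
}

// Serialisation 2: a made-up JSON record standing in for another container.
function parseJsonChapter(json: string): ChapterCue {
  const raw = JSON.parse(json) as { start: number; end: number; title: string };
  return { kind: "chapters", startTime: raw.start, endTime: raw.end, text: raw.title };
}

// Downstream code sees only cues, never the serialisation format.
function renderChapter(cue: ChapterCue): string {
  return `${cue.text} (${cue.startTime}s to ${cue.endTime}s)`;
}

const fromVtt = parseVttChapter("00:00:00.000 --> 00:01:30.500\nIntroduction");
const fromJson = parseJsonChapter('{"start": 0, "end": 90.5, "title": "Introduction"}');
```

Both fromVtt and fromJson can be handed to renderChapter without the
renderer knowing which parser produced them.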

Spec consequences
What needs to change in the specs to deal with this different approach
to text tracks is not hard to deduce.

Firstly, there are consequences on the WebVTT spec.

I suggest we rename WebVTTCue [1] to VTTCaptionCue and allow such cues
only on tracks of kind={captions, subtitles}.
Also, we separate out the WebVTT serialisation format syntax
specification from the cue syntax specification [2] and introduce
separate parsers [3] for the different cue syntax formats.
The rendering section [4] has already started distinguishing between
cue rendering for chapters and for captions/subtitles. This will
easily fit with the now separated cue syntax formats.

We will then introduce a ChapterCue which adds a .text attribute and a
constructor onto AbstractCue for cues (in WebVTT or from elsewhere)
that are interpreted as chapters and have their own rendering
algorithm.
Similarly, we introduce a DescriptionCue which adds a .text attribute
and a constructor onto AbstractCue and we define a rendering algorithm
that makes use of the new speech synthesis API [5].
Similarly, we introduce a MetadataCue which adds a .content attribute
and a constructor onto AbstractCue with no rendering algorithm.
I think these new cue objects would make even more sense if defined
in HTML including their rendering algorithms rather than in the WebVTT
spec, because they are generic and we don't want chapters to be
rendered differently just because they have originated from a
different serialisation format.

[1] http://dev.w3.org/html5/webvtt/#webvtt-api
[2] http://dev.w3.org/html5/webvtt/#syntax
[3] http://dev.w3.org/html5/webvtt/#parsing
[4] http://dev.w3.org/html5/webvtt/#rendering
[5] https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#tts-section
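
The hierarchy proposed above could look roughly like the TypeScript
sketch below. The class names follow this email; the constructor
signatures, the render() method and the strings it returns are
assumptions made purely for illustration, and the caption layout
settings of VTTCaptionCue are omitted.

```typescript
// Sketch of the proposed hierarchy. Class names follow the proposal;
// constructor signatures and render() are illustrative assumptions.
abstract class AbstractCue {
  startTime: number;
  endTime: number;
  constructor(startTime: number, endTime: number) {
    this.startTime = startTime;
    this.endTime = endTime;
  }
  // Each cue format supplies its own rendering; null means "not rendered".
  abstract render(): string | null;
}

class VTTCaptionCue extends AbstractCue {
  text: string;
  constructor(startTime: number, endTime: number, text: string) {
    super(startTime, endTime);
    this.text = text;
  }
  render(): string {
    return `[caption overlay] ${this.text}`;
  }
}

class ChapterCue extends AbstractCue {
  text: string;
  constructor(startTime: number, endTime: number, text: string) {
    super(startTime, endTime);
    this.text = text;
  }
  render(): string {
    return `[chapter menu] ${this.text}`;
  }
}

class DescriptionCue extends AbstractCue {
  text: string;
  constructor(startTime: number, endTime: number, text: string) {
    super(startTime, endTime);
    this.text = text;
  }
  // Stand-in for handing the text to a speech synthesis engine.
  render(): string {
    return `[speech synthesis] ${this.text}`;
  }
}

class MetadataCue extends AbstractCue {
  content: string;
  constructor(startTime: number, endTime: number, content: string) {
    super(startTime, endTime);
    this.content = content;
  }
  // Metadata is left to scripts; the browser renders nothing.
  render(): null {
    return null;
  }
}

const cues: AbstractCue[] = [
  new ChapterCue(0, 60, "Introduction"),
  new MetadataCue(0, 60, '{"id": 1}'),
];
const rendered = cues.map((cue) => cue.render());
```

New formats such as a ThumbnailCue would then be added as further
subclasses rather than as extra attributes on an existing object.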

Secondly, there are consequences for the TextTrackCue object hierarchy
in the HTML spec.

I suggest we rename TextTrackCue [6] to AbstractCue (or just Cue). It
is simply the abstract result of parsing a serialisation of cues (e.g.
a WebVTT file) into its individual cues.

Similarly TextTrackCueList [7] should be renamed to CueList and should
be a cue list of only one particular type of cue. Thus, the parsing
and rendering algorithm in use for all cues in a CueList is fixed.
Also, a CueList of e.g. ChapterCues should only be allowed to be
attached to a track of kind=chapters, etc.

[6] http://www.whatwg.org/specs/web-apps/current-work/#texttrackcue
[7] http://www.whatwg.org/specs/web-apps/current-work/#texttrackcuelist
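
A minimal TypeScript sketch of a single-type CueList and of the kind
check on attachment follows. The Track class, the cueType tag and the
allowedCueType table are hypothetical; the proposal above only states
that a list holds one type of cue and may only be attached to a track
of a matching kind.

```typescript
type TrackKind = "captions" | "subtitles" | "descriptions" | "chapters" | "metadata";
type CueType = "VTTCaptionCue" | "ChapterCue" | "DescriptionCue" | "MetadataCue";

// Simplified cue record; the cueType tag is an assumption of this sketch.
interface Cue {
  cueType: CueType;
  startTime: number;
  endTime: number;
}

// Hypothetical mapping from track kind to the one cue type it accepts.
const allowedCueType: Record<TrackKind, CueType> = {
  captions: "VTTCaptionCue",
  subtitles: "VTTCaptionCue",
  descriptions: "DescriptionCue",
  chapters: "ChapterCue",
  metadata: "MetadataCue",
};

// A list that holds cues of exactly one type.
class CueList {
  readonly cueType: CueType;
  private cues: Cue[] = [];
  constructor(cueType: CueType) {
    this.cueType = cueType;
  }
  add(cue: Cue): void {
    if (cue.cueType !== this.cueType) {
      throw new TypeError(`this list only accepts ${this.cueType}`);
    }
    this.cues.push(cue);
  }
  get length(): number {
    return this.cues.length;
  }
}

// A track that only accepts a cue list matching its kind.
class Track {
  readonly kind: TrackKind;
  cues: CueList | null = null;
  constructor(kind: TrackKind) {
    this.kind = kind;
  }
  attach(list: CueList): void {
    if (allowedCueType[this.kind] !== list.cueType) {
      throw new TypeError(`kind=${this.kind} does not accept ${list.cueType} lists`);
    }
    this.cues = list;
  }
}

const chapters = new CueList("ChapterCue");
chapters.add({ cueType: "ChapterCue", startTime: 0, endTime: 60 });
const chapterTrack = new Track("chapters");
chapterTrack.attach(chapters); // accepted: kinds match
```

Attaching the same list to a track of kind=captions, or adding a
MetadataCue to it, throws a TypeError in this sketch.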

Doing this will make WebVTT and the TextTrack API extensible for new
cue formats, such as cues in SSML format, or ThumbnailCues, or
MidrollAdCues or whatever else we may find necessary in the future.

This may look like a lot of changes, but it's really just some
renaming and an introduction of a small number of semantically clean
new objects.

Feedback welcome.

