[whatwg] Reconsidering how we deal with text track cues

Silvia Pfeiffer silviapfeiffer1 at gmail.com
Sun Jun 16 23:41:18 PDT 2013

On Thu, Jun 13, 2013 at 3:08 AM, Ian Hickson <ian at hixie.ch> wrote:
> On Wed, 12 Jun 2013, Silvia Pfeiffer wrote:
>> As we continue to evolve the functionality of text tracks, we will
>> introduce more complex other structured content into cues and we will
>> want browsers to parse and interpret them.
> I think it's a mistake to try to solve problems before they exist. We
> don't know exactly what we'll be adding in the future, so we don't know
> what we'll need yet.

I'm preparing to start specifying how to render chapters. There's
already been mention of the need for a thumbnail image in chapters.

I'll also have to specify how to "render" descriptions. Since the
target audience is blind and vision-impaired users, there will be a
rendering algorithm that includes speech synthesis.

This is a problem I have to deal with now.

>> For example, I expect that once we have support for speech synthesis in
>> browsers [1], cues of kind descriptions will be voiced by speech
>> synthesis, and eventually we want to influence that speech synthesis
>> with markup (possibly a subpart of SSML [2] or some other simpler markup
>> that influences prosody).
> I think it's highly unlikely that we'll actually ever want that, but if we
> ever do, then we should fix the problem then.

Rendering description cues with speech synthesis is 100% something
that is coming. Richer markup of description cues is then just the
logical next step - it won't be required now, but is certainly on the
roadmap. How likely it is to be SSML is unclear - I'd much prefer
a simpler markup for WebVTT, too.

>> All of these new cue settings would end up as new attributes on the
>> WebVTTCue object. This is a dangerous design path that we have taken.
> This is wrong on two points. One, there's nothing forcing a text track
> format to only generate one kind of object -- just like HTML generates
> different objects for different elements, WebVTT could generate different
> objects for different cues.

Indeed, that's what I believe will be necessary.

> Two, it's not dangerous to have an object with
> lots of fields.

Why, then, do we distinguish between an HTMLMediaElement, an
HTMLVideoElement and an HTMLAudioElement? What reasons make us create
new objects?

>> What we have done with WebVTT is actually two-fold:
>> 1. we have created a file format that serializes arbitrary content
>> that is time-synchronized with a media element.
>> 2. and we have created a simple caption/subtitle cue format.
>> That both are called "WebVTT" is the cause of a lot of confusion and not
>> a good design approach.
> I think it's a mistake to view these as distinct. It's just one format.
> But as you're that spec's editor, that's your choice. :-)

We've actually done more - we also have a chapter and a metadata cue format:

"WebVTT chapter title text is syntactically a subset of WebVTT cue
text, and WebVTT cue text is syntactically a subset of WebVTT metadata
text. Conformance checkers, when validating WebVTT files, may offer to
restrict all cues to only having WebVTT chapter title text or WebVTT
cue text as their cue payload; WebVTT metadata text cues are only
useful for scripted applications (using the metadata text track

They are already defined hierarchically on top of each other (and
already were when you were the editor).

They just aren't represented in objects in this way.

>> Firstly, there are consequences on the WebVTT spec.
>> I suggest we rename WebVTTCue [1] to VTTCaptionCue and allow such cues
>> only on tracks of kind={caption, subtitle}.
> I don't think that makes any sense. Any WebVTT file can be used for any
> "kind" of <track>. These are orthogonal contexts.

Yes, there are two different things at play: the format of the cue and
the interpretation of the cue format in the browser. The second one is
driven by the "kind".

However, WebVTT files are authored with a certain usage target in
mind. If I author a caption file, I'd not expect it to work when
interpreted as a chapter track or a description track.

It is possible to interpret a caption cue on any kind of track, but
then it needs to follow the parsing and rendering approach of cues on
that kind of track. Hooking these different parsing and rendering
algorithms up to the WebVTTCue object and dynamically applying them
depending on the kind of track is a lot of magic to be hidden in an
object. Normally every object that we have in HTML has a single
rendering approach and doesn't change depending on an attribute
setting of a member object.

Thus, I suggest that a cue coming from a WebVTT file on a kind=chapters
track will be interpreted as a ChapterCue, on a kind=captions track as
a VTTCaptionCue, and on a kind=metadata track as a MetadataCue. The
cue as authored in WebVTT could, however, contain anything.
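
To make the suggestion concrete, here is a rough sketch of the dispatch
I have in mind. All names are hypothetical: AbstractCue, ChapterCue,
VTTCaptionCue and MetadataCue are the proposed names from this thread,
createCue is made up for illustration, and none of this is spec text.
Descriptions are left out because no cue type has been proposed for
them yet.

```typescript
// Hypothetical sketch only: these classes mirror the names proposed in
// this email and are not defined in the HTML or WebVTT specifications.
type SketchTrackKind = "subtitles" | "captions" | "chapters" | "metadata";

abstract class AbstractCue {
  readonly startTime: number;
  readonly endTime: number;
  readonly payload: string; // raw cue text as authored in the WebVTT file

  constructor(startTime: number, endTime: number, payload: string) {
    this.startTime = startTime;
    this.endTime = endTime;
    this.payload = payload;
  }
}

class VTTCaptionCue extends AbstractCue {}
class ChapterCue extends AbstractCue {}
class MetadataCue extends AbstractCue {}

// The track kind, not the file format, selects the cue object type.
function createCue(
  kind: SketchTrackKind,
  startTime: number,
  endTime: number,
  payload: string,
): AbstractCue {
  switch (kind) {
    case "chapters":
      return new ChapterCue(startTime, endTime, payload);
    case "metadata":
      return new MetadataCue(startTime, endTime, payload);
    case "subtitles":
    case "captions":
      return new VTTCaptionCue(startTime, endTime, payload);
  }
}
```

The same authored payload ends up in a different object, and thus under
a different parsing and rendering algorithm, depending only on the kind
of the track it is attached to.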

> It would be like having a different DOM for an HTML file in an <iframe>
> and in a top-level browsing context.

Contrast that with applying a different parsing and rendering algorithm
to the <iframe> depending on the parent element that it is put into,
which is what we are currently doing with WebVTTCue.

Since all cues need to inherit from AbstractCue, the DOM is not really
different - it just has a more specific interpretation.

An alternative would be to create explicit <captiontrack>,
<descriptiontrack> etc. elements, which was something that was under
discussion initially.

> You don't necessarily know, when
> parsing the WebVTT file or HTML file, what it's going to be used for. In
> the case of WebVTT, it could even change from one to another.

I'd disallow changing the kind on a track. Then, because the cue is
never parsed and rendered without having been associated with a
TextTrack, it is always clear what it is to be interpreted as.
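
As an illustration of what a fixed kind would mean for script, here is a
minimal sketch. FixedKindTrack and SketchCue are made-up names and this
is not how TextTrack is currently specified; the only point is that the
kind is set once, at construction, and exposed read-only.

```typescript
// Hypothetical sketch: FixedKindTrack and SketchCue are illustrative
// names, not interfaces from the HTML specification.
type FixedTrackKind =
  | "subtitles"
  | "captions"
  | "descriptions"
  | "chapters"
  | "metadata";

interface SketchCue {
  startTime: number;
  endTime: number;
  payload: string;
}

class FixedKindTrack {
  private readonly kindValue: FixedTrackKind;
  readonly cues: SketchCue[] = [];

  constructor(kind: FixedTrackKind) {
    this.kindValue = kind;
  }

  // Getter without a setter: the kind cannot be reassigned later.
  get kind(): FixedTrackKind {
    return this.kindValue;
  }

  // Every cue is attached to a track, so its interpretation is always
  // known from this.kind at the time it is parsed and rendered.
  addCue(cue: SketchCue): void {
    this.cues.push(cue);
  }
}
```

With a shape like this, an assignment such as track.kind = "captions" is
rejected by the type checker, and at runtime it either throws or is
ignored, so the cue's interpretation can never change under it.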

>> Also, we separate out the WebVTT serialisation format syntax
>> specification from the cue syntax specification [2] and introduce
>> separate parsers [3] for the different cue syntax formats. The rendering
>> section [4] has already started distinguishing between cue rendering for
>> chapters and for captions/subtitles. This will easily fit with the now
>> separated cue syntax formats.
> This sounds like a lot of complication for no particularly good reason,
> but again, you're the editor. :-)

This is work that has to be done even if we decide to only have a
single object represent all cues of a WebVTT file.

>> Secondly, there are consequences for the TextTrackCue object hierarchy
>> in the HTML spec.
>> I suggest we rename TextTrackCue to AbstractCue (or just Cue). It is
>> simply the abstract result of parsing a serialisation of cues (e.g. a
>> WebVTT file) into its individual cues.
>> Similarly TextTrackCueList should be renamed to CueList and should be a
>> cue list of only one particular type of cue. Thus, the parsing and
>> rendering algorithm in use for all cues in a CueList is fixed. Also, a
>> CueList of e.g. ChapterCues should only be allowed to be attached to a
>> track of kind=chapters, etc.
> I don't understand the value in changing these names. This seems quite
> orthongonal to the rest of this e-mail.

The point of this email is to introduce a hierarchy of objects that
represent cues (or at least an agreement on when such new objects
should be created).

> In general, I am strongly against changing names unless there's a
> seriously compelling reason, like compatibility requirements. Churn in a
> specification is extremely negative, as it leads implementors to lose
> respect in the spec, and makes them think there's no point in following
> specs in the first place.
> This is one of the core requirements of a Living Standard: that things
> *not change arbitrarily*. We can't just change our minds on things every
> few weeks. We have to pick a direction and then stick with it. Basically,
> we have to have confidence in our decisions. This doesn't mean we can't
> change things, but it means that to change things we should have a
> compelling reason. I don't see one for this proposed change.

Right, we just recently renamed TextTrackCue and introduced WebVTTCue,
which I believe no implementer has followed yet. This is why I am
bringing this up now, while we can still fix it without much churn. If
a browser implemented TextTrackCue now in the way it has been
re-specified, a JS developer could end up in a situation where their
implementation is not backwards compatible - we've already broken
compatibility requirements. I wouldn't want to make such a change ever
again for cues, which is why this has to be done now.

Choosing a different name for the abstract TextTrackCue would cause
fewer backwards compatibility issues. Also, you have said yourself that
a TextTrackCue may contain cues that may have no text in them at all,
so the name "Text" is misleading. This is why AbstractCue or Cue would
be a better name.

Renaming TextTrackCueList would indeed be a pain point. If we really want
to stick with the (misleading) "text track" approach (which, I guess,
is too ingrained for now to change), we can just change the
TextTrackCue object name, since that is currently breaking
compatibility anyway.

Maybe browser vendors can speak up and state their opposition?

>> Doing this will make WebVTT and the TextTrack API extensible for new cue
>> formats, such as cues in SSML format, or ThumbnailCues, or MidrollAdCues
>> or whatnot else we may see necessary in the future.
> It's already plenty extensible enough.

Right, you brought in the extensibility a few weeks ago by introducing
TextTrackCue as an abstract object and pushing all its extended
attributes into WebVTTCue, which is great. I'm just trying to come up
with the best scheme to make use of this extensibility, and I think
it makes more sense to base the creation of new objects on cue content
than on the text track file's MIME type.

On Thu, Jun 13, 2013 at 3:25 AM, Brendan Long <self at brendanlong.com> wrote:
> On 06/11/2013 11:11 PM, Silvia Pfeiffer wrote:
>> I suggest we rename WebVTTCue [1] to VTTCaptionCue and allow such cues
>> only on tracks of kind={caption, subtitle}.
> Why VTTCaptionCue and not just HTMLCue? It seems like any cue that can
> be rendered needs to be able to provide its content as HTML, and once we
> have that, the browser shouldn't care where we got that HTML from.

That could indeed be a different way to approach caption cues.
However, authoring caption text on video with only the formatting
markup that a caption may need and limiting HTML functionality to
features that captions need was one of the motivations for creating
WebVTT.

Doesn't stop us from doing HTMLCue in the future, though.

> Do we expect browsers to have special rendering rules for every caption
> format?

When the cues come in a specific caption format (such as TTML cues), probably.

> It seems like the most likely result would be that the browser
> vendors just don't bother implementing anything besides WebVTT.

IE already supports a basic feature set of TTML as input, too. But IE
hasn't implemented the TextTrack API yet AFAIK.
