[whatwg] Reconsidering how we deal with text track cues

Silvia Pfeiffer silviapfeiffer1 at gmail.com
Tue Sep 3 15:01:16 PDT 2013

On Wed, Sep 4, 2013 at 7:38 AM, Ian Hickson <ian at hixie.ch> wrote:
> On Mon, 17 Jun 2013, Silvia Pfeiffer wrote:
>> On Thu, Jun 13, 2013 at 3:08 AM, Ian Hickson <ian at hixie.ch> wrote:
>> > On Wed, 12 Jun 2013, Silvia Pfeiffer wrote:
>> >>
>> >> As we continue to evolve the functionality of text tracks, we will
>> >> introduce more complex other structured content into cues and we will
>> >> want browsers to parse and interpret them.
>> >
>> > I think it's a mistake to try to solve problems before they exist. We
>> > don't know exactly what we'll be adding in the future, so we don't
>> > know what we'll need yet.
>> I'm preparing to start specifying how to render chapters. There's
>> already been mention of need for a thumbnail image in chapters.
>> I'll also have to specify how to "render" descriptions. Since the target
>> audience are blind and vision-impaired users, there will be a rendering
>> algorithm that includes speech synthesis.
>> This is a problem I have to deal with now.
> I don't think the problems you describe here require any changes to the
> API or to the format, but maybe I'm missing something. (Images for
> chapters would, I guess, if you're not using the images from the video
> file, but why wouldn't you use those actual images?)

It's possible that the chapter track author wants to use other images.
The images in DVD chapters aren't always from the video either. Also,
if you have an audio-only src in the <video> element, but you want to
provide images for chapter navigation, then you'd provide them in the
WebVTT file, probably as dataURLs. I've seen that done before.
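
As a rough illustration only, here is how a script might read such a
chapter file. Everything specific in it is an assumption made for the
example: the convention that the first payload line is the chapter title
and an optional second line is a data URL thumbnail is hypothetical and is
not defined by WebVTT, and the names ChapterCue, sampleVtt and
parseChapterCues are invented here. The parser is deliberately simplified
and does not follow the full WebVTT parsing rules.

```typescript
// Hypothetical convention (NOT part of the WebVTT spec): the first payload
// line of a chapter cue is the title, and an optional second line carries
// a thumbnail as a data URL.
interface ChapterCue {
  start: string;
  end: string;
  title: string;
  thumbnail?: string;
}

// Sample file invented for this sketch. The base64 payload is placeholder
// data (just the PNG signature bytes), not a displayable image.
const sampleVtt = [
  "WEBVTT",
  "",
  "00:00:00.000 --> 00:01:30.000",
  "Introduction",
  "data:image/png;base64,iVBORw0KGgo=",
  "",
  "00:01:30.000 --> 00:05:00.000",
  "Interview",
].join("\n");

// Simplified parser: handles blank-line separated cue blocks only.
function parseChapterCues(vtt: string): ChapterCue[] {
  const cues: ChapterCue[] = [];
  // Blocks are separated by one or more blank lines.
  for (const block of vtt.split(/\r?\n(?:\r?\n)+/)) {
    const lines = block.split(/\r?\n/);
    const timingIndex = lines.findIndex((line) => line.includes("-->"));
    if (timingIndex === -1) continue; // header or other non-cue block
    const [startPart, endPart] = lines[timingIndex].split("-->");
    const start = startPart.trim();
    // Drop any cue settings that follow the end timestamp.
    const end = endPart.trim().split(/\s+/)[0];
    const payload = lines.slice(timingIndex + 1);
    const title = payload[0] ?? "";
    const second = payload[1];
    const cue: ChapterCue = { start, end, title };
    if (second !== undefined && second.startsWith("data:")) {
      cue.thumbnail = second;
    }
    cues.push(cue);
  }
  return cues;
}
```

Calling parseChapterCues(sampleVtt) yields two chapter entries, with a
thumbnail only on the first. A real page would more likely read cue text
from the text track API than re-parse the file itself.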

>> >> For example, I expect that once we have support for speech synthesis
>> >> in browsers [1], cues of kind descriptions will be voiced by speech
>> >> synthesis, and eventually we want to influence that speech synthesis
>> >> with markup (possibly a subpart of SSML [2] or some other simpler
>> >> markup that influences prosody).
>> >
>> > I think it's highly unlikely that we'll actually ever want that, but
>> > if we ever do, then we should fix the problem then.
>> Rendering description cues with speech synthesis is 100% something that
>> is coming. Richer markup of description cues is then just the logical
>> next step - it won't be required now, but is certainly on the roadmap.
>> How likely it will be to be SSML is unclear - I'd much prefer a simpler
>> markup for WebVTT, too.
> I'm not even remotely convinced that speech synthesis in description cues
> needs any markup, let alone markup more elaborate than VTT already has.

That would be good. I've been told though that synthesised speech does
not convey enough prosodic and emotional content and that markup would
help with that (see
Not a top priority though.

>> >> What we have done with WebVTT is actually two-fold:
>> >> 1. we have created a file format that serializes arbitrary content
>> >> that is time-synchronized with a media element.
>> >> 2. and we have created a simple caption/subtitle cue format.
>> >>
>> >> That both are called "WebVTT" is the cause of a lot of confusion and not
>> >> a good design approach.
>> >
>> > I think it's a mistake to view these as distinct. It's just one format.
>> > But as you're that spec's editor, that's your choice. :-)
>> We've actually done more - we also have a chapter and a metadata cue format:
>> http://dev.w3.org/html5/webvtt/#dfn-webvtt-cue
>> "WebVTT chapter title text is syntactically a subset of WebVTT cue
>> text, and WebVTT cue text is syntactically a subset of WebVTT metadata
>> text. Conformance checkers, when validating WebVTT files, may offer to
>> restrict all cues to only having WebVTT chapter title text or WebVTT
>> cue text as their cue payload; WebVTT metadata text cues are only
>> useful for scripted applications (using the metadata text track
>> kind)."
>> They are already hierarchically defined upon each other (already when
>> you were the editor).
>> They just aren't represented in objects in this way.
> I don't think the way you're viewing it is the right way to view it. IMHO
> there's just one format, it just can be used in various ways, just like
> HTML can be used for applications and documents and games, etc. Just like
> how HTML sometimes requires a <title> and sometimes does not, based on
> context, there are different contextual constraints on authoring VTT. But
> it's still just one format.

I've followed that view (you might have seen such a message on the W3C
list). I'm expecting it will lead to interesting consequences, such as
hyperlinks and dataURLs becoming part of WebVTT cue markup. But it is
likely the easier route.

