[whatwg] Reconsidering how we deal with text track cues

Tue Sep 3 14:38:35 PDT 2013

On Wed, 12 Jun 2013, Brendan Long wrote:
> On 06/11/2013 11:11 PM, Silvia Pfeiffer wrote:
> > 
> > I suggest we rename WebVTTCue [1] to VTTCaptionCue and allow such cues 
> > only on tracks of kind={captions, subtitles}.
> 
> Why VTTCaptionCue and not just HTMLCue? It seems like any cue that can 
> be rendered needs to be able to provide its content as HTML, and once we 
> have that, the browser shouldn't care where we got that HTML from.

This premise seems quite wrong. Can you elaborate on why you think that?

> Do we expect browsers to have special rendering rules for every caption 
> format?

Each caption format _does_ have special rendering rules.

> It seems like the most likely result would be that the browser vendors 
> just don't bother implementing anything besides WebVTT.

That seems like a likely scenario, which seems fine.

On Mon, 17 Jun 2013, Silvia Pfeiffer wrote:
> On Thu, Jun 13, 2013 at 3:08 AM, Ian Hickson <ian at hixie.ch> wrote:
> > On Wed, 12 Jun 2013, Silvia Pfeiffer wrote:
> >>
> >> As we continue to evolve the functionality of text tracks, we will 
> >> introduce other, more complex structured content into cues and we 
> >> will want browsers to parse and interpret it.
> >
> > I think it's a mistake to try to solve problems before they exist. We 
> > don't know exactly what we'll be adding in the future, so we don't 
> > know what we'll need yet.
> 
> I'm preparing to start specifying how to render chapters. There's 
> already been mention of a need for a thumbnail image in chapters.
> 
> I'll also have to specify how to "render" descriptions. Since the target 
> audience are blind and vision-impaired users, there will be a rendering 
> algorithm that includes speech synthesis.
> 
> This is a problem I have to deal with now.

I don't think the problems you describe here require any changes to the 
API or to the format, but maybe I'm missing something. (Images for 
chapters would, I guess, if you're not using the images from the video 
file, but why wouldn't you use those actual images?)

> >> For example, I expect that once we have support for speech synthesis 
> >> in browsers [1], cues of kind descriptions will be voiced by speech 
> >> synthesis, and eventually we want to influence that speech synthesis 
> >> with markup (possibly a subpart of SSML [2] or some other simpler 
> >> markup that influences prosody).
> >
> > I think it's highly unlikely that we'll actually ever want that, but 
> > if we ever do, then we should fix the problem then.
> 
> Rendering description cues with speech synthesis is 100% something that 
> is coming. Richer markup of description cues is then just the logical 
> next step - it won't be required now, but is certainly on the roadmap. 
> How likely it will be to be SSML is unclear - I'd much prefer a simpler 
> markup for WebVTT, too.

I'm not even remotely convinced that speech synthesis in description cues 
needs any markup, let alone markup more elaborate than VTT already has.

> >> All of these new cue settings would end up as new attributes on the 
> >> WebVTTCue object. This is a dangerous design path that we have taken.
> >
> > This is wrong on two points. One, there's nothing forcing a text track 
> > format to only generate one kind of object -- just like HTML generates 
> > different objects for different elements, WebVTT could generate 
> > different objects for different cues.
> 
> Indeed, that's what I believe will be necessary.
> 
> > Two, it's not dangerous to have an object with lots of fields.
> 
> Why then do we distinguish between an HTMLMediaElement, an 
> HTMLVideoElement and an HTMLAudioElement? What reasons make us create 
> new objects?

It's just a matter of convenience. <video> and <audio> elements have 
different results, so they expose different APIs. They share almost the 
entirety of their API via their abstract superclass, HTMLMediaElement.

> >> What we have done with WebVTT is actually two-fold:
> >> 1. we have created a file format that serializes arbitrary content
> >> that is time-synchronized with a media element.
> >> 2. and we have created a simple caption/subtitle cue format.
> >>
> >> That both are called "WebVTT" is the cause of a lot of confusion and not
> >> a good design approach.
> >
> > I think it's a mistake to view these as distinct. It's just one format.
> > But as you're that spec's editor, that's your choice. :-)
> 
> We've actually done more - we also have a chapter and a metadata cue format:
> http://dev.w3.org/html5/webvtt/#dfn-webvtt-cue
> 
> "WebVTT chapter title text is syntactically a subset of WebVTT cue
> text, and WebVTT cue text is syntactically a subset of WebVTT metadata
> text. Conformance checkers, when validating WebVTT files, may offer to
> restrict all cues to only having WebVTT chapter title text or WebVTT
> cue text as their cue payload; WebVTT metadata text cues are only
> useful for scripted applications (using the metadata text track
> kind)."
> 
> They are already defined hierarchically in terms of each other (and 
> already were when you were the editor).
> 
> They just aren't represented in objects in this way.

I don't think the way you're viewing it is the right way to view it. IMHO 
there's just one format, it just can be used in various ways, just like 
HTML can be used for applications and documents and games, etc. Just like 
how HTML sometimes requires a <title> and sometimes does not, based on 
context, there are different contextual constraints on authoring VTT. But 
it's still just one format.

> Thus, I suggest that a cue coming from a WebVTT file on a kind=chapters 
> track will be interpreted as a ChapterCue, on a kind=captions track as a 
> VTTCaptionsCue, and on a kind=metadata track as a MetadataCue.

IMHO this doesn't make any sense.

You don't create different objects for different elements in HTML based on 
whether the document was opened in an <iframe> or a top-level browsing 
context or whatever. The API doesn't, shouldn't, change based on context.

It looks like I already said that, though:

> > It would be like having a different DOM for an HTML file in an 
> > <iframe> and in a top-level browsing context.
> 
> Contrast that to applying a different parsing and rendering algorithm
> of the <iframe> depending on the parent element that it is put into,
> which is what we are currently doing with WebVTTCue.

No, we're parsing VTT the same in all cases. It's the conformance rules 
that change, and the rendering rules. Just like with HTML, where the 
rendering can change, e.g. based on the CSS of the page, or on where the 
element is (e.g. an <h1> in a <body> vs one in three nested <section>s).
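
To make the distinction concrete, here is a rough sketch (in TypeScript) of 
"one parser, contextual rendering". Everything in it is hypothetical and 
heavily simplified: SketchCue, parseCueBlock, toSeconds and present are 
made-up names, the parser ignores cue settings and cue identifiers, and 
the strings returned by present only stand in for real rendering rules.

```typescript
// Hypothetical, simplified model; not the real WebVTT parser or the
// rendering rules from the spec.
type TrackKind = "subtitles" | "captions" | "descriptions" | "chapters" | "metadata";

interface SketchCue {
  startTime: number; // seconds
  endTime: number;   // seconds
  text: string;      // raw cue payload
}

// Accepts "mm:ss.ttt" or "hh:mm:ss.ttt".
function toSeconds(timestamp: string): number {
  return timestamp.split(":").reduce((total, part) => total * 60 + Number(part), 0);
}

// One parser for every track kind: the kind is not even a parameter.
function parseCueBlock(block: string): SketchCue {
  const [timing, ...payload] = block.trim().split("\n");
  // Keep only the timestamp token on each side of "-->"; settings are ignored.
  const [start, end] = timing
    .split("-->")
    .map((side) => toSeconds(side.trim().split(/\s+/)[0]));
  return { startTime: start, endTime: end, text: payload.join("\n") };
}

// What happens to a cue depends on the track it sits on, not on the cue.
function present(cue: SketchCue, kind: TrackKind): string {
  switch (kind) {
    case "subtitles":
    case "captions":
      return `overlay: ${cue.text}`;
    case "chapters":
      return `chapter list entry: ${cue.text}`;
    case "descriptions":
      return `speak: ${cue.text}`;
    case "metadata":
      return ""; // nothing is rendered; scripts read cue.text
  }
}

const sketchCue = parseCueBlock("00:01.000 --> 00:04.500\nIntroduction");
console.log(present(sketchCue, "captions"));
console.log(present(sketchCue, "chapters"));
```

The same sketchCue object is handed to present for both kinds; only the 
second argument, which comes from the track, differs.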

> An alternative would be to create explicit <captiontrack>, 
> <descriptiontrack> etc elements, which was something that was under 
> discussion initially.

IMHO the current state is fine as it is.

> > You don't necessarily know, when parsing the WebVTT file or HTML file, 
> > what it's going to be used for. In the case of WebVTT, it could even 
> > change from one to another.
> 
> I'd disallow changing the kind on a track.

Content attributes can't be made read-only.

Also, that seems artificial. Why would we do that?

> >> Also, we separate out the WebVTT serialisation format syntax 
> >> specification from the cue syntax specification [2] and introduce 
> >> separate parsers [3] for the different cue syntax formats. The 
> >> rendering section [4] has already started distinguishing between cue 
> >> rendering for chapters and for captions/subtitles. This will easily 
> >> fit with the now separated cue syntax formats.
> >
> > This sounds like a lot of complication for no particularly good 
> > reason, but again, you're the editor. :-)
> 
> This is work that has to be done even if we decide to only have a single 
> object represent all cues of a WebVTT file.

No, you don't need separate parsers. There's just one parser, and the 
rendering rules are contextual, not based on anything about the cues.

> The point of this email is to introduce a hierarchy of objects that 
> represent cues (or at least an agreement on when such new objects should 
> be created).

We have a hierarchy. TextTrackCue is an abstract object that is defined in 
the HTML spec. VTTCue is the object used by VTT cues. I don't see any need 
for any others in HTML or VTT.
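
As a rough illustration of that two-level hierarchy (in TypeScript), the 
classes below are simplified stand-ins, not the real interfaces: the 
"Sketch" suffix marks them as hypothetical, and the real TextTrackCue and 
VTTCue carry more members than are shown here.

```typescript
// Simplified stand-ins for the interfaces named above; the real ones are
// defined in the HTML and WebVTT specs.
abstract class TextTrackCueSketch {
  constructor(
    public startTime: number, // seconds
    public endTime: number,   // seconds
    public id: string = "",
  ) {}
}

// The one concrete cue class used for cues that come from a VTT file.
class VTTCueSketch extends TextTrackCueSketch {
  constructor(startTime: number, endTime: number, public text: string) {
    super(startTime, endTime);
  }
}

// The same class serves every track kind; nothing about the object
// records whether it is being used as a caption or as a chapter title.
const captionCue = new VTTCueSketch(0, 2.5, "Hello.");
const chapterCue = new VTTCueSketch(0, 60, "Introduction");
console.log(captionCue instanceof TextTrackCueSketch);
console.log(captionCue.constructor === chapterCue.constructor);
```

Under this model, adding a per-kind class such as a ChapterCue would mean 
adding a third level (or a sibling of VTTCueSketch), which is the step the 
paragraph above argues is unnecessary.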

> >> Doing this will make WebVTT and the TextTrack API extensible for new 
> >> cue formats, such as cues in SSML format, or ThumbnailCues, or 
> >> MidrollAdCues or whatnot else we may see necessary in the future.
> >
> > It's already plenty extensible enough.
> 
> Right, you brought in the extensibility a few weeks ago by introducing 
> TextTrackCue as an abstract object and pushing all its extended 
> attributes into WebVTTCue, which is great. I'm just trying to come up 
> with the best scheme to make use of this extensibility, and I think it 
> makes more sense to base the creation of new objects on cue content 
> than on the text track file's MIME type.

I would recommend against using the extensibility. :-)

On Mon, 17 Jun 2013, Brendan Long wrote:
>
> I don't think it's necessary to use the same language for authoring as 
> display though. Since we already have rules for rendering HTML, and 
> WebVTT seems to be a subset of HTML (with some special CSS rules, and 
> some shorthand tags), I think the easiest way to handle it would be to 
> translate WebVTT cues into HTML+CSS, then rely on the existing rendering 
> engine.

VTT's rendering rules differ from basic CSS in some pretty important ways. 
In the ways that it is similar, it already reuses CSS infrastructure.

There's not really any relationship with HTML, though.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'