[whatwg] Google Feedback on the HTML5 media a11y specifications

Tue Feb 8 18:57:37 PST 2011

Hi Philip, all,

On Sun, Jan 23, 2011 at 1:23 AM, Philip Jägenstedt <philipj at opera.com> wrote:
> On Fri, 14 Jan 2011 10:01:38 +0100, Silvia Pfeiffer
> <silviapfeiffer1 at gmail.com> wrote:
>
>> There are two sections - the first one concerns the WebVTT format and
>> the second one the <track> specification.
>
> Thanks for compiling all of this feedback, Silvia! As usual, my inline
> replies are sometimes terse, not to be mistaken for impatience or disrespect
> :)

We all just want to get this right! :-)
(and: sorry for the delayed reply - I am keen to keep discussing this)

>> A. Feedback on the WebVTT format
>
>> 1. Introduce file-wide metadata
>>
>> WebVTT requires a structure to add header-style metadata. We are here
>> talking about lists of name-value pairs as typically in use for header
>> information. The metadata can be optional, but we need a defined means
>> of adding them.
>>
>> Required attributes in WebVTT files should be the main language in use
>> and the kind of data found in the WebVTT file - information that is
>> currently provided in the <track> element by the @srclang and @kind
>> attributes. These are necessary to allow the files to be interpreted
>> correctly by non-browser applications, for transcoding or to determine
>> if a file was created as a caption file or something else, in
>> particular the @kind=metadata. @srclang also sets the base
>> directionality for BiDi calculations.
>
> Are there non-browsers that use the language for font-selection or bidi? Is
> auto-detection not likely to give a better user experience? Are there any
> other use cases for knowing the language of the captions *after* they've
> been opened?

I can't see a different way to let non-browser applications know what
font to choose, how to provide the user with a menu of available
caption tracks for a video, or how to set the base directionality for
BiDi. Also, language auto-detection is a huge burden to put onto
non-browser applications. Having a readable language tag at the
beginning of the file is useful to quickly figure it all out.

The language set in <track> would certainly overrule what is in the
file. Also, the last language attribute in the header would probably
win.

I guess it would also be ok to have language and kind optional -
different applications may then default to interpreting WebVTT files
differently, such as by default English and Captions - or English and
Descriptions, but that's probably acceptable from context.

> Why do non-browser players need to know the kind? All kinds are processed in
> the same way except metadata, and there's no reason to use metadata tracks
> with external players.

Maybe I have a different view of what applications will make use of
WebVTT files than most. My thinking is that there will also be uses
for metadata tracks in external applications. Aside from this, there
will be authoring applications and players, yes, but there will also
be automated processing tools. So, to know what type of content is
inside a file without having to look at more than the file's headers
is really important.

>> Further metadata fields that are typically used by authors to keep
>> specific authoring information or usage hints are necessary, too. As
>> examples of current use see the format of MPlayer mpsub’s header
>> metadata [2], EBU STL’s General Subtitle Information block [3], and
>> even CEA-608’s Extended Data Service with its StartDate, Station,
>> Program, Category and TVRating information [4]. Rather than specifying
>> a specific subset of potential fields we recommend to just have the
>> means to provide name-value pairs and leave it to the negotiation
>> between the author and the publisher which fields they expect of each
>> other.
>
> This approach has worked very well with Vorbis Comments, probably mostly
> because all interesting fields have been pre-defined in
> http://www.xiph.org/vorbis/doc/v-comment.html
>
> For a web format though, wouldn't some kind of wiki registry be good to
> avoid total mayhem, especially if there are some predefined fields? (Not
> having file-wide metadata would also avoid such mayhem.)

It might be good to define a base set - the Vorbis Comments one or the
ID3 ones could be appropriate. Even the old Dublin Core set (the first
ones, not the current chaos) could be good. I could also analyse the
sets used in current typical caption formats and propose a superset of
those.

While I think you're right with suggesting a predefined set of fields,
I am mostly keen right now to agree on the general format of the
fields and how we need to parse them rather than what they actually
are.

So, I would suggest we allow lines of "name=value" under the WEBVTT
magic string. A blank line defines the end of the header section and
the beginning of the cues. Would be simple enough to parse, right?
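
To illustrate how little parsing that needs, here is a rough Python
sketch. It is not from any spec: the function name parse_header, the
choice to skip lines without "=", and the rule that a repeated name
overwrites the earlier value are all assumptions made for illustration.

```python
def parse_header(text):
    """Split a WebVTT-like file into (metadata dict, remaining cue text).

    Assumes the layout proposed above: a first line starting with
    "WEBVTT", then "name=value" lines, then a blank line ending the header.
    """
    lines = text.splitlines()
    if not lines or not lines[0].startswith("WEBVTT"):
        raise ValueError("missing WEBVTT magic string")
    metadata = {}
    index = 1
    # Header lines run until the first blank line (or the end of the file).
    while index < len(lines) and lines[index].strip():
        name, sep, value = lines[index].partition("=")
        if sep:  # lines without "=" are skipped (assumed policy)
            # A repeated name overwrites the earlier one: last value wins.
            metadata[name.strip()] = value.strip()
        index += 1
    body = "\n".join(lines[index + 1:])
    return metadata, body


sample = """WEBVTT
Language=zh
Kind=Caption
Language=ja

00:00:15.000 --> 00:00:17.950
first cue
"""
meta, cues = parse_header(sample)
print(meta)
```

A consumer that only needs to know the language or kind can stop
reading at the first blank line.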

>> 2. Introduce file-wide cue settings
>>
>> At the moment if authors want to change the default display of cues,
>> they can only set them per cue (with the D:, S:, L:, A: and T: cue
>> settings) or have to use an external CSS file through a HTML page with
>> the ::cue pseudo-element. In particular when considering that all
>> Asian language files would require a “D:vertical” marker, it becomes
>> obvious that this replication of information in every cue is
>> inefficient and a waste of bandwidth, storage, and application speed.
>> A cue setting default section should be introduced into a file
>> header/setup area of WebVTT which will avoid such replication.
>>
>> An example document with cue setting defaults in the header could look
>> as follows:
>> ==
>> WEBVTT
>> Language=zh
>> Kind=Caption
>> CueSettings= A:end D:vertical
>>
>> 00:00:15.000 --> 00:00:17.950
>> 在左边我们可以看到...
>>
>> 00:00:18.160 --> 00:00:20.080
>> 在右边我们可以看到...
>>
>> 00:00:20.110 --> 00:00:21.960
>> ...捕蝇草械.
>> ==
>>
>> Note that you might consider that the solution to this problem is to
>> use external CSS to specify a change to all cues. However, this is not
>> acceptable for non-browser applications and therefore not an
>> acceptable solution to this problem.
>
> Indeed, repeating settings on each cue would be annoying. However, file-wide
> settings seems like it would easily be too broad, and you'd have to
> explicitly reverse the effect on the cues where you don't want it to apply.
> Maybe classes of cue settings or some kind of macros would work better.

Hmm, maybe we can have file-wide cue settings and classes that can be
explicitly used to override the file-wide ones. I am not overly fussed
how we solve it, but I do want to avoid the repetition.
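
As a rough illustration of the override idea (a sketch only: the
function names are made up, and it assumes per-cue settings simply
replace file-wide settings of the same name, which is one possible
rule, not an agreed one):

```python
def parse_settings(settings):
    """Turn a string like "A:end D:vertical" into {"A": "end", "D": "vertical"}."""
    result = {}
    for token in settings.split():
        name, sep, value = token.partition(":")
        if sep:  # ignore tokens that are not "name:value"
            result[name] = value
    return result


def effective_settings(file_defaults, cue_settings):
    """Per-cue settings override file-wide defaults of the same name."""
    merged = parse_settings(file_defaults)
    merged.update(parse_settings(cue_settings))
    return merged


defaults = "A:end D:vertical"
print(effective_settings(defaults, ""))         # cue inherits both defaults
print(effective_settings(defaults, "A:start"))  # cue overrides alignment only
```

Note that this only covers overriding a default with another explicit
value; cancelling a default altogether would still need some syntax of
its own.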

> Nitpick: Modern Chinese, including captions, is written left-to-right,
> top-to-bottom, just like English.

Gah, I should have used the Japanese examples that I had! I speak
neither so it's all unreadable to me anyway.

>> 3. Cue settings requirements
>
>> * naming: The usage of single letter abbreviations for cue settings
>> has created quite a discussion here at Google. We all agree that
>> file-wide cue settings are required and that this will reduce the need
>> for cue-specific cue settings. We can thus afford a bit more
>> readability in the cue settings. We therefore believe that it would be
>> better if the cue settings were short names rather than single letter
>> codes. This would be more like CSS, too, and easier to learn and get
>> right. In the interface description, the 5 dimensions have proper
>> names which could be re-used (“direction”, “linePosition”,
>> “textPosition”, “size” and “align”). We therefore recommend replacing
>> the single-letter cue commands with these longer names.
>>
>> An example document with more verbose cue settings could look as follows:
>> ==
>> WEBVTT
>> Language=zh
>> Kind=Caption
>> CueSettings= align:end direction:vertical
>>
>> 00:00:15.000 --> 00:00:17.950 linePosition:80%
>> 在左边我们可以看到...
>>
>> 00:00:18.160 --> 00:00:20.080
>> 在右边我们可以看到...
>>
>> 00:00:20.110 --> 00:00:21.960 size:70%
>> ...捕蝇草械.
>> ==
>
> I agree, every time I see the single-letter settings I have to go look at
> the spec to figure out what they mean. I'd be happy to have more explicit
> names. I'd be even happier if they match CSS terminology where possible.

I've been considering whether it should match CSS terminology where
possible. Since it's not possible for all of the settings, and for some
it makes no sense to use a different term from the one CSS uses, we will
probably end up with some overlap. I would not worry about that overlap,
but I also would not make matching CSS a design principle.

>> 4. Cue formatting requirements
>>
>> In analysing the available cue formatting functionality, we have found
>> that some features are missing. Most of these features can be added
>> through using CSS on cues that have received a <b>, <i>, <c> or <v>
>> marker. The following features are core to traditional TV and exist in
>> EBU STL and CEA-608/708 captions. Support of these will be a core
>> requirement for browsers as well as non-browser applications and it
>> makes sense to add these to WebVTT rather than relying on external CSS
>> which cannot be used for non-browser captions:
>
> The unstated requirement here seems to be that WebVTT needs to work as an
> interchange format for various TV captioning formats even in user agents
> without any support for CSS (or JavaScript). I'm trying to not make a straw
> man argument, but if we want an interchange format, we should pick TTML, which
> is explicitly designed to be just that and doesn't depend on CSS.
>
> Is it not enough that a lossy conversion can be made from various formats
> into WebVTT+CSS(+JavaScript)? If not, the "Web" in "WebVTT" is highly
> misleading...

We're trying to avoid the need for multiple transcodings and are
trying to achieve something like the following pipeline:
broadcast captions -> transcode to WebVTT -> show in browser ->
transcode to broadcast devices -> show

If we have to plug TTML into this pipeline, too, it will be much
slower and we would need to additionally define a mapping from TTML to
WebVTT and back.

I'm sure with SMPTE-TT around we will end up seeing things like
broadcast->TTML->WebVTT->browser, but even then we don't want WebVTT
to be a lossy format.

>> * textcolor: In particular on European TV it is common to distinguish
>> between speakers by giving their speech different colors. The
>> following colors are supported by EBU STL, CEA-608 and CEA-708 and
>> should be supported in WebVTT without the use of external CSS: black,
>> red, green, yellow, blue, magenta, cyan, and white. As default we
>> recommend white on a grey transparent background.
>
> What's wrong with <v Speaker>? If a completely automatic conversion is
> needed, why not <c.yellow>...</c>? Both methods have the distinct advantage
> of making it easy to change or disable the colors with only CSS changes.

I think indeed <c.yellow>...</c> will satisfy this requirement.

>> * underline: EBU STL, CEA-608 and CEA-708 support underlining of
>> characters. The underline character is also particularly important for
>> some Asian languages. Please make it possible to provide text
>> underlines without the use of CSS in WebVTT.
>
> Which Asian languages? If it's just the Chinese
> <http://en.wikipedia.org/wiki/Proper_name_mark>, then I don't think that
> needs <u> or similar. In my experience, use of the Chinese proper name mark
> is in fact extremely rare in Chinese captions, at least in movies and TV
> series from the mainland and Taiwan. It would be best to use e.g.
> 我來自<c.pnm>中國</c> to make it easy to change the style between
> single/double/wavy/no underline.

OK. So if we need underlined text, it will need to be
<c.underline>..</c> and CSS underline? I guess in a Web context
underline text is usually a hyperlink so it makes sense to discourage
<u> for the Web. But is that also an argument for
captions/subtitles/descriptions? What is the argument against using
<u> in captions?

>> * blink: As much as we would like to discourage blinking subtitles,
>> they are actually a core requirement for EBU STL and CEA-608/708
>> captions and in use in particular for emergency messages and similar
>> highly important information. Blinking can be considered optional for
>> implementation, but we should allow for it in the standard.
>
> 00:00.000 --> 00:00.500
> blinking
>
> 00:01.000 --> 00:01.500
> blinking
>
> 00:02.000 --> 00:02.500
> blinking
>
> Is that enough? In the context of the web there are much better ways to
> convey very important information than through blinking captions. Even alert()
> would be better.

If we were talking only about Web captions, I would agree. I laughed
about your solution and personally kinda like it. But from a
captioning/subtitling point of view it's probably hard to convert that
back to blinking text, since we've just lost the semantic by ripping
it into multiple cues (and every program would use different ways of
doing this). But I do think that <c.alert>...</c> or
<c.blink>...</c> could work as a solution. I hadn't really grasped the
power of the class span element yet.

>> * font face: CEA-708 provides a choice of eight font tags: undefined,
>> monospaced serif, proportional serif, monospaced sans serif,
>> proportional sans serif, casual, cursive, small capital. These fonts
>> should be available for WebVTT as well. Is this the case?
>
> Does the choice of font ever carry any semantic meaning? Isn't it a good
> thing that captions can't specify their own fonts, so that it's easy to pick
> a style that's suitable for the embedding site?

The choice of fonts for captions has traditionally been a key to
providing quality captions. Some fonts are more readable than others.
So, captioning handbooks have traditionally prescribed the best fonts
to use for captioning to explicitly point out those that are easily
readable. After having checked with the handbooks that are available
to me it seems sans serif and proportional are the preferred ones, so
I do wonder why CEA-708 provides this choice of fonts. You are right
though that it makes more sense to provide semantic meaning and then
style through CSS. At minimum <c.cursive> etc. would be possible with
an appropriate choice of font through styling, again using the class
span element to solve this.

Coming at it from a devices background, it's actually all a matter of
pre-defined choices. They're not going to package a large number of
fonts with every device, so it's good if all devices support a basic
subset that can be relied on to exist cross-device. We're increasingly
going to have to consider such requirements, too, because we will see
Web browsers run on devices with restricted capabilities, not just the
browser on a computer where you can install missing fonts.

I guess what we are discovering is that we can define the general
format of WebVTT for the Web, but that there may be an additional need
to provide minimum implementation needs (a "profile" if you want - as
much as I hate this word). This seems to apply to the file-wide
metadata fields, to some specific standard classes (underline, blink),
to the set of colors supported and to the set of fonts supported. I
don't think these are issues that browsers need to worry about, and
therefore are probably beyond what we need to specify here for WebVTT.
But there probably needs to be a group to do this eventually.

>> [On a side note, we wonder if it would make sense to introduce an
>> @kind=”annotation” type of TimedText track, which can then allow full
>> innerHTML markup be rendered on top of the video viewport. This would
>> probably need to be matched with full CSS support, too. It would allow
>> people to introduce unconventional caption display such as captions in
>> speech bubbles that can track the characters as they move or know
>> about what important objects are on the screen, so never overlap them.
>> Note that script in innerHTML needs to be dealt with carefully to
>> avoid XSS attacks. @kind=”annotation” is not required for ordinary
>> captions, so we have not investigated this need in full detail.]
>
> Won't Implement ;) For reasons already discussed at length, I think HTML in
> captions is a bad idea. Having *both* WebVTT cue text parsing and innerHTML
> parsing would be even more complicated, though.

It's not a problem. I definitely think we need some experimentation
with @kind="metadata" type applications before we define further @kind
values anyway. I know there are people that want to use it for
annotations and I know there are people that want hyperlinks. So,
let's see how this pans out before doing anything further here. I'd
definitely want to see captions, subtitles and descriptions supported
natively in browsers first - anything else can wait.

>> 5. Markup changes
>>
>> We have a couple of recommendations for changes mostly for aesthetic
>> and efficiency reasons. We would like to point out that Google is very
>> concerned with the dense specification of data and every surplus
>> character, in particular if it is repeated a lot and doesn’t fulfill a
>> need, should be removed to reduce the load created on worldwide
>> networking and storage infrastructures and help render Web pages
>> faster.
>
> Nitpick: Is network load really an issue here? Compared to the video files
> they accompany, caption files are tiny, even more so with gzip/deflate.

Even text can amount to a substantial amount of data. Compressed http
delivery will help. Keeping the caption/subtitle tracks in separate
files and only delivering those that a user really wants helps, too.
But even then a caption file for a 2 hour video can be a fairly big
file and we want them downloaded to the browser as quickly as
possible, such that the video player is not held back from playing the
video because it is still downloading the captions. So, serving
billions of caption files and serving them with as little latency as
possible are both good arguments for keeping the format dense.

>> * Time markers: WebVTT time stamps follow no existing standard for
>> time markers. Has the use of NPT as introduced by RTSP[5] for time
>> markers been considered (in particular npt-hhmmss)?
>>
>> [5] http://www.ietf.org/rfc/rfc2326.txt
>
> Unfortunately, the hour component is not optional in NPT. Also, the decimal
> part of seconds is of arbitrary precision, which doesn't seem necessary.

OK.

>> * Suggest dropping “-->”: In the context of HTML, “-->” is an end
>> comment marker. It may confuse Web developers and parsers if such a
>> sign is used as a separator. For example, some translation tools
>> expect HTML or XML-based interchange formats and interpret the “>” as
>> part of a tag. Also, common caption convention often uses “>” to
>> represent speaker identification. Thus it is more difficult to write a
>> filter which correctly escapes “-->” but retains “>” for speaker ID.
>
> Trying to use an HTML or XML parser to make any sense of WebVTT is going to
> fail horrendously in any case, so if anything I think it's good that they
> fail early. Also, a translation tool that has no concept of WebVTT is going
> to make a mess of various magic strings used in the file format too.
>
>> Since the “-->” characters serve no obvious purpose, it should be
>> possible to safely replace them by a blank that separates start and
>> end time, thus making the format denser and removing annoying parsing
>> issues. (Or alternatively use the npt-range spec of RTSP for time
>> ranges, which uses “-” as a separator.)
>
> No strong opinion, but I think a non-blank separator is more aesthetically
> pleasing.

Maybe just a dash "-" then, which can also remove the extra blanks?

>> * Duration specification: WebVTT time stamps are always absolute time
>> stamps calculated in relation to the base time of synchronisation with
>> the media resource. While this is simple to deal with for machines, it
>> is much easier for hand-created captions to deal with relative time
>> stamps for cue end times and for the timestamp markers within cues.
>> Cue start times should continue to stay absolute time stamps.
>> Timestamp markers within cues should be relative to the cue start
>> time. Cue end times should be possible to be specified either as
>> absolute or relative timestamps. The relative time stamps could be
>> specified through a prefix of “+” in front of a “ss.mmm” second and
>> millisecond specification. These are not only simpler to read and
>> author, but are also more compact and therefore create smaller files.
>>
>> An example document with relative timestamps is:
>> ==
>> WEBVTT
>> Language=en
>> Kind=Subtitle
>>
>> 00:00:15.000   +2.950
>> At the left we can see...
>>
>> 00:00:18.160    +1.920
>> At the right we can see the...
>>
>> 00:00:20.110   +1.850
>> ...the <+0.400>head-<+0.800>snarlers
>> ==
>
> I rather like it, although it might be confusing if "-" means "to absolute
> time" and "+" means "to relative time". That the intra-cue timings are
> relative but the timing lines are absolute has bugged me a bit, so if the
> distinction was more obvious just from the syntax, that'd be great!

With "-" you are referring to replacing "-->" with "-" to arrive at things like:
15.000-17.950
At the left we can see...

as compared to:
15.000+2.950
At the left we can see...

I actually think both read fairly well, given that people are used to
the double meaning of "-": to mean both "from ... to" and "minus".
But we could use a different character for "absolute time" if you
prefer, e.g. "/".
15.000/17.950
At the left we can see...

I find this fairly readable, too.
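
Whichever characters we pick, telling the variants apart is cheap for
a parser. A minimal Python sketch, assuming the syntax floated in this
thread ("-" or "/" before an absolute end time, "+" before a
duration); the names parse_timing and to_millis are hypothetical:

```python
import re

# Hypothetical timing line: START SEP END, where SEP is "-" or "/" for
# an absolute end time and "+" for a duration relative to START.
TIMING = re.compile(r"^\s*([\d:.]+)\s*([-/+])\s*([\d:.]+)\s*$")


def to_millis(stamp):
    """Convert "ss.mmm", "mm:ss.mmm" or "hh:mm:ss.mmm" to milliseconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return round(seconds * 1000)


def parse_timing(line):
    """Return (start, end) in milliseconds for one timing line."""
    match = TIMING.match(line)
    if match is None:
        raise ValueError("not a timing line: %r" % line)
    start_text, separator, end_text = match.groups()
    start = to_millis(start_text)
    end = to_millis(end_text)
    if separator == "+":
        end = start + end  # "+" means the second value is a duration
    return start, end


for example in ("15.000-17.950", "15.000+2.950", "15.000/17.950"):
    print(example, parse_timing(example))
```

All three spellings describe the same cue interval, so the choice is
really about what reads best to a human author.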

>> 6. Format identifier
>>
>> We are happy to see the introduction of the magic file identifier for
>> WebVTT which will make it easier to identify the file format. We do
>> not believe the “FILE” part of the string is necessary.
>
> I agree, mostly because it's ugly. While we're bikeshedding, "WebSRT" is
> prettier than "WEBSRT".

"WebVTT" rather than "WebSRT"? ;-)

>> However, we
>> recommend to also introduce a format version number that the file
>> adheres to, e.g. “WEBVTT 0.7”. This helps to make non-browser systems
>> that parse such files become aware of format changes. It can also help
>> identify proprietary standard metadata sets as used by a specific
>> company, such as “WEBVTT 0.7 ABC-meta1” which could signify that the
>> file adheres to WEBVTT 0.7 format specification with the ABC-meta1
>> metadata schema. Parsers are then made aware of what fields to expect
>> and can alert human operators of unexpected fields or markup.
>>
>> Browsers can safely ignore such a marker and instead do a best effort
>> on parsing based on what they understand.
>
> I strongly disagree, WebVTT shouldn't have a version indicator for the same
> reasons that HTML, CSS and JavaScript don't. Making proprietary extensions
> easier to maintain should be an anti-goal.

In a contract between a caption provider and a caption consumer (I am
talking about companies here), the caption consumer will want to tell
the caption provider what kind of features they expect the caption
files to contain and features they want avoided. This links back to
the earlier identified need for "profiles". This is actually probably
something outside the scope of this group, but I am sure there is a
need for such a feature, in particular if we want to keep the
development of the WebVTT specification open for future extensions.

I guess you could argue that such a profile is metadata on the file
and indeed we could use a name-value metadata field like "profile=0.7"
to communicate this. I am not fussed if this is the way it will have
to go. I can understand that browsers will ignore this information
anyway. It's like a promise of the file towards any consuming
application that only features that satisfy that profile (or version)
are used in the file, but it indeed has no bearing on browsers.

>> 7. Comments
>
>> we recommend the introduction of comments.
>
> I agree and think it needs to happen before WebVTT starts to get implemented
> and used on the web. In other words: now.

Agreed. I'm happy for the previously suggested "//" at the line start
to be comments, or, for that matter, "#" or ";" or any other special
character. I would prefer not to use "/*" since it implies a "*/" is
required to end the comment. Similarly we should avoid "" or anything else that requires a special comment end mark and
more than one or two characters.

>> 8. Line wrapping
>>
>> CEA-708 captions support automatic line wrapping in a more
>> sophisticated way than WebVTT -- see
>> http://en.wikipedia.org/wiki/CEA-708#Word_wrap.
>>
>> In our experience with YouTube we have found that in certain
>> situations this type of automatic line wrapping is very useful.
>> Captions that were authored for display in a full-screen video may
>> contain too many words to be displayed fully within the actual video
>> presentation (note that mobile / desktop / internet TV devices may
>> each have a different amount of space available, and embedded videos
>> may be of arbitrary sizes). Furthermore, user-selected fonts or font
>> sizes may be larger than expected, especially for viewers who need
>> larger print.
>>
>> WebVTT as currently specified wraps text at the edge of their
>> containing blocks, regardless of the value of the 'white-space'
>> property, even if doing so requires splitting a word where there is no
>> line breaking opportunity. This will tend to create poor quality
>> captions.  For languages where it makes sense, line wrapping should
>> only be possible at carriage return, space, or hyphen characters, but
>> not on &nbsp; characters.  (Note that CEA-708 also contains
>> non-breaking space and non-breaking transparent space characters to
>> help control wrapping.) However, this algorithm will not necessarily
>> work for all languages.
>>
>> We therefore suggest that a better solution for line wrapping would be
>> to use the existing line wrapping algorithms of browsers, which are
>> presumably already language-sensitive.
>>
>> [Note: the YouTube line wrapping algorithm goes even further by
>> splitting single caption cues into multiple cues if there is too much
>> text to reasonably fit within the area. YouTube then adjusts the times
>> of these caption cues so they appear sequentially.  Perhaps this could
>> be mentioned as another option for server-side tools.]
>
> Yeah, with SRT people are manually line-wrapping when authoring the captions
> and often enough the end result is that you get something rendered:
>
> - Who could have guessed that not all fonts are the same
> size?
> - That's news to me, so I get four lines of text where I
> wanted two!
>
> I'm inclined to say that we should normalize all whitespace during parsing
> and not have explicit line breaks at all. If people really want two lines,
> they should use two cues. In practice, I don't know how well that would
> fare, though. What other solutions are there?

I don't think I would go that far. The concern has mostly been with
the line wrapping of lines that are too long and the possibility of
splitting words that way. The particular concern was with this
paragraph:

"Text runs must be wrapped at the edge of their containing blocks,
regardless of the value of the 'white-space' property, even if doing
so requires splitting a word where there is no line breaking
opportunity."
see http://www.whatwg.org/specs/web-apps/current-work/multipage/rendering.html#timed-text-tracks-0

So we want to avoid splitting mid-word and we suggest introducing the
ability to have non-breaking spaces.
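
As a rough sketch of the behaviour we are after, Python's textwrap
module can be told never to split a word, and it does not treat U+00A0
as a break opportunity. This counts characters rather than rendered
glyph widths, so it only models the breaking rules, not what a browser
would actually lay out; wrap_cue is a made-up name.

```python
import textwrap

NBSP = "\u00a0"  # non-breaking space


def wrap_cue(text, width):
    """Wrap cue text at spaces only, never in the middle of a word.

    A word longer than the width is left whole on its own line, and
    words joined by a non-breaking space stay together.
    """
    return textwrap.wrap(text, width=width, break_long_words=False)


for line in wrap_cue("Dr." + NBSP + "Smith arrived at noon", 10):
    print(repr(line))
```

An over-long word then overflows the line instead of being cut in two,
which is the trade-off we would prefer for captions.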

>> B. Feedback on the <track> element
>>
>>
>> 1. Pop-on/paint-on/roll-up support
>>
>> Three different types of captions are common on TV: pop-on, roll-up
>> and paint-on. Captions according to CEA-608/708 need to support
>> captions of all three of these types. We believe they are already
>> supported in WebVTT, but see a need to re-confirm.
>
> The underlying use case here is live captioning, right? Just copying the
> styling used on broadcast TV seems like it wouldn't be enough, you also need
> the ability to erase typos, right? Are there any existing captioning formats
> that handle live captioning well from which one could draw inspiration?

Yes, CEA-608/708 do these things and we have analysed them for these
features. They have control characters for backspace (only within
row), delete to end of row, erase displayed memory and erase
non-displayed memory. Further there is the concept of a cursor and
there are means to move the cursor to other screen locations.

I don't think we really need the concept of a cursor or display memory
and we don't need backspace and delete etc. because we have the
concept of mutableTimedTrack. So, a live captioning application can
always remove an existing TimedTrackCue and replace it with a new one
where the errors are fixed. At Google we came to the conclusion that
this was sufficient and therefore did not see a need to request
features for this type of application.

However, the three types of captions are actually not just used in
live captioning, but they are three different captioning styles that
could all be created by live or "canned" captions. We think they can
be supported, so this is good news.

>> 2. Duplicate track
>>
>> The HTML spec specifies that it is not allowed to have two tracks that
>> provide the same kind of data for the same language (potentially
>> empty) and for the same label (potentially empty). However, we need
>> clarification on what happens if there is a duplicate track, ie: does
>> the most recent one win or the first one or will both be made
>> available in the UI and JavaScript? The spec only states that the
>> combination of {kind, type, label} must be unique. It doesn't say what
>> happens if they are not.
>
> In <http://whatwg.org/html#sourcing-out-of-band-text-tracks> all tracks are
> added to the list of text tracks, even duplicates.
>
> In other words, it's just a requirement for validators, not user agents.

OK, so a browser still has to deal with "duplicate tracks" as though
they were not duplicates?

>> Further, the spec says nothing about duplicate labels altogether -
>> what is a browser supposed to do when two tracks have been marked with
>> the same label?
>
> We'd show the same text in the context menu and let the user be confused, I
> guess. It's very easy for authors who care about not confusing their users
> to fix, so I don't think browsers need to be clever here.

OK, fair enough. :-)

>> 4. Addressing individual cues through CSS
>>
>> As far as we understand, you can currently address all cues through
>> ::cue and you can address a cue part through ::cue-part(<voice> ||
>> <part> || <position> || <future-compatibility>). However, if we
>> understand correctly, it doesn’t seem to be possible to address an
>> individual cue through CSS, even though cues have individual
>> identifiers. This is either an oversight or a misunderstanding on our
>> parts. Can you please clarify how it is possible to address an
>> individual cue through CSS?
>
> Since I've been arguing against the id's in WebVTT, I'm curious about the
> use case here. Isn't using a unique class good enough?

This links in with the discussion above on CSS styling and classes.
Rather than define classes of cue settings and reference them from the
cues, this allows them to be applied to individual cues in style
sheets. I thought the whole reason of cue identifiers was to have this
addressing functionality, so this would just close the loop.

For example:

Style sheet of the Web page:
<style>
video track#t1 ::cue(cue10) {
  text-decoration: blink;
}
</style>

The Web page (extract):
<video src="video.webm" controls>
  <track id="t1" label="captions" kind="captions" srclang="en-US"
src="cap1.vtt"/>
</video>

The caption file cap1.vtt:
WEBVTT
Language=en-US
Kind=Captions

cue1
0.000-5.000
blah blah

cue10
40.000-60.000
ALERT: Your basement is flooding - evacuate!

Cue10 is addressed through CSS and turned into a blinking text without
a need to change the markup at all.

>> 5. Ability to move captions out of the way
>>
>> Our experience with automated caption creation and positioning on
>> YouTube indicates that it is almost impossible to always place the
>> captions out of the way of where a user may want to look.
>> We therefore allow users to dynamically move the caption rendering
>> area to a different viewport position to reveal what is underneath. We
>> recommend such drag-and-drop functionality also be made available for
>> TimedTrack captions on the Web, especially when no specific
>> positioning information is provided.
>
> This would indeed be rather nice, but wouldn't it interfere with text
> selection? Detaching the captions into a floating, draggable window via the
> context menu would be a theoretically possible solution, but that's getting
> rather far ahead of ourselves before we have basic captioning support.

On YouTube you can only move them within the video viewport. You
should try it - it's really awesome actually.

When you say "interfere with text selection" are you suggesting that
the text of captions/subtitles should be able to be cut and pasted? I
wonder what copyright holders think about that.

Cheers,
Silvia.