[whatwg] Google Feedback on the HTML5 media a11y specifications

Silvia Pfeiffer silviapfeiffer1 at gmail.com
Fri Jan 14 01:01:38 PST 2011

Hi everyone,

Over the last months I have worked with several people at Google who
are amongst else from areas of accessibility, YouTube, Google Video,
Google Chrome and WebM. We have analysed the <track> specification and
the WebVTT (former WebSRT) specification mostly from a captioning
point of view. Here is the summary of all the feedback we have put
together. There are some questions of clarification and some proposals
for improvements. We hope this will spawn some discussion and help
improve the specifications.

Best Regards,
Silvia (with a Google contractor hat on).

Feedback on the HTML5 media a11y specifications

At Google we have reviewed the <track> and caption text file format
proposal to introduce caption support into HTML5 video. We have in
particular compared it to our needs for YouTube and to needs that
originate from making sure that caption files sourced from legacy TV
content can be represented as intended. We focused on the European EBU
STL and the US CEA-608 and 708 caption standards. We particularly care
that it is possible for people to watch their TV shows online with the
exact same experience as they receive it on TV, since this is a
contractual obligation with many content owners. We also care that a
transform from broadcast to WebVTT and back to broadcast can be done
without loss of information.

Overall, the WebVTT format and the <track> element and related
TimedTrack JavaScript API seem to satisfy most of the requirements. A
few major and a couple of minor gaps remain to be filled.

This document summarizes where we have seen gaps and proposes changes.
There are two sections - the first one concerns the WebVTT format and
the second one the <track> specification.

A. Feedback on the WebVTT format

In our opinion, WebVTT (with a few more improvements as proposed
below) *when used in Web browsers* in combination with <track> and CSS
can provide the same level of features as current TV caption formats.
Therefore we are happy for WebVTT to become the baseline caption
format for HTML5. We expect that browsers will implement support for
other formats, too, and are satisfied that this has been made possible
through the <track> element.

We are concerned, however, about the introduction of WebVTT as a
universal captioning format *when used outside browsers*. Since a
subset of CSS features is required to bring HTML5 video captions on
par with TV captions, non-browser applications will need to support
these CSS features, too. However, we do not believe that external CSS
files are an acceptable solution for non-browser captioning and
therefore think that those CSS features (see [1]) should eventually be
made part of the WebVTT specification.

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/rendering.html#the-'::cue'-pseudo-element

Aside from this general observation, we have the following feedback for WebVTT:

1. Introduce file-wide metadata

WebVTT requires a structure to add header-style metadata. We are here
talking about lists of name-value pairs as typically in use for header
information. The metadata can be optional, but we need a defined means
of adding them.

Required attributes in WebVTT files should be the main language in use
and the kind of data found in the WebVTT file - information that is
currently provided in the <track> element by the @srclang and @kind
attributes. These are necessary to allow the files to be interpreted
correctly by non-browser applications, for transcoding or to determine
if a file was created as a caption file or something else, in
particular the @kind=metadata. @srclang also sets the base
directionality for BiDi calculations.

Further metadata fields that are typically used by authors to keep
specific authoring information or usage hints are necessary, too. As
examples of current use see the format of MPlayer mpsub’s header
metadata [2], EBU STL’s General Subtitle Information block [3], and
even CEA-608’s Extended Data Service with its StartDate, Station,
Program, Category and TVRating information [4]. Rather than specifying
a specific subset of potential fields we recommend to just have the
means to provide name-value pairs and leave it to the negotiation
between the author and the publisher which fields they expect of each

[2] http://www.mplayerhq.hu/DOCS/tech/mpsub.sub
[3] https://docs.google.com/viewer?a=v&q=cache:UKnzJubrIh8J:tech.ebu.ch/docs/tech/tech3264.pdf
[4] http://edocket.access.gpo.gov/cfr_2007/octqtr/pdf/47cfr15.119.pdf

An example document with header metadata could look as follows:
Title=Weird Al Yankovic - The Saga Begin

00:00:15.000 --> 00:00:18.000
A long, long time ago...

2. Introduce file-wide cue settings

At the moment if authors want to change the default display of cues,
they can only set them per cue (with the D:, S:, L:, A: and T:. cue
settings) or have to use an external CSS file through a HTML page with
the ::cue pseudo-element. In particular when considering that all
Asian language files would require a “D:vertical” marker, it becomes
obvious that this replication of information in every cue is
inefficient and a waste of bandwidth, storage, and application speed.
A cue setting default section should be introduced into a file
header/setup area of WebVTT which will avoid such replication.

An example document with cue setting defaults in the header could look
as follows:
CueSettings= A:end D:vertical

00:00:15.000 --> 00:00:17.950

00:00:18.160 --> 00:00:20.080

00:00:20.110 --> 00:00:21.960

Note that you might consider that the solution to this problem is to
use external CSS to specify a change to all cues. However, this is not
acceptable for non-browser applications and therefore not an
acceptable solution to this problem.

3. Cue settings requirements

After much analysis, we believe that the five available cue settings
are largely sufficient for online caption needs. However, we would
like to clarify the following issues:

* positioning: Generally the way in which we need positioning to work
is to provide an anchor position for the text and then explain in
which direction font size changes and the addition of more text allows
the text segment to grow. It seems that the line position cue (L)
provides a baseline position and the alignment cue (A) provides the
growing direction start/middle/end. Can we just confirm this

* fontsize: When changing text size in relation to the video changing
size or resolution, we need to make sure not to reduce the text size
below a specific font size for readability reasons. And we also need
to make sure not to make it larger than a specific font size, since
otherwise it will dominate the display. We usually want the text to be
at least Xpx, but no bigger than Ypx. Also, one needs to pay attention
to the effect that significant player size changes have on relative
positioning - in particular for the minimum caption text size. Dealing
with min and max sizes is missing from the current specification in
our understanding.

* bidi text: In our experience from YouTube, we regularly see captions
that contain mixed languages/directionality, such as Hebrew captions
that have a word of English in it. How do we allow for bidi text
inside cues? How do we change directionality mid-cue? Do we deal with
the zero-width LTR-mark and RTL-mark unicode characters? It would be
good to explain how these issues are dealt with in WebVTT.

* internationalisation: D:vertical and D:vertical-lr seem to only work
for vertical text - how about horizontal-rl? For example, Hebrew is a
prime example of a language being written from right to left
horizontally. Is that supported and how?

* naming: The usage of single letter abbreviations for cue settings
has created quite a discussion here at Google. We all agree that
file-wide cue settings are required and that this will reduce the need
for cue-specific cue settings. We can thus afford a bit more
readability in the cue settings. We therefore believe that it would be
better if the cue settings were short names rather than single letter
codes. This would be more like CSS, too, and easier to learn and get
right. In the interface description, the 5 dimensions have proper
names which could be re-used (“direction”, “linePosition”,
“textPosition”, “size” and “align"). We therefore recommend replacing
the single-letter cue commands with these longer names.

An example document with more verbose cue settings could look as follows:
CueSettings= align:end direction:vertical

00:00:15.000 --> 00:00:17.950 linePosition:80%

00:00:18.160 --> 00:00:20.080

00:00:20.110 --> 00:00:21.960 size:70%

4. Cue formatting requirements

In analysing the available cue formatting functionality, we have found
that some features are missing. Most of these features can be added
through using CSS on cues that have received a <b>, <i>, <c> or <v>
marker. The following features are core to traditional TV and exist in
EBU STL and CEA-608/708 captions. Support of these will be a core
requirement for browsers as well as non-browser applications and it
makes sense to add these to WebVTT rather than relying on external CSS
which cannot be used for non-browser captions:

* textcolor: In particular on European TV it is common to distinguish
between speakers by giving their speech different colors. The
following colors are supported by EBU STL, CEA-608 and CEA-708 and
should be supported in WebVTT without the use of external CSS: black,
red, green, yellow, blue, magenta, cyan, and white. As default we
recommend white on a grey transparent background.

* underline: EBU STL, CEA-608 and CEA-708 support underlining of
characters. The underline character is also particularly important for
some Asian languages. Please make it possible to provide text
underlines without the use of CSS in WebVTT.

* blink: As much as we would like to discourage blinking subtitles,
they are actually a core requirement for EBU STL and CEA-608/708
captions and in use in particular for emergency messages and similar
highly important information. Blinking can be considered optional for
implementation, but we should allow for it in the standard.

* font face: CEA-708 provides a choice of eight font tags: undefined,
monospaced serif, proportional serif, monospaced sans serif,
proportional sans serif, casual, cursive, small capital. These fonts
should be available for WebVTT as well. Is this the case?

We are not sure about the best solution to these needs. Would it be
best to introduce specific tags for these needs? Or would it be best
to introduce a subset of CSS commands into WebVTT through something
like a @style attribute on <c>? We do not recommend support of the
full HTML/CSS markup for captions, since we believe captions do not
typically need this flexibility and conversion to other formats
becomes more difficult. But we require a solution that is an inherent
part of the WebVTT specification and does not rely on requiring
<track> and CSS because we want to use these features in non-browser
applications, too.

[On a side note, we wonder if it would make sense to introduce an
@kind=”annotation” type of TimedText track, which can then allow full
innerHTML markup be rendered on top of the video viewport. This would
probably need to be matched with full CSS support, too. It would allow
people to introduce unconventional caption display such as captions in
speech bubbles that can track the characters as they move or know
about what important objects are on the screen, so never overlap them.
Note that script in innerHTML needs to be dealt with carefully to
avoid XSS attacks. @kind=”annotation” is not required for ordinary
captions, so we have not investigated this need in full detail.]

5. Markup changes

We have a couple of recommendations for changes mostly for aesthetic
and efficiency reasons. We would like to point out that Google is very
concerned with the dense specification of data and every surplus
character, in particular if it is repeated a lot and doesn’t fulfill a
need, should be removed to reduce the load created on worldwide
networking and storage infrastructures and help render Web pages

* Time markers: WebVTT time stamps follow no existing standard for
time markers. Has the use of NPT as introduced by RTSP[5] for time
markers been considered (in particular npt-hhmmss)?

[5] http://www.ietf.org/rfc/rfc2326.txt

* Suggest dropping “-->”: In the context of HTML, “-->” is an end
comment marker. It may confuse Web developers and parsers if such a
sign is used as a separator. For example, some translation tools
expect HTML or XML-based interchange formats and interpret the “>” as
part of a tag. Also, common caption convention often uses “>” to
represent speaker identification. Thus it is more difficult to write a
filter which correctly escapes “-->” but retains “>” for speaker ID.

Since the “-->” characters serve no obvious purpose, it should be
possible to safely replace them by a blank that separates start and
end time, thus making the format denser and removing annoying parsing
issues. (Or alternatively use a the npt-range spec of RTSP for time
ranges, which uses “-” as a separator.).

An example document without the use of “-->” is:

00:00:15.000   00:00:17.950
At the left we can see...

00:00:18.160    00:00:20.080
At the right we can see the...

00:00:20.110    00:00:21.960
...the head-snarlers

* Duration specification: WebVTT time stamps are always absolute time
stamps calculated in relation to the base time of synchronisation with
the media resource. While this is simple to deal with for machines, it
is much easier for hand-created captions to deal with relative time
stamps for cue end times and for the timestamp markers within cues.
Cue start times should continue to stay absolute time stamps.
Timestamp markers within cues should be relative to the cue start
time. Cue end times should be possible to be specified either as
absolute or relative timestamps. The relative time stamps could be
specified through a prefix of “+” in front of a “ss.mmm” second and
millisecond specification. These are not only simpler to read and
author, but are also more compact and therefore create smaller files.

An example document with relative timestamps is:

00:00:15.000   +2.950
At the left we can see...

00:00:18.160    +1.920
At the right we can see the...

00:00:20.110   +1.850
...the <+0.400>head-<+0.800>snarlers

6. Format identifier

We are happy to see the introduction of  the magic file identifier for
WebVTT which will make it easier to identify the file format. We do
not believe the “FILE” part of the string is necessary. However, we
recommend to also introduce a format version number that the file
adheres to, e.g. “WEBVTT 0.7”. This helps to make non-browser systems
that parse such files become aware of format changes. It can also help
identify proprietary standard metadata sets as used by a specific
company, such as “WEBVTT 0.7 ABC-meta1” which could signify that the
file adheres to WEBVTT 0.7 format specification with the ABC-meta1
metadata schema. Parsers are then made aware of what fields to expect
and can alert human operators of unexpected fields or markup.

Browsers can safely ignore such a marker and instead do a best effort
on parsing based on what they understand.


We expect WebVTT to be authored manually as well as by automated
tools. File formats that are authored manually typically allow the use
of comments to allow authors to leave messages to themselves or others
during the process of authoring or for managing the files.

Further, many commercial files we’ve seen created by automated tools
also have various comments identifying which movie they are for, or
who created the captions, etc.  Comments allow them to inject this
sort of information without requiring the specification to think of
every possible metadata attribute type ahead of time.

Finally we note that even the binary STL format of EBU supports
comment indicators, where a complete cue can be turned into a comment.
This is another means of introducing comment markup and allows one to
easily turn a complete cue on/off.

Given all these use cases, we recommend the introduction of comments.
Comments may e.g. take on the form of Web page comments as in: <!--
comment --> or may take on another appropriate form.

An example document with the use of comments is:
<!-- maybe just “en” could be more appropriate? -->

00:00:15.000   00:00:17.950
At the left we can see...

00:00:18.160    00:00:20.080
At the right we can see the...

00:00:20.110    00:00:21.960
...the head-snarlers
<!-- whatever that word means... -->

8. Line wrapping

CEA-708 captions support automatic line wrapping in a more
sophisticated way than WebVTT -- see

In our experience with YouTube we have found that in certain
situations this type of automatic line wrapping is very useful.
Captions that were authored for display in a full-screen video may
contain too many words to be displayed fully within the actual video
presentation (note that mobile / desktop / internet TV devices may
each have a different amount of space available, and embedded videos
may be of arbitrary sizes). Furthermore, user-selected fonts or font
sizes may be larger than expected, especially for viewers who need
larger print.

WebVTT as currently specified wraps text at the edge of their
containing blocks, regardless of the value of the 'white-space'
property, even if doing so requires splitting a word where there is no
line breaking opportunity. This will tend to create poor quality
captions.  For languages where it makes sense, line wrapping should
only be possible at carriage return, space, or hyphen characters, but
not on   characters.  (Note that CEA-708 also contains
non-breaking space and non-breaking transparent space characters to
help control wrapping.)However, this algorithm will not necessarily
work for all languages.

We therefore suggest that a better solution for line wrapping would be
to use the existing line wrapping algorithms of browsers, which are
presumably already language-sensitive.

[Note: the YouTube line wrapping algorithm goes even further by
splitting single caption cues into multiple cues if there is too much
text to reasonably fit within the area. YouTube then adjusts the times
of these caption cues so they appear sequentially.  Perhaps this could
be mentioned as another option for server-side tools.]

B. Feedback on the <track> element

1. Pop-on/paint-on/roll-up support

Three different types of captions are common on TV: pop-on, roll-up
and paint-on. Captions according to CEA-608/708 need to support
captions of all three of these types. We believe they are already
supported in WebVTT, but see a need to re-confirm.

For pop-on captions, a complete caption cue is timed to appear at a
certain time and disappear a few seconds later. This is the typical
way in which captions are presented and also how WebVTT/<track> works
in our understanding. Is this correct?

For roll-up captions, individual lines of captions are presented
successively with older lines moving up a line to make space for new
lines underneath. Assuming we understand the WebVTT rendering rules
correctly, WebVTT would identify each of these lines as an individual,
but time-overlapping cue with the other cues. As more cues are created
and overlap in time, newer cues are added below the currently visible
ones and move the currently visible ones up, basically creating a
roll-up effect. If this is a correct understanding, then this is an
acceptable means of supporting roll-up captions.

Finally, for paint-on captions, individual letters or words are
displayed successively on screen. WebVTT supports this functionality
with the cue timestamps <xx:xx:xx.xxx>, which allows to specify
characters or words to appear with a delay from within a cue. This
essentially realizes paint-on captions. Is this correct? (Note that we
suggest using relative timestamps inside cues to make this feature
more usable.)

2. Duplicate track

The HTML spec specifies that it is not allowed to have two tracks that
provide the same kind of data for the same language (potentially
empty) and for the same label (potentially empty). However, we need
clarification on what happens if there is a duplicate track, ie: does
the most recent one win or the first one or will both be made
available in the UI and JavaScript? The spec only states that the
combination of {kind, type, label} must be unique. It doesn't say what
happens if they are not.
Further, the spec says nothing about duplicate labels altogether -
what is a browser supposed to do when two tracks have been marked with
the same label?

3. Default track choice
It is very important that there is a possibility for users to
auto-activate tracks. Which track is chosen as the default track to
activate depends on the language preferences of the user. The user is
assumed to have a list of language preferences which leads this

In YouTube, if any tracks exist that match the first language
preference, the first of those is used as the default.  A track with
no name sorts ahead of one with a name.  The sorting is done according
to that language's collation order. In order to override this you
would need (1) a default=true attribute for a track which gives it
precedence if its language matches, and (2) a way to force the
language preference. If no tracks exist for the first language pref,
the second language pref is checked, and so on.

If the user's language preferences are known, and there are no tracks
in that language, you have other options:
  (1) offer to do auto-translation (or just do it)
  (2) use a track in the same language that the video's audio is in (if known)
  (3) if only one track, use the first available track

Also make sure the language choice can be overriden by the user
through interaction.

We’d like to make sure this or a similar algorithm is the recommended
way in which browsers deal with caption tracks.

4. Addressing individual cues through CSS

As far as we understand, you can currently address all cues through
::cue and you can address a cue part through ::cue-part(<voice> ||
<part> || <position> || <future-compatibility>). However, if we
understand correctly, it doesn’t seem to be possible to address an
individual cue through CSS, even though cues have individual
identifiers. This is either an oversight or a misunderstanding on our
parts. Can you please clarify how it is possible to address an
individual cue through CSS?

5. Ability to move captions out of the way

Our experience with automated caption creation and positioning on
YouTube indicates that it is almost impossible to always place the
captions out of the way of where a user may be interested to look at.
We therefore allow users to dynamically move the caption rendering
area to a different viewport position to reveal what is underneath. We
recommend such drag-and-drop functionality also be made available for
TimedTrack captions on the Web, especially when no specific
positioning information is provided.

More information about the whatwg mailing list