[whatwg] Thoughts on video accessibility
ian at hixie.ch
Sat Dec 27 01:16:02 PST 2008
I have carefully read all the feedback in this thread concerning
associating text with video, for various purposes such as captions,
subtitles, and annotations.
Taking a step back, as far as I can tell there are two axes: where the
timed text comes from, and how it is rendered.
Where it comes from, it seems, boils down to three options:
- embedded in or referenced from the media resource itself
- as a separate file parsed by the user agent
- as a separate file parsed by the web page
Where the timed text is rendered boils down to two options:
- rendered automatically by the user agent
- rendered by the web page overlaying content on the video
For the purposes of this discussion I am ignoring burned-in captions,
since they're basically equivalent to a different video, much like videos
with overlaid sign language interpreters (or VH1's Pop-Up Video
annotations!).
These 5 options (3 sources crossed with 2 rendering modes) give us 6
cases:
1. Timed text in the resource itself (or linked from the resource itself),
rendered as part of the video automatically by the user agent.
This is the optimal situation from an accessibility and usability point of
view, because it works when the video is shown full-screen, it works when
the video is saved separate from the Web page, it works easily when other
pages link to the same video file, it requires minimal work from the page
author, and so forth.
This is what I think we should be encouraging.
It would probably make sense to expose the timed text track selection to
the Web page through the API, maybe even expose the text itself somehow,
but these are features that can and should probably wait until <video> has
been more reliably implemented.
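To make the idea concrete, here is a rough sketch of what such an API
might look like from the page's side. No such API exists yet, so every
name here (`TimedTextTrack`, `TimedTrackList`, `mode`, `select`, and so
on) is invented purely for illustration; the point is only the shape:
the user agent discovers the tracks in the media resource, and the page
can inspect them and toggle which one the user agent renders.

```javascript
// Hypothetical sketch only -- no such API existed at the time of writing.
// Models how a user agent might expose embedded timed text tracks to the
// page; all names here are invented for illustration.
class TimedTextTrack {
  constructor(label, language) {
    this.label = label;       // human-readable name, e.g. "English captions"
    this.language = language; // language tag, e.g. "en"
    this.mode = 'disabled';   // 'disabled' | 'showing'; the UA renders 'showing' tracks
    this.cues = [];           // would be populated by the UA from the media resource
  }
}

class TimedTrackList {
  constructor(tracks) { this.tracks = tracks; }

  // Let the page enable one track (by language) and disable the rest.
  select(language) {
    for (const t of this.tracks) {
      t.mode = (t.language === language) ? 'showing' : 'disabled';
    }
  }

  // The tracks the user agent is currently rendering.
  showing() { return this.tracks.filter(t => t.mode === 'showing'); }
}

// Usage: a page inspecting the tracks the UA found in the video file.
const list = new TimedTrackList([
  new TimedTextTrack('English captions', 'en'),
  new TimedTextTrack('Sous-titres francais', 'fr'),
]);
list.select('fr');
console.log(list.showing().map(t => t.label)); // [ 'Sous-titres francais' ]
```

Note that even this minimal surface lets the page build its own track
menu while leaving the actual rendering to the user agent, which is the
division of labour case 1 argues for.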
2. Timed text in the resource itself (or linked from the resource itself),
exposed to the Web page with no native rendering.
This allows pages to implement experimental subtitling mechanisms while
still allowing the timed text tracks to survive re-use of the video file,
but it seems to introduce a high cost (all pages have to implement
subtitling themselves) with very little gain, and with several
disadvantages -- different sites will have inconsistent subtitling, bugs
will be prevalent in the subtitling and accessibility will thus suffer,
and in all likelihood even videos that have subtitles will end up not
having them shown, as small sites won't bother to implement anything
but the most basic controls.
3. Timed text stored in a separate file, which is then parsed by the user
agent and rendered as part of the video automatically by the browser.
This would make authoring subtitles somewhat easier, but would typically
lose the benefits of subtitles surviving when the video file is extracted.
It would also involve a distinct increase in implementation and language
complexity. We would also have to pick a timed text format, or add yet
another format war to the <video>/<audio> codec debacle, which I think
would be a really big mistake right now. Given the immature state of timed
text formats (it seems there are new formats announced every month), it's
probably premature to pick one -- we should let the market pick one first.
4. Timed text stored in a separate file, which is then parsed by the user
agent and exposed to the Web page with no native rendering.
This combines the disadvantages of the previous two options, without
really introducing any groundbreaking advantages.
5. Timed text stored in a separate file, which is then fetched and parsed
by the Web page, which then passes it to the browser for rendering.
This is just an excessive level of complexity for a feature that could
just be supported exclusively by the user agent. In particular, it doesn't
actually provide for much space for experimentation -- whatever API we
provide to expose the subtitles would limit what the rendering would be
like regardless of what the pages want to try.
This option side-steps the issue of picking a format, though.
6. Timed text stored in a separate file, which is then fetched and parsed
by the Web page, and which is then rendered by the Web page.
We can't stop this from being available, and there's not much we can do to
help with this case beyond what we do now. The disadvantages are that it
doesn't work when the video is shown full-screen, when the video is saved
separate from the Web page, when other pages link to the same video file
without using their own implementation of the feature, and it requires
substantial implementation work from the page. The _advantages_, and they
are significant, are that pages can easily create subtitles separate from
the video, they can easily provide features such as automated
translations, and they can easily implement features that would otherwise
seem overly ambitious, e.g. hyperlinked annotations with ad tracking.
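For what it's worth, the page-side machinery case 6 requires isn't
exotic; a page can already do the whole pipeline today. The sketch below
parses an SRT-style subtitle string (as a page might fetch with
XMLHttpRequest) and picks the cue to overlay at a given playback time.
The helper names (`parseSrt`, `activeCue`) are mine, not from any spec:

```javascript
// Sketch of the page-side approach in case 6: the page fetches a subtitle
// file itself, parses it, and decides which cue to overlay at the video's
// current playback time. Helper names are illustrative, not standardized.

// "HH:MM:SS,mmm" -> seconds
function parseTime(s) {
  const m = s.match(/(\d+):(\d+):(\d+),(\d+)/);
  return (+m[1]) * 3600 + (+m[2]) * 60 + (+m[3]) + (+m[4]) / 1000;
}

// Split an SRT-style string into { start, end, text } cues.
function parseSrt(text) {
  return text.trim().split(/\n\s*\n/).map(block => {
    const lines = block.trim().split('\n');
    const [start, end] = lines[1].split(' --> ').map(parseTime);
    return { start, end, text: lines.slice(2).join('\n') };
  });
}

// The cue to overlay at a given time, or null if none is active.
function activeCue(cues, time) {
  return cues.find(c => time >= c.start && time < c.end) || null;
}

const srt = `1
00:00:01,000 --> 00:00:04,000
Hello, world.

2
00:00:05,500 --> 00:00:08,000
Second caption.`;

const cues = parseSrt(srt);
console.log(activeCue(cues, 2.0).text); // "Hello, world."
console.log(activeCue(cues, 4.5));      // null (gap between cues)
```

The rendering half is then just absolutely-positioned text over the
video element, updated from a timeupdate handler -- which is exactly why
it breaks in full-screen mode and when the video is saved on its own.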
Based on this analysis it seems to me that cases 1 and 6 are important to
support, but that cases 2 to 5 aren't as compelling -- they either have
disadvantages that aren't outweighed by their advantages, or they are
simply less powerful than other options.
Cases 1 and 6 right now don't require changes to the spec. I think we
should eventually provide the APIs mentioned above under case 1 since they
would help bridge the gap between the two types of timed text solutions,
but as noted above I think we should wait until implementations are more
mature before extending the API further.
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'