[whatwg] Thoughts on video accessibility

Silvia Pfeiffer silviapfeiffer1 at gmail.com
Sat Dec 27 04:14:40 PST 2008


Hi Ian,

Thanks for taking the time to go through all the options, analyse and
understand them - especially on your birthday! :-) Much appreciated!

I agree with your analysis and the 6 options you have identified.

However, I disagree slightly with the conclusions you have come to -
mostly from a strategic viewpoint rather than from where we currently
stand.

Your proposal is to support cases 1 and 6 and not to worry about the
others at this stage. This is a fair enough statement for the current
state of play.

Support for case 1 comes from the fact that there are indeed a number
of video container formats that have text codecs (e.g. QTtext for
Quicktime, TimedText for MPEG, CMML and Kate for Ogg).

Support for case 6 comes from the fact that it is already possible, it
is flexible, and it is therefore an easy way out of the need to
provide video accessibility support in Web pages. This is in fact
how this example http://v2v.cc/~j/jquery.srt/ is implemented.
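
For illustration, here is a minimal sketch of the kind of script every
such page has to write itself (the element ids, the srtFileContents
variable and the simplistic SRT parsing are mine, not the actual
jquery.srt code):

// Sketch of a page-level (case 6) captioning script; ids and the
// simplistic SRT parsing are illustrative only.
function toSeconds(t) {                    // "00:01:02,500" -> 62.5
  var p = t.replace(',', '.').split(':');
  return p[0] * 3600 + p[1] * 60 + parseFloat(p[2]);
}

function parseSrt(srt) {                   // cue: index, times, text
  var cues = [];
  var blocks = srt.split(/\r?\n\r?\n/);
  for (var i = 0; i < blocks.length; i++) {
    var lines = blocks[i].split(/\r?\n/);
    if (lines.length < 3) continue;
    var times = lines[1].split(' --> ');
    cues.push({ start: toSeconds(times[0]),
                end: toSeconds(times[1]),
                text: lines.slice(2).join('<br>') });
  }
  return cues;
}

var video = document.getElementById('v');
var captionDiv = document.getElementById('captions');
var cues = parseSrt(srtFileContents);      // fetched via XMLHttpRequest

video.addEventListener('timeupdate', function () {
  var t = video.currentTime, html = '';
  for (var i = 0; i < cues.length; i++)
    if (cues[i].start <= t && t < cues[i].end) html = cues[i].text;
  // To a screen reader or search engine this div is just another div;
  // nothing marks it as the video's caption text.
  captionDiv.innerHTML = html;
}, false);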

As I said - for the current state of play, you have come to the right
conclusions. Theoretically.

But we should look at the practical implications.

For case 6, while it works for deaf people, we actually create an
accessibility nightmare for blind people and their web developers.
There is no standard means for a screen reader to identify that a
particular part of the DOM is actually text related to the video and
supposed to be "displayed" with the video (through a screen reader or
a braille reader). Such functionality would need to be implemented
through JavaScript by every single site that wanted to provide audio
annotations.

It's also a nightmare for search engines, since there is no clear way
to identify a specific piece of text as video-related and to use it to
extend knowledge about the video.

As much as case 6 is the easy way out, I would like us to discourage
such solutions before they take hold by providing a viable
alternative: a standard way of relating time-aligned text to video
(or audio). And that unfortunately means attacking case 3 (let me
address case 3 and your objections below).


For case 1, the practical implication is that browser vendors will
have to develop support for a large variety of text codecs, each one
providing different functionalities. It would indeed be nice if we had
one standard format that everybody used, but alas that is not the
case. What will browser vendors do in this situation? Probably nothing
special - they may simply let the underlying media frameworks that
decode the video formats also decode the text formats and render them
on top of the video, thus taking the text completely out of reach of
the Web page. This again means that screen readers cannot get to it,
that search engines will need to find a way of extracting it from the
video rather than from the web page, and generally a worse
accessibility experience.

Now, is it realistic to expect a standard format to emerge? I think
this is actually a chicken-and-egg problem. We currently have poor
solutions (e.g. srt as extra files, or the above-mentioned text codecs
inside specific containers). Lacking an alternative, people will
continue to use these to author captions - and use their own hacked-up
formats to provide other features such as video annotations in speech
bubbles at certain time points and coordinates etc. If there were,
however, a compelling case to use a different standard format, people
would go for it, IMHO - if, e.g., all browser vendors agreed to
support one particular format. In fact, the easiest solution would be
if that particular format was really only HTML. Then browser vendors
would find it trivial to implement, which in turn would encourage Web
developers to choose this format, which in turn would encourage video
container formats to adopt it internally as well. We would then have
created a uniform means of dealing with time-aligned text coming from
any of the three locations listed by you and going to the Web page.
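
Purely as an illustration - the element and attribute names here are
hypothetical, not a worked-out proposal - such time-aligned HTML could
look like this:

<!-- hypothetical time-aligned HTML; <timedtext>, <cue> and the
     timing attributes are made up for illustration only -->
<timedtext>
  <cue start="00:00:01.000" end="00:00:04.500">
    <p>Hello, and welcome.</p>
  </cue>
  <cue start="00:00:05.000" end="00:00:09.000">
    <p>Cues are ordinary <em>HTML</em>, so styling and links come
    for free.</p>
  </cue>
</timedtext>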

As we haven't got any experience with this proposal yet, we obviously
cannot standardise it yet. But strategically, can we keep our options
open towards using such a format in HTML5?


And now to option 3:

> 3. Timed text stored in a separate file, which is then parsed by the user
>   agent and rendered as part of the video automatically by the browser.
>
> This would make authoring subtitles somewhat easier, but would typically
> lose the benefits of subtitles surviving when the video file is extracted.
> It would also involve a distinct increase in implementation and language
> complexity. We would also have to pick a timed text format, or add yet
> another format war to the <video>/<audio> codec debacle, which I think
> would be a really big mistake right now. Given the immature state of timed
> text formats (it seems there are new formats announced every month), it's
> probably premature to pick one -- we should let the market pick one first.

I think excluding option 3 from our list of ways of supporting
time-aligned text is a big mistake. The majority of subtitles
currently available on the Web come from separate files, in particular
in srt or sub format. They are simple formats, easily authored in a
text editor, and can be related to any container format. It is easy to
implement support for them in authoring applications and in player
applications. Encapsulating them into a video file, only to extract
them from it again for decoding, seems an unnecessary nuisance.
This is why I think separate caption files will continue to be the
main way we deal with captions into the future, and why we should
consider supporting them natively in Web browsers rather than leaving
it to every web developer to sort this out for themselves.
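
For reference, a complete srt file is nothing more than numbered cues
with start and end times followed by the caption text, e.g.:

1
00:00:01,000 --> 00:00:04,500
Hello, and welcome.

2
00:00:05,000 --> 00:00:09,000
Captions are plain text
and can span multiple lines.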

The only real issue that we have with separate files is that the
captions may get lost when people download the video, store it
locally, and share it with friends. Maybe we should consider solving
this differently: either we could encapsulate the captions into the
video container upon download, or we could create a zip file or
tarball upon download. I'd just find it a big mistake to ignore the
majority use case in the standard, which is why I proposed the <text>
elements inside the <video> tag.

Here is my example again:
<video src="http://example.com/video.ogv" controls>
 <text category="CC" lang="en" type="text/x-srt" src="caption.srt"></text>
 <text category="SUB" lang="de" type="application/ttaf+xml" src="german.dfxp"></text>
 <text category="SUB" lang="ja" type="application/smil" src="japanese.smil"></text>
 <text category="SUB" lang="fr" type="text/x-srt" src="translation_webservice/fr/caption.srt"></text>
</video>

These <text> elements could be dealt with as normal HTML, where the
content therein is rendered into a specific area. Since we haven't got
a standard format for time-aligned text yet, we will also have to deal
with different formats, which is why I suggested the "type" attribute.
However, that is probably not necessary, because the file extension
indicates the format and the HTTP reply will provide the correct MIME
type. "category" and "lang" would be interesting from an API point of
view for option 1 as well as here.
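
As a purely hypothetical illustration of that API angle - neither the
<text> element nor any such attributes exist in any implementation -
a script could select a track like this:

// Hypothetical: assumes the proposed <text> elements are in the DOM
// and that a (made-up) "enabled" attribute asks the UA to render one.
var video = document.getElementsByTagName('video')[0];
var tracks = video.getElementsByTagName('text');
for (var i = 0; i < tracks.length; i++) {
  if (tracks[i].getAttribute('category') == 'SUB' &&
      tracks[i].getAttribute('lang') == 'de') {
    tracks[i].setAttribute('enabled', 'enabled');
    break;
  }
}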

I think we may be on the right track with this proposal for attacking
option 3, and it would be nice to get feedback on it.



So, let me summarise.

* I agree that Options 2, 4 and 5 are undesirable.
* Option 1 solves video accessibility by pushing it back to the
container formats and taking it out of the Web page. If we indeed
develop APIs for it, they should be consistent with what we get out of
separate files.
* Option 6 is not desirable, since it creates multiple incompatible
accessibility solutions.
* Option 3 needs to be recognized as the main use case, and we need to
acknowledge that it is in need of a solution.
* I think we need a standard format for time-aligned text and I think
it should be HTML. We're in the process of specifying this in more
detail and I hope to be able to report soon.

Cheers,
Silvia.


On Sat, Dec 27, 2008 at 8:16 PM, Ian Hickson <ian at hixie.ch> wrote:
>
> I have carefully read all the feedback in this thread concerning
> associating text with video, for various purposes such as captions,
> annotations, etc.
>
> Taking a step back as far as I can tell there are two axes: where the
> timed text comes from, and how it is rendered.
>
> Where it comes from, it seems, boils down to three options:
>  - embedded in or referenced from the media resource itself
>  - as a separate file parsed by the user agent
>  - as a separate file parsed by the web page
>
> Where the timed text is rendered boils down to two options:
>  - rendered automatically by the user agent
>  - rendered by the web page overlaying content on the video
>
> For the purposes of this discussion I am ignoring burned-in captions,
> since they're basically equivalent to a different video, much like videos
> with overlaid sign language interpreters (or VH1 pop-up's annotations!).
>
>
> These 5 options give us 6 cases:
>
> 1. Timed text in the resource itself (or linked from the resource itself),
>   rendered as part of the video automatically by the user agent.
>
> This is the optimal situation from an accessibility and usability point of
> view, because it works when the video is shown full-screen, it works when
> the video is saved separate from the Web page, it works easily when other
> pages link to the same video file, it requires minimal work from the page
> author, and so forth.
>
> This is what I think we should be encouraging.
>
> It would probably make sense to expose the timed text track selection to
> the Web page through the API, maybe even expose the text itself somehow,
> but these are features that can and should probably wait until <video> has
> been more reliably implemented.
>
>
> 2. Timed text in the resource itself (or linked from the resource itself),
>   exposed to the Web page with no native rendering.
>
> This allows pages to implement experimental subtitling mechanisms while
> still allowing the timed text tracks to survive re-use of the video file,
> but it seems to introduce a high cost (all pages have to implement
> subtitling themselves) with very little gain, and with several
> disadvantages -- different sites will have inconsistent subtitling, bugs
> will be prevalent in the subtitling and accessibility will thus suffer,
> and in all likelihood even videos that have subtitles will end up not
> having them shown, as small sites don't bother to implement anything
> but the most basic controls.
>
>
> 3. Timed text stored in a separate file, which is then parsed by the user
>   agent and rendered as part of the video automatically by the browser.
>
> This would make authoring subtitles somewhat easier, but would typically
> lose the benefits of subtitles surviving when the video file is extracted.
> It would also involve a distinct increase in implementation and language
> complexity. We would also have to pick a timed text format, or add yet
> another format war to the <video>/<audio> codec debacle, which I think
> would be a really big mistake right now. Given the immature state of timed
> text formats (it seems there are new formats announced every month), it's
> probably premature to pick one -- we should let the market pick one first.
>
>
> 4. Timed text stored in a separate file, which is then parsed by the user
>   agent and exposed to the Web page with no native rendering.
>
> This combines the disadvantages of the previous two options, without
> really introducing any groundbreaking advantages.
>
>
> 5. Timed text stored in a separate file, which is then fetched and parsed
>   by the Web page, which then passes it to the browser for rendering.
>
> This is just an excessive level of complexity for a feature that could
> just be supported exclusively by the user agent. In particular, it doesn't
> actually provide for much space for experimentation -- whatever API we
> provide to expose the subtitles would limit what the rendering would be
> like regardless of what the pages want to try.
>
> This option side-steps the issue of picking a format, though.
>
>
> 6. Timed text stored in a separate file, which is then fetched and parsed
>   by the Web page, and which is then rendered by the Web page.
>
> We can't stop this from being available, and there's not much we can do to
> help with this case beyond what we do now. The disadvantages are that it
> doesn't work when the video is shown full-screen, when the video is saved
> separate from the Web page, when other pages link to the same video file
> without using their own implementation of the feature, and it requires
> substantial implementation work from the page. The _advantages_, and they
> are significant, are that pages can easily create subtitles separate from
> the video, they can easily provide features such as automated
> translations, and they can easily implement features that would otherwise
> seem overly ambitious, e.g. hyperlinked annotations with ad tracking.
>
>
> Based on this analysis it seems to me that cases 1 and 6 are important to
> support, but that cases 2 to 5 aren't as compelling -- they either have
> disadvantages that aren't outweighed by their advantages, or they are just
> less powerful than other options.
>
> Cases 1 and 6 right now don't require changes to the spec. I think we
> should eventually provide the APIs mentioned above under case 1 since they
> would help bridge the gap between the two types of timed text solutions,
> but as noted above I think we should wait until implementations are more
> mature before extending the API further.
>
> --
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
>
