[whatwg] Thoughts on video accessibility

Wed Jul 15 16:38:57 PDT 2009

On Sat, 27 Dec 2008, Silvia Pfeiffer wrote:
> > 
> > 6. Timed text stored in a separate file, which is then fetched and 
> > parsed by the Web page, and which is then rendered by the Web page.
>
> For case 6, while it works for deaf people, we actually create an 
> accessibility nightmare for blind people and their web developers. There 
> is no standard means for a screen reader to identify that a particular 
> part in the DOM is actually text related to the video and supposed to be 
> "displayed" with the video (through a screenreader or a braille reader). 

As far as I can tell, that's exactly what ARIA is for.

> Such functionality would need to be implemented through javascript by 
> every single site that wanted to provide audio annotations.

Right.

> It's also a nightmare for search engines, since there is no clear way of 
> identifying a specific text as video-related and use it as such to 
> extend knowledge about the video.

Embedding subtitles inside the video file is certainly the best option 
overall, for both accessibility and for automated analysis, yes.

> > 1. Timed text in the resource itself (or linked from the resource 
> > itself), rendered as part of the video automatically by the user 
> > agent.
>
> For case 1, the practical implications are that browser vendors will 
> have to develop support for a large variety of text codecs, each one 
> providing different functionalities.

I would hope that as with a video codec, we can standardise on a single 
subtitle format, ideally some simple media-independent combination of SRT 
and LRC [1]. It's difficult to solve this problem without a standard 
codec, though.

> In fact, the easiest solution would be if that particular format was 
> really only HTML.

IMHO that would be absurd. HTML means scripting, embedded videos, an 
unbelievably complex rendering system, complex parsing, etc.; plus, what's 
more, it doesn't even support timing yet, so we'd have to add all the 
timing and karaoke features on top of it. Requiring that video players 
embed a timed HTML renderer just to render subtitles is like saying that 
we should ship Microsoft Word with every DVD player, to handle the user 
input when the user wants to type in a new chapter number to jump to.

> But strategically can we keep our options open towards using such a 
> format in HTML5?

As far as I can tell, HTML5 doesn't preclude any particular direction for 
subtitling.

> And now to option 3:
> 
> > 3. Timed text stored in a separate file, which is then parsed by the 
> > user agent and rendered as part of the video automatically by the 
> > browser.
> >
> > This would make authoring subtitles somewhat easier, but would 
> > typically lose the benefits of subtitles surviving when the video file 
> > is extracted. It would also involve a distinct increase in 
> > implementation and language complexity. We would also have to pick a 
> > timed text format, or add yet another format war to the 
> > <video>/<audio> codec debacle, which I think would be a really big 
> > mistake right now. Given the immature state of timed text formats (it 
> > seems there are new formats announced every month), it's probably 
> > premature to pick one -- we should let the market pick one first.
> 
> I think excluding option 3 from our list of ways of supporting
> time-aligned text is a big mistake.

We're not excluding it, we're just delaying its standardisation.

> The majority of subtitles currently available on the Web come from 
> separate files, in particular in srt or sub format. They are simple 
> formats, easily authored in a text editor, and can be related to any 
> container format. It is easy to implement support for them in authoring 
> applications and in player applications. Encapsulating them into a video 
> file and extracting them from a video file again for decoding seems an 
> unnecessary nuisance. This is why I think dealing with separate caption 
> files will continue to be the main way we deal with captions into the 
> future and why we should consider supporting this natively in Web 
> browsers rather than leaving it to every web developer to sort this out 
> himself.

I agree that if we can't get people to embed subtitles straight into their 
video streams, providing a standard way to associate a video file with a 
subtitle stream is the way to go in the long term.

> The only real issue that we have with separate files is that the 
> captions may get lost when people download the video, store it locally, 
> and share it with friends.

This is a pretty big problem, IMHO.

> Maybe we should consider solving this differently. Either we could 
> encapsulate into the video container upon download. Or we could create a 
> zip-file or tarball upon download. I'd just find it a big mistake to 
> ignore the majority use case in the standard, which is why I proposed 
> the <text> elements inside the <video> tag.

If browser vendors are willing to merge subtitles and video files when 
saving them, that would be great. Is this easy to do?

> Here is my example again:
> <video src="http://example.com/video.ogv" controls>
>  <text category="CC" lang="en" type="text/x-srt" src="caption.srt"></text>
>  <text category="SUB" lang="de" type="application/ttaf+xml" src="german.dfxp"></text>
>  <text category="SUB" lang="jp" type="application/smil" src="japanese.smil"></text>
>  <text category="SUB" lang="fr" type="text/x-srt" src="translation_webservice/fr/caption.srt"></text>
> </video>

Here's a counterproposal:

   <video src="http://example.com/video.ogv"
          subtitles="http://example.com/caption.srt" controls>
   </video>

I think this would be fine in the long term. I don't think the existing 
implementations of <video> are at a point where it makes sense to define 
this yet, though.

It would be interesting to hear back from the browser vendors about how 
easily the subtitles could be kept with the video in a way that survives 
reuse in other contexts.

-- Footnote --

[1] Here's a strawman subtitle format based on SRT and LRC:

Grammar:

   subtitles := subtitle*
   subtitle  := id? location line* crlf
   id        := number crlf
   location  := timestamp arrow timestamp x1? x2? y1? y2? crlf
   number    := <decimal format>
   timestamp := <HH:MM:SS,FFF or HH:MM:SS.FFF, hours optional>
   arrow     := space "-->" space
   x1        := space "X1:" number
   y1        := space "Y1:" number
   x2        := space "X2:" number
   y2        := space "Y2:" number
   line      := style? text [ karaoke text ]* crlf
   style     := "<" [ number | "sound" | "comment" | "credit" ] ">" space
   karaoke   := "<" timestamp ">"
   text      := <any Unicode text other than crlf>
   crlf      := space [ <cr lf> | <cr> | <lf> ]
   space     := " "*

Looks like:

   1
   00:02:26,407 --> 00:02:31,356 X1:100 X2:100 Y1:100 Y2:100
   <1>What do you mean, easy?
   <2>I don't think this is easy

   2
   00:03:00,102 --> 00:03:05,000 X1:100 X2:100 Y1:100 Y2:100
   <1>It's very <00:03:02,500> easy

The ID is ignored.

Blocks whose timestamps can't be parsed are skipped.

Blocks whose timestamps can be parsed but that have other errors have the 
errors ignored.
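
To make the block handling above concrete, here is a minimal Python sketch, 
not a reference implementation. The function names (parse_timestamp, 
parse_location, parse_blocks), the dict layout, and the choice to report 
times as integer milliseconds are all assumptions of this sketch rather 
than part of the strawman. It also assumes the X/Y fields are separated by 
whitespace and tolerates leading indentation on each line.

```python
import re

# HH:MM:SS,FFF or HH:MM:SS.FFF, with the hours field optional.
TIMESTAMP = re.compile(r"^(?:(\d+):)?(\d+):(\d+)[,.](\d{3})$")


def parse_timestamp(text):
    """Return the time in integer milliseconds, or None if unparseable."""
    match = TIMESTAMP.match(text.strip())
    if match is None:
        return None
    hours, minutes, seconds, millis = match.groups()
    total_seconds = (int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + int(millis)


def parse_location(line):
    """Parse 'start --> end [X1:n] [X2:n] [Y1:n] [Y2:n]' into a dict."""
    parts = line.split("-->")
    if len(parts) != 2:
        return None
    rest = parts[1].split()
    start = parse_timestamp(parts[0])
    end = parse_timestamp(rest[0]) if rest else None
    if start is None or end is None:
        return None
    cue = {"start": start, "end": end}
    for token in rest[1:]:
        key, _, value = token.partition(":")
        if key in ("X1", "X2", "Y1", "Y2") and value.isdecimal():
            cue[key.lower()] = int(value)
        # Anything else on the line is an "other error" and is ignored.
    return cue


def parse_blocks(source):
    """Return one dict per block whose timestamps could be parsed."""
    # The grammar allows CR LF, CR, or LF as a line ending.
    lines = source.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    groups, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            groups.append(current)
            current = []
    if current:
        groups.append(current)

    blocks = []
    for group in groups:
        if group[0].isdecimal():
            group = group[1:]  # the ID is ignored
        if not group:
            continue
        cue = parse_location(group[0])
        if cue is None:
            continue  # timestamps can't be parsed: skip the block
        cue["lines"] = group[1:]
        blocks.append(cue)
    return blocks
```

Feeding it the two-block example above should yield two dicts, each with 
"start", "end", the four coordinates, and the raw text lines.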

If x1 is present but not x2, left align on x1.
If x2 is present but not x1, right align on x2.
If both x1 and x2 are present, center between them.
If neither x1 nor x2 is present, center across frame.

If y1 is present but not y2, top align on y1.
If y2 is present but not y1, bottom align on y2.
If both y1 and y2 are present, center between them.
If neither y1 nor y2 is present, center across frame.
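
The eight alignment rules above reduce to one rule applied once per axis. 
A small Python sketch of that reading follows; the function names and the 
(alignment, position) return shape are hypothetical, and a position of 
None is this sketch's way of saying "center across the frame".

```python
def align_axis(low, high, low_name, high_name):
    """Apply the alignment rule for one axis.

    low/high are the parsed coordinates (x1/x2 or y1/y2), or None if absent.
    Returns (alignment, position); position None means "frame center".
    """
    if low is not None and high is None:
        return (low_name, low)        # e.g. left align on x1
    if high is not None and low is None:
        return (high_name, high)      # e.g. right align on x2
    if low is not None and high is not None:
        return ("center", (low + high) / 2)  # center between them
    return ("center", None)           # neither present


def resolve_alignment(x1=None, x2=None, y1=None, y2=None):
    """Resolve horizontal and vertical placement for one block."""
    return {
        "horizontal": align_axis(x1, x2, "left", "right"),
        "vertical": align_axis(y1, y2, "top", "bottom"),
    }
```

Treating both axes with the same helper keeps the horizontal and vertical 
rules from drifting apart in an implementation.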

The style allows the author to pick either a character (by number), which 
will then cause the user agent to pick a colour using a UA-specific 
mapping, or a non-character style for translation notes, notes on 
background sounds and music, captioning credits, or whatnot. Default style 
is <1>.

The timestamps embedded in the text are karaoke time points; if they are 
present, the line is to be rendered with a progressive fill given by the 
style, from the time before the text to the time after the text (times at 
the start and end are implied by the start and end of the block).
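
As a rough illustration of the style and karaoke rules, the Python sketch 
below splits one text line into a style plus timed segments. The names 
(to_ms, parse_line), the millisecond integers, and the (start, end, text) 
tuples are assumptions made for this sketch only.

```python
import re

# Optional leading style: a character number or a named non-character style.
STYLE = re.compile(r"^<(\d+|sound|comment|credit)>\s*")
# An embedded karaoke time point, hours optional.
KARAOKE = re.compile(r"<((?:\d+:)?\d+:\d+[,.]\d{3})>")


def to_ms(stamp):
    """Convert HH:MM:SS,FFF (hours optional) to integer milliseconds."""
    fields = [int(field) for field in re.split(r"[:,.]", stamp)]
    fields = [0] * (4 - len(fields)) + fields  # pad a missing hours field
    hours, minutes, seconds, millis = fields
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def parse_line(line, block_start, block_end):
    """Return (style, segments) for one line of a block.

    Each segment is (fill_start, fill_end, text); the first and last times
    are implied by the start and end of the enclosing block.
    """
    style = "1"  # default style is <1>
    match = STYLE.match(line)
    if match:
        style = match.group(1)
        line = line[match.end():]
    # With one capturing group, split() alternates text and timestamps.
    pieces = KARAOKE.split(line)
    texts = pieces[0::2]
    times = [block_start] + [to_ms(s) for s in pieces[1::2]] + [block_end]
    segments = [(times[i], times[i + 1], texts[i]) for i in range(len(texts))]
    return style, segments
```

A line with no embedded time points comes back as a single segment that 
spans the whole block, so a renderer can treat both cases uniformly.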

This combines the SRT and LRC formats in a way that is mostly backwards 
compatible with SRT, easily convertible from LRC, trivial to implement 
both for creation and consumption, easily supportable in cheap dedicated 
hardware, mostly compatible with over-the-air subtitle formats so videos 
taken from the Web and shown on-air can use the native subtitling 
mechanism, compatible with braille systems, compatible with VoiceOver 
abilities, and doesn't do anything ridiculous like allow videos or scripts 
to be embedded inside subtitles. It supports the majority of the use cases 
I'm aware of (movies, TV shows, anime, karaoke).

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'