[whatwg] Thoughts on video accessibility
ian at hixie.ch
Wed Jul 15 16:38:57 PDT 2009
On Sat, 27 Dec 2008, Silvia Pfeiffer wrote:
> > 6. Timed text stored in a separate file, which is then fetched and
> > parsed by the Web page, and which is then rendered by the Web page.
> For case 6, while it works for deaf people, we actually create an
> accessibility nightmare for blind people and their web developers. There
> is no standard means for a screen reader to identify that a particular
> part in the DOM is actually text related to the video and supposed to be
> "displayed" with the video (through a screenreader or a braille reader).
As far as I can tell, that's exactly what ARIA is for.
> every single site that wanted to provide audio annotations.
> It's also a nightmare for search engines, since there is no clear way of
> identifying a specific text as video-related and use it as such to
> extend knowledge about the video.
Embedding subtitles inside the video file is certainly the best option
overall, for both accessibility and for automated analysis, yes.
> > 1. Timed text in the resource itself (or linked from the resource
> > itself), rendered as part of the video automatically by the user
> > agent.
> For case 1, the practical implications are that browser vendors will
> have to develop support for a large variety of text codecs, each one
> providing different functionalities.
I would hope that as with a video codec, we can standardise on a single
subtitle format, ideally some simple media-independent combination of SRT
and LRC. It's difficult to solve this problem without a standard
> In fact, the easiest solution would be if that particular format was
> really only HTML.
IMHO that would be absurd. HTML means scripting, embedded videos, an
unbelievably complex rendering system, complex parsing, etc; plus, what's
more, it doesn't even support timing yet, so we'd have to add all the
timing and karaoke features on top of it. Requiring that video players
embed a timed HTML renderer just to render subtitles is like saying that
we should ship Microsoft Word with every DVD player, to handle the user
input when the user wants to type in a new chapter number to jump to.
> But strategically can we keep our options open towards using such a
> format in HTML5?
As far as I can tell, HTML5 doesn't preclude any particular direction for
> And now to option 3:
> > 3. Timed text stored in a separate file, which is then parsed by the
> > user agent and rendered as part of the video automatically by the
> > browser.
> > This would make authoring subtitles somewhat easier, but would
> > typically lose the benefits of subtitles surviving when the video file
> > is extracted. It would also involve a distinct increase in
> > implementation and language complexity. We would also have to pick a
> > timed text format, or add yet another format war to the
> > <video>/<audio> codec debacle, which I think would be a really big
> > mistake right now. Given the immature state of timed text formats (it
> > seems there are new formats announced every month), it's probably
> > premature to pick one -- we should let the market pick one first.
> I think excluding option 3 from our list of ways of supporting
> time-aligned text is a big mistake.
We're not excluding it, we're just delaying its standardisation.
> The majority of subtitles currently available on the Web come from
> separate files, in particular in srt or sub format. They are simple
> formats, easily authored in a text editor, and can be related to any
> container format. It is easy to implement support for them in authoring
> applications and in player applications. Encapsulating them into a video
> file and extracting them from a video file again for decoding seems an
> unnecessary nuisance. This is why I think dealing with separate caption
> files will continue to be the main way we deal with captions into the
> future and why we should consider supporting this natively in Web
> browsers rather than leaving it to every web developer to sort this out
I agree that if we can't get people to embed subtitles straight into their
video streams, providing a standard way to associate a video file with a
subtitle stream is the way to go in the long term.
> The only real issue that we have with separate files is that the
> captions may get lost when people download the video, store it locally,
> and share it with friends.
This is a pretty big problem, IMHO.
> Maybe we should consider solving this differently. Either we could
> encapsulate into the video container upon download. Or we could create a
> zip-file or tarball upon download. I'd just find it a big mistake to
> ignore the majority use case in the standard, which is why I proposed
> the <text> elements inside the <video> tag.
If browser vendors are willing to merge subtitles and video files when
saving them, that would be great. Is this easy to do?
> Here is my example again:
> <video src="http://example.com/video.ogv" controls>
> <text category="CC" lang="en" type="text/x-srt" src="caption.srt"></text>
> <text category="SUB" lang="de" type="application/ttaf+xml" src="german.dfxp"></text>
> <text category="SUB" lang="jp" type="application/smil" src="japanese.smil"></text>
> <text category="SUB" lang="fr" type="text/x-srt" src="translation_webservice/fr/caption.srt"></text>
Here's a counterproposal:
I think this would be fine, in the long term. I don't think the existing
implementations of <video> are at a point where it makes sense to define
this yet, though.
It would be interesting to hear back from the browser vendors about how
easily the subtitles could be kept with the video in a way that survives
reuse in other contexts.
-- Footnote --
 Here's a strawman subtitle format based on SRT and LRC:
subtitles := subtitle*
subtitle := id? location line* crlf
id := number crlf
location := timestamp arrow timestamp x1? x2? y1? y2? crlf
number := <decimal format>
timestamp := <HH:MM:SS,FFF or HH:MM:SS.FFF, hours optional>
arrow := space "-->" space
x1 := space "X1:" number
y1 := space "Y1:" number
x2 := space "X2:" number
y2 := space "Y2:" number
line := style? text [ karaoke text ]* crlf
style := "<" [ number | "sound" | "comment" | "credit" ] ">" space
karaoke := "<" timestamp ">"
text := <any Unicode text other than crlf>
crlf := space [ <cr lf> | <cr> | <lf> ]
space := " "*
00:02:26,407 --> 00:02:31,356 X1:100 X2:100 Y1:100 Y2:100
<1>What do you mean, easy?
<2>I don't think this is easy
00:03:00,102 --> 00:03:05,000 X1:100 X2:100 Y1:100 Y2:100
<1>It's very <00:03:02,500> easy
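To illustrate the grammar above, here is a minimal Python sketch that parses
a timestamp and a location line. The function names (parse_timestamp,
parse_location), the choice of milliseconds as the unit, and the dictionary
used for the coordinates are assumptions made for this example, not part of
the strawman. The sketch is also slightly more lenient than the grammar, in
that it accepts the X/Y fields in any order.

```python
import re

# Hypothetical helpers; the strawman only defines the syntax, not an API.
# timestamp := HH:MM:SS,FFF or HH:MM:SS.FFF, hours optional
TIMESTAMP = re.compile(r"(?:(\d+):)?(\d{1,2}):(\d{2})[,.](\d{3})")

# location := timestamp arrow timestamp x1? x2? y1? y2?
# "space" in the grammar is zero or more U+0020 characters.
LOCATION = re.compile(
    r"^ *([\d:.,]+) *--> *([\d:.,]+)((?: *[XY][12]:\d+)*) *$"
)


def parse_timestamp(text):
    """Return the time in milliseconds, or None if it can't be parsed."""
    m = TIMESTAMP.fullmatch(text.strip(" "))
    if m is None:
        return None
    hours, minutes, seconds, millis = m.groups()
    total_seconds = (int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + int(millis)


def parse_location(line):
    """Return (start_ms, end_ms, coords), or None if the line is unusable.

    A None result corresponds to a block whose timestamps can't be parsed.
    """
    m = LOCATION.match(line.rstrip("\r\n"))
    if m is None:
        return None
    start = parse_timestamp(m.group(1))
    end = parse_timestamp(m.group(2))
    if start is None or end is None:
        return None
    coords = {
        name.lower(): int(value)
        for name, value in re.findall(r"([XY][12]):(\d+)", m.group(3))
    }
    return start, end, coords


if __name__ == "__main__":
    print(parse_location(
        "00:02:26,407 --> 00:02:31,356 X1:100 X2:100 Y1:100 Y2:100"
    ))
```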
The ID is ignored.
Blocks whose timestamps can't be parsed are skipped.
Blocks whose timestamps can be parsed but that have other errors have the
If x1 is present but not x2, left align on x1.
If x2 is present but not x1, right align on x2.
If both x1 and x2 are present, center between them.
If neither x1 nor x2 are present, center across frame.
If y1 is present but not y2, top align on y1.
If y2 is present but not y1, bottom align on y2.
If both y1 and y2 are present, center between them.
If neither y1 nor y2 are present, center across frame.
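The eight alignment rules above reduce to the same four-way decision applied
once per axis. A minimal Python sketch, assuming the coordinates arrive as a
dictionary with optional keys x1, x2, y1 and y2; the function names and the
(mode, reference points) return shape are hypothetical. An empty tuple of
reference points stands for "center across frame".

```python
def resolve_axis(first, second, first_mode, second_mode):
    """Apply the per-axis rule; return (mode, reference coordinates)."""
    if first is not None and second is None:
        return (first_mode, (first,))       # align on the first coordinate
    if second is not None and first is None:
        return (second_mode, (second,))     # align on the second coordinate
    if first is not None and second is not None:
        return ("center", (first, second))  # center between the two
    return ("center", ())                   # center across the frame


def resolve_position(coords):
    """Map optional x1/x2/y1/y2 values to horizontal and vertical alignment."""
    return {
        "horizontal": resolve_axis(
            coords.get("x1"), coords.get("x2"), "left", "right"
        ),
        "vertical": resolve_axis(
            coords.get("y1"), coords.get("y2"), "top", "bottom"
        ),
    }


if __name__ == "__main__":
    print(resolve_position({"x1": 100, "x2": 300, "y2": 50}))
```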
The style allows the author to pick either a character (by number), which
will then cause the user agent to pick a colour using a UA-specific
mapping, or a non-character style for translation notes, notes on
background sounds and music, captioning credits, or whatnot. Default style
The timestamps embedded in the text are karaoke time points; if they are
present, the line is to be rendered with a progressive fill given by the
style, from the time before the text to the time after the text (times at
the start and end are implied by the start and end of the block).
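One way to read the karaoke rule is that each run of text gets a fill
interval, bounded by the nearest time points on either side, with the block's
own start and end filling in at the edges. A minimal Python sketch under that
reading; the names to_ms and karaoke_segments are hypothetical, times are
assumed to be in milliseconds, and the leading style marker is assumed to
have been stripped already.

```python
import re

# karaoke := "<" timestamp ">"
KARAOKE = re.compile(r"<((?:\d+:)?\d{1,2}:\d{2}[,.]\d{3})>")


def to_ms(stamp):
    """Convert HH:MM:SS,FFF (hours optional, '.' allowed) to milliseconds."""
    clock, millis = re.split(r"[,.]", stamp)
    fields = [int(field) for field in clock.split(":")]
    while len(fields) < 3:
        fields.insert(0, 0)  # hours are optional
    hours, minutes, seconds = fields
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


def karaoke_segments(text, block_start, block_end):
    """Split a line into (fill_start_ms, fill_end_ms, text) segments."""
    # With one capturing group, re.split alternates text and timestamps.
    parts = KARAOKE.split(text)
    texts = parts[0::2]
    times = [block_start] + [to_ms(t) for t in parts[1::2]] + [block_end]
    return [(times[i], times[i + 1], texts[i]) for i in range(len(texts))]


if __name__ == "__main__":
    # The second block of the example above, without its "<1>" style marker.
    for segment in karaoke_segments(
        "It's very <00:03:02,500> easy", 180102, 185000
    ):
        print(segment)
```

A line without any embedded time points comes back as a single segment that
spans the whole block.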
This combines the SRT and LRC formats in a way that is mostly backwards
compatible with SRT, easily convertible from LRC, trivial to implement
both for creation and consumption, easily supportable in cheap dedicated
hardware, mostly compatible with over-the-air subtitle formats so videos
taken from the Web and shown on-air can use the native subtitling
mechanism, compatible with braille systems, compatible with VoiceOver
abilities, and doesn't do anything ridiculous like allow videos or scripts
to be embedded inside subtitles. It supports the majority of the use cases
I'm aware of (movies, TV shows, anime, karaoke).
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'