[whatwg] SRT research: separating cues
Simon Pieters
simonp at opera.com
Mon Oct 24 00:26:48 PDT 2011
I wanted to research how common it is to fail to separate cues in SRT, and
for what reason.
SRT parsers usually interpret a timings line as a new cue, while WebVTT
wants a blank line (two consecutive newlines) before a new cue.
I took the 65k SRT files we've got, replaced comma with dot and prepended
"WEBVTT\n\n", then ran them in Opera's <track> impl, looking for '-->' in
cue data.
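
For illustration, the conversion step could be sketched roughly as below. This is not the script actually used; the function names are hypothetical, and it assumes the comma is only replaced inside timing lines (a blanket replace of every comma would also touch the cue text).

```python
import re

# Assumed shape of an SRT timing line: "00:00:01,000 --> 00:00:02,500".
SRT_TIMING = re.compile(
    r"^(\d\d:\d\d:\d\d),(\d\d\d)([ \t]*-->[ \t]*)(\d\d:\d\d:\d\d),(\d\d\d)",
    re.MULTILINE,
)

def srt_to_webvtt(srt_text):
    """Swap ',' for '.' in timing lines and prepend the WebVTT header."""
    converted = SRT_TIMING.sub(r"\1.\2\3\4.\5", srt_text)
    return "WEBVTT\n\n" + converted

def lines_with_arrow(cue_text):
    """Return the lines of one cue's text that contain '-->'."""
    return [line for line in cue_text.split("\n") if "-->" in line]
```

Any line returned by lines_with_arrow() for a parsed cue suggests that a following SRT cue was swallowed into it.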
There were 840 files with --> in cue data. This is 1.3% of the files.
Looking at the cue data, there were 11,118 lines that contained -->. There
were 8830 lines of only whitespace.
In the cue data, if I look at valid-looking timing lines
(/^\d\d:\d\d:\d\d\.\d\d\d\s*-->\s*\d\d:\d\d:\d\d\.\d\d\d(\s|$)/) and check
the line before that, or the line before *that* if it looks like an SRT id
(/^\d+\s*$/), then I see 7030 lines of only whitespace and 3761 lines of
something else.
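
The check described above could look something like the following sketch. It uses the two regexps quoted in the text; the function name and the shape of the result are made up for the example, and it assumes the cue data is already split into lines.

```python
import re

# The two patterns quoted in the text above.
TIMING = re.compile(
    r"^\d\d:\d\d:\d\d\.\d\d\d\s*-->\s*\d\d:\d\d:\d\d\.\d\d\d(\s|$)"
)
SRT_ID = re.compile(r"^\d+\s*$")

def classify_preceding_lines(cue_lines):
    """Count what precedes each timing-looking line inside cue data."""
    counts = {"whitespace": 0, "other": 0}
    for i, line in enumerate(cue_lines):
        if not TIMING.search(line):
            continue
        j = i - 1
        # Skip over an SRT id line, if there is one.
        if j >= 0 and SRT_ID.match(cue_lines[j]):
            j -= 1
        if j < 0:
            continue  # nothing before the timing line to classify
        if cue_lines[j].strip() == "":
            counts["whitespace"] += 1
        else:
            counts["other"] += 1
    return counts
```

A line counted as "whitespace" here is one that an SRT parser would likely treat as a separator but that WebVTT keeps as part of the cue.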
Failing to separate cues results in an unpleasant experience for the user,
since basically the screen is filled with several "cues" including their
IDs and timing lines.
Some files had most or all of their cues parsed as a single cue with the
WebVTT parser, e.g. because all lines ended with one or more spaces.
Looking at such a file in a text editor, it's not immediately obvious that
there's an error, because the spaces are not visible. Moreover, the file
is still conforming, so a validator wouldn't help either.
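
To make the trailing-space problem concrete, here is a deliberately simplified stand-in for the cue-splitting step. It is not the real WebVTT parser and models only the rule that a truly empty line ends a cue; split_blocks is a hypothetical name.

```python
def split_blocks(text):
    """Split text into blocks, treating only truly empty lines as separators."""
    blocks, current = [], []
    for line in text.split("\n"):
        if line == "":
            # An empty line ends the current block, if any.
            if current:
                blocks.append(current)
                current = []
        else:
            # A line holding only spaces is NOT empty, so it stays in the block.
            current.append(line)
    if current:
        blocks.append(current)
    return blocks

clean = "1\n00:00:01.000 --> 00:00:02.000\nHi\n\n2\n00:00:03.000 --> 00:00:04.000\nBye\n"
padded = "1\n00:00:01.000 --> 00:00:02.000\nHi\n \n2\n00:00:03.000 --> 00:00:04.000\nBye\n"
```

With these inputs, split_blocks(clean) gives two blocks, while split_blocks(padded) gives a single block that still contains the second id and timing line.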
So what about the cases that aren't whitespace? It seems to be mostly just
missing the newline completely. Some omitted the ID also. One file had a
"|" between all cues.
--
Simon Pieters
Opera Software