[whatwg] SRT research: separating cues

Mon Oct 24 00:26:48 PDT 2011

I wanted to research how common it is to fail to separate cues in SRT, and  
for what reason.

SRT parsers usually interpret a timings line as a new cue, while WebVTT
requires a blank line (two consecutive newlines) before a new cue.

I took the 65k SRT files we've got, replaced comma with dot and prepended  
"WEBVTT\n\n", then ran them in Opera's <track> impl, looking for '-->' in  
cue data.
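
For reference, the conversion step could be sketched roughly as follows. This
is not the script actually used. The function name srt_to_webvtt is
hypothetical, and it assumes the comma replacement applies only to
timestamps, which the description above does not specify.

```python
import re

# Assumption: only the comma inside an SRT timestamp (HH:MM:SS,mmm) is replaced;
# commas in cue text are left alone.
SRT_TIMESTAMP = re.compile(r"(\d\d:\d\d:\d\d),(\d\d\d)")


def srt_to_webvtt(srt_text: str) -> str:
    """Turn timestamp commas into dots and prepend the WEBVTT header."""
    return "WEBVTT\n\n" + SRT_TIMESTAMP.sub(r"\1.\2", srt_text)


sample = "1\n00:00:01,000 --> 00:00:02,500\nHello, world\n"
print(srt_to_webvtt(sample))
```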

There were 840 files with --> in cue data. This is 1.3% of the files.

Looking at the cue data, there were 11,118 lines that contained -->. There
were 8,830 lines of only whitespace.

In the cue data, if I look at valid-looking timing lines  
(/^\d\d:\d\d:\d\d\.\d\d\d\s*-->\s*\d\d:\d\d:\d\d\.\d\d\d(\s|$)/) and check  
the line before that, or the line before *that* if it looks like an SRT id  
(/^\d+\s*$/), then I see 7030 lines of only whitespace and 3761 lines of  
something else.
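
A minimal sketch of that check, using the two regular expressions quoted
above. The function name classify_preceding and the list-of-lines input are
assumptions for illustration, not the code that produced the numbers.

```python
import re

# The two patterns quoted in the text.
TIMING = re.compile(
    r"^\d\d:\d\d:\d\d\.\d\d\d\s*-->\s*\d\d:\d\d:\d\d\.\d\d\d(\s|$)"
)
SRT_ID = re.compile(r"^\d+\s*$")


def classify_preceding(cue_lines):
    """For each timing line inside cue data, classify the line before it.

    If the line directly before looks like an SRT id, the line before
    *that* is classified instead.
    """
    counts = {"whitespace": 0, "other": 0}
    for i, line in enumerate(cue_lines):
        if not TIMING.match(line):
            continue
        j = i - 1
        if j >= 0 and SRT_ID.match(cue_lines[j]):
            j -= 1  # skip over the SRT id
        if j < 0:
            continue  # nothing precedes this timing line
        if cue_lines[j].strip() == "":
            counts["whitespace"] += 1
        else:
            counts["other"] += 1
    return counts


# Hypothetical cue data from one over-long cue.
lines = [
    "Hello",
    " ",
    "2",
    "00:00:03.000 --> 00:00:04.000",
    "World",
    "00:00:05.000 --> 00:00:06.000",
    "Again",
]
print(classify_preceding(lines))
```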

Failing to separate cues results in an unpleasant experience for the user,  
since basically the screen is filled with several "cues" including their  
IDs and timing lines.

Some files had most or all of their cues parsed as a single cue with the  
WebVTT parser, e.g. because all lines ended with one or more spaces.  
Looking at such a file in a text editor, it's not immediately obvious that  
there's an error, because the spaces are not visible. Moreover, the file  
is not non-conforming, so a validator wouldn't help either.
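
To illustrate the trailing-space case, here is a deliberately simplified
model of block splitting. It is not Opera's parser or the WebVTT algorithm;
it only assumes that a cue ends at a completely empty line, and the names
split_blocks, clean and padded are made up for the example.

```python
def split_blocks(text):
    """Split text into blocks separated by truly empty lines.

    A line containing only spaces is NOT treated as a separator, which is
    the behaviour that merges cues in the files described above.
    """
    blocks, current = [], []
    for line in text.split("\n"):
        if line == "":
            if current:
                blocks.append(current)
                current = []
        else:
            current.append(line)
    if current:
        blocks.append(current)
    return blocks


clean = (
    "00:00:01.000 --> 00:00:02.000\nOne\n"
    "\n"
    "00:00:03.000 --> 00:00:04.000\nTwo\n"
)
# Same content, but every line (including the separator) ends with a space.
padded = (
    "00:00:01.000 --> 00:00:02.000 \nOne \n"
    " \n"
    "00:00:03.000 --> 00:00:04.000 \nTwo \n"
)

print(len(split_blocks(clean)))
print(len(split_blocks(padded)))
```

In this model the second file yields one block whose text includes the
second timing line, which is the merged-cue effect described above.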

So what about the cases that aren't whitespace? It seems to be mostly just  
missing the newline completely. Some omitted the ID also. One file had a  
"|" between all cues.

-- 
Simon Pieters
Opera Software