[whatwg] SRT research: timestamps
Simon Pieters
simonp at opera.com
Wed Oct 5 10:22:51 PDT 2011
I did some research on authoring errors in SRT timestamps to inform
whether WebVTT parsing of timestamps should be changed.
Our starting point was 70,000 files provided to Opera (for research
purposes) by opensubtitles.org (thanks!) supposedly being SRT files. We
are not allowed to share the files.
Filtering out files that don't contain "-->" left 65,000 files.
Grepping for lines that contain "-->" resulted in 52,000,000 lines (which
should represent roughly the total number of cues). Of those, 31,900 lines
(about 0.06%) are invalid, i.e. don't match the Python regexp
'\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d\s*-->\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d($|\s)'.
Those are categorized as follows. Note that a line can belong to several
categories (except for "none of the above"):
Category                 Count  Regexp
hours too few               57  '(^|\s|>)\d[:\.,]\d+[:\.,]\d+'
hours too many             834  '(^|\s|>)\d{3,}[:\.,]\d+[:\.,]\d+'
minutes too few             16  '(^|\s|>)\d+[:\.,]\d[:\.,]\d+'
minutes too many            11  '(^|\s|>)\d+[:\.,]\d{3,}[:\.,]\d+'
seconds too few            889  '(^|\s|>)\d+[:\.,]\d+[:\.,]\d([:.,-]|\s|$)'
seconds too many           154  '(^|\s|>)\d+[:\.,]\d+[:\.,]\d{3,}'
decimals too few          2085  '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{1,2}(\s|$|-)'
decimals too many           62  '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{4,}'
decimals missing           132  '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+(\s|$|-)'
minutes gt 59                6  '(^|\s|>)\d+[:\.,]0{0,}[6-9]\d+[:\.,]\d+'
seconds gt 59              184  '(^|\s|>)\d+[:\.,]\d+[:\.,]0{0,}[6-9]\d+'
leading garbage            599  '^[^\s\d]+\d+[:\.,]\d+[:\.,]\d+'
trailing garbage           532  '-->\s*(\d+[:\.,]){2,3}\d+(\s+[^\s]|[^\s\d:\.,])'
colon instead of comma      26  '\d+[:\.,]\d+[:\.,]\d+[:\.,]\d+:\d+'
dot instead of comma     25372  '\d+[:\.,]\d+[:\.,]\d+\.\d+'
comma instead of colon      82  '\d+,\d+[:\.,]\d+'
dot instead of colon        41  '\d+\.\d+[:\.,]\d+'
id before timestamp        115  '^\s*\d+\s+\d+[:\.,]\d+'
spaces in timestamp        922  '(\d[\d\s]*[:\.,]\s*){2,3}\d[\d\s]*' and not '(\d+[:\.,]){2,3}\d+'
too long arrow             326  '\d\s*-{3,}>\s*\d'
none of the above          969
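For readers who want to reproduce this kind of categorization on their own
files, here is a minimal sketch of how it could be done. It is not the
script used for the numbers above. The function names classify and tally
are made up for illustration, only a handful of the categories are
included, and it is an assumption that the validity regexp is applied
anchored at the start of the line (re.match) while the category regexps
are searched anywhere in the line (re.search).

```python
import re
from collections import Counter

# Validity regexp quoted in the post; assumed to be anchored at line start.
VALID = re.compile(
    r'\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d\s*-->\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d($|\s)'
)

# A subset of the category regexps from the post (not the full list).
CATEGORIES = {
    "hours too many": r'(^|\s|>)\d{3,}[:\.,]\d+[:\.,]\d+',
    "decimals too few": r'(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{1,2}(\s|$|-)',
    "trailing garbage": r'-->\s*(\d+[:\.,]){2,3}\d+(\s+[^\s]|[^\s\d:\.,])',
    "dot instead of comma": r'\d+[:\.,]\d+[:\.,]\d+\.\d+',
    "too long arrow": r'\d\s*-{3,}>\s*\d',
}


def classify(line):
    """Return None for a valid timing line, else a list of category names.

    A line may fall into several categories; if it matches none of the
    categories in this subset, it is reported as "none of the above".
    """
    if VALID.match(line):
        return None
    hits = [name for name, pat in CATEGORIES.items() if re.search(pat, line)]
    return hits or ["none of the above"]


def tally(lines):
    """Count category hits over all lines that contain "-->"."""
    counts = Counter()
    for line in lines:
        if "-->" not in line:
            continue  # not a timing line, so not considered at all
        hits = classify(line)
        if hits is not None:
            counts.update(hits)
    return counts


if __name__ == "__main__":
    sample = [
        "00:00:01,000 --> 00:00:03,000",
        "00:00:01.000 --> 00:00:03.000",
        "00:00:01,000 --> 00:00:03,000Hello.",
        "34500:24:01,000 --> 00:24:03,000",
    ]
    for line in sample:
        print(repr(line), classify(line))
```

Because a line is tested against every category regexp, the per-category
counts can add up to more than the number of invalid lines.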
The most common error by far is to use a dot instead of a comma: 25,372 of
the 31,900 invalid lines, or about 80%.
Some lines appear to be in a different format, and some appear to be just
garbage.
Too few or too many hours might not technically be an error; however, some
of the lines with too many hours appeared to be cases where the line break
between the id and the timestamp was missing (with no whitespace in
between), e.g.:
34500:24:01,000 --> 00:24:03,000
The trailing garbage is mostly cases where the line break between the
timestamp and the cue text is missing, e.g.:
00:00:01,000 --> 00:00:03,000Hello.
--
Simon Pieters
Opera Software