[whatwg] WebSRT feedback

Tue Oct 5 19:04:44 PDT 2010

Over the past week I've attended three video-related events in New York and
have discussed <track> and WebSRT at all of them. Here's a lengthy report
of the feedback, both mine and others'.

At the Open Subtitles Design Summit [1], there was some discussion about
captioning for the hard of hearing (HoH). I've already put this input into
a related bug [2], but to summarize: the default rendering for the voices
syntax should probably be to prefix the text cue with the name of the
speaker, not to do anything funny with colors or positioning. What's less
clear is whether always prefixing with the speaker would be annoying, or
whether it should be done only to disambiguate.

For my Open Video Conference [3] presentation [4] I did a JavaScript  
implementation of the most interesting parts of <track> and WebSRT to be  
able to demo what the future might hold [5][6][7]. I have some issues with  
the parser that are at the end of this mail.

At FOMS [8] we had a session on WebSRT [9] which was extremely helpful. It  
turns out that SRT has more syntax variations than we had thought, kindly  
pointed out by VLC developer j-b. Even though there is no SRT spec, there  
is a test suite of sorts [10] that I had never seen before. I'll call SRT  
which follows the syntax implied by these tests ale5000-SRT. Apart from
the HTML-like markup we knew about, ale5000-SRT also has various markup of
the form {...}, which was borrowed from SSA, as well as \h and \N for "hard
space" and line break respectively. Also in the crazy department: tags
that lack a matching opening or closing tag should be rendered as plain
text, and a stray < should also just be displayed as text. VLC
actually implements most of this, as does VSFilter, which we should have  
tested but didn't [11]. It would probably be possible to write a spec for  
ale5000-SRT, but extensibility would be limited to matched opening and  
closing tags, which doesn't work for the suggested voices syntax. With  
this mess, I'd rather not extend ale5000-SRT. I can only agree with Silvia  
that we should make WebSRT identifiable, so that different parsers can be  
used.  So:

* Add magic bytes to identify WebSRT, maybe "WebSRT". (This will break  
some existing SRT parsers.)
* Make WebSRT always be UTF-8, since you can't reuse existing SRT files  
anyway.
* Note that certain ale5000-SRT syntax is not part of WebSRT, so that one  
doesn't have to debug the parsing algorithm to learn that.
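To make the first two points concrete, here is a minimal sketch of how a
player might choose a parser. It assumes the signature is the literal string
"WebSRT" on the first line, which is only the "maybe" suggested above.
Tolerating a UTF-8 byte order mark is a further assumption, and the function
names are hypothetical.

```python
import codecs

# Hypothetical signature: the mail only suggests "maybe WebSRT".
SIGNATURE = b"WebSRT"


def sniff_subtitle_format(data: bytes) -> str:
    """Return "websrt" if data starts with the assumed signature, else "legacy-srt"."""
    # Assumption: an optional UTF-8 byte order mark may precede the signature.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    if not data.startswith(SIGNATURE):
        return "legacy-srt"
    # Require a line ending (or end of file) right after the signature so
    # that a first line such as "WebSRTfoo" is not misidentified.
    rest = data[len(SIGNATURE):]
    if rest == b"" or rest[:1] in (b"\n", b"\r"):
        return "websrt"
    return "legacy-srt"


def decode_websrt(data: bytes) -> str:
    """Decode an identified WebSRT file as UTF-8, without guessing an encoding."""
    # "utf-8-sig" strips a leading byte order mark if one is present.
    return data.decode("utf-8-sig")
```

Anything reported as "legacy-srt" would go to an existing SRT parser with its
usual encoding guesswork, while "websrt" files never need encoding detection.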

Styling hooks were requested. If we only have the predefined tags (i, b,
...) and voices, these will most certainly be abused, e.g. <i> being used
where italics aren't wanted or <v Foo> being used just for styling, which
breaks the accessibility value it has.

As an aside, the idea of using an HTML parser for the cue text wasn't very  
popular.

There was also some discussion about metadata. Language is sometimes  
necessary for the font engine to pick the right glyph. With legacy SRT the  
encoding could be used as a hint, but if we use UTF-8 that's not possible.  
License is also an often requested piece of metadata. I have no strong
opinion about how to solve this, but key-value pairs like HTTP headers
come to mind.
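Purely as an illustration of the key-value idea, here is a sketch that reads
"Key: Value" lines from the top of a file up to the first blank line. The
header layout, the "Language" and "License" field names and the function name
are all hypothetical; nothing like this is specified anywhere.

```python
def parse_header_block(text: str) -> tuple:
    """Split text into (metadata dict, remaining text) at the first blank line."""
    head, _, rest = text.partition("\n\n")
    metadata = {}
    for line in head.split("\n"):
        key, sep, value = line.partition(":")
        # Lines without a colon (e.g. a signature line) are skipped.
        if sep and key.strip():
            metadata[key.strip().lower()] = value.strip()
    return metadata, rest


example = (
    "WebSRT\n"
    "Language: sv\n"
    "License: CC BY 3.0\n"
    "\n"
    "00:00:00.000 --> 00:00:01.000\n"
    "Bla\n"
)
metadata, cues = parse_header_block(example)
```

Keys are lower-cased here so that lookups are case-insensitive, as with HTTP
header names. Whether that is desirable for subtitles is an open question.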

Finally, some things I think are broken in the current WebSRT parser:

* Parsing of timestamps is more liberal than it needs to be. In
particular, treating the part after the decimal separator as an integer
and dividing by 1000 leads to 00:00:00.1 being interpreted as 0.001
seconds rather than 0.1 seconds, which is weird. This is what e.g. VLC
does, but if we need to add a header anyway we could just as well change
this to something saner. Alternatively, if we really want to align with C
implementations using scanf, we should also handle negative numbers
(00:01:-5,000 means 55 seconds), as well as octal and hexadecimal.

* The current syntax looks like XML or HTML but has very different  
parsing. Voices like <narrator> don't create nodes at all and for tags  
like <i> the parser has a whitelist and also special rules for inserting
<rt>. Unless there are strong reasons for this, then for simplicity and  
forward compatibility, I'd much rather have the parser create an actual  
DOM (not a tree of "WebSRT Node Object") that reflects the input. If we  
also support attributes then people who actually want to use their (silly)  
<font color=red> tags can do so with CSS. This could also work as styling  
hooks. Obviously, a WebSRT parser should create elements in another
namespace, since we don't want e.g. <img> to work inside cues.

* The "bad cue" handling is stricter than it should be. After collecting
an id, the next line must be a timestamp line. Otherwise, we skip
everything until a blank line, so in the following example the parser would
jump to "bad cue" on the line "2" and skip the whole cue:

1
2
00:00:00.000 --> 00:00:01.000
Bla

This doesn't match what most existing SRT parsers do, as they simply look
for timing lines and ignore everything else. If we really need to collect
the id instead of ignoring it like everyone else, the parsing should be
more robust, so that a valid timing line always begins a new cue.
Personally, I'd prefer that the id simply be ignored and that we use some
form of in-cue markup for styling hooks.

* At the beginning of "cue text loop" (step 28) a newline should be  
collected.
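The timestamp and "bad cue" points above can be illustrated with a small
sketch. This is not the spec's parsing algorithm, just a rough model of the
behaviour argued for: the digits after the separator are read as a decimal
fraction, and any valid timing line begins a new cue while other lines
outside a cue are ignored. All names are hypothetical, and the exact
timestamp grammar (two or more hour digits, "." or "," as separator) is an
assumption.

```python
import re

# Assumed grammar: hh:mm:ss followed by "." or "," and one or more digits.
TIMESTAMP = r"(\d{2,}):(\d{2}):(\d{2})[.,](\d+)"
TIMING_LINE = re.compile(r"^\s*" + TIMESTAMP + r"\s*-->\s*" + TIMESTAMP + r"\s*$")


def seconds_scanf_style(h, m, s, frac):
    """Fraction read as integer milliseconds, the behaviour criticised above."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(frac) / 1000


def seconds_decimal(h, m, s, frac):
    """Fraction read as the digits of a decimal fraction."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(frac) / 10 ** len(frac)


def scan_cues(text):
    """Return a list of (start, end, cue_text); any timing line starts a new cue."""
    cues = []
    current = None
    for line in text.splitlines():
        match = TIMING_LINE.match(line)
        if match:
            g = match.groups()
            current = [seconds_decimal(*g[:4]), seconds_decimal(*g[4:]), []]
            cues.append(current)
        elif not line.strip():
            current = None  # a blank line ends the current cue
        elif current is not None:
            current[2].append(line)
        # Any other line (an id, stray text before a timing line) is ignored.
    return [(start, end, "\n".join(lines)) for start, end, lines in cues]
```

With this model the example cue with the extra "2" line is kept rather than
skipped, and 00:00:00.1 comes out as 0.1 seconds instead of 0.001.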

[1] http://universalsubtitles.org/opensubtitles2010
[2] http://www.w3.org/Bugs/Public/show_bug.cgi?id=10320
[3] http://www.openvideoconference.org/
[4] http://people.opera.com/philipj/2010/10/02/ovc/
[5] http://people.opera.com/philipj/2010/10/02/ovc/demos/captions.html
[6] http://people.opera.com/philipj/2010/10/02/ovc/demos/transcript.html
[7] http://people.opera.com/philipj/2010/10/02/ovc/demos/metadata.html
[8] http://www.foms-workshop.org/foms2010OVC/
[9] http://www.foms-workshop.org/foms2010OVC/pmwiki.php/Main/WebSRT
[10] http://ale5000.altervista.org/subtitles.htm
[11] http://wiki.whatwg.org/wiki/SRT_research

-- 
Philip Jägenstedt
Core Developer
Opera Software