[whatwg] Discussing WebSRT and alternatives/improvements

Philip Jägenstedt philipj at opera.com
Tue Aug 24 08:07:43 PDT 2010

On Tue, 24 Aug 2010 16:21:28 +0200, Henri Sivonen <hsivonen at iki.fi> wrote:

> On Aug 5, 2010, at 18:01, Silvia Pfeiffer wrote:
> On Aug 9, 2010, at 17:04, Philip Jägenstedt wrote:
>> I guess this is in support of Henri's proposal of parsing the cue using  
>> the HTML fragment parser (same as innerHTML)? That would be easy to  
>> implement, but how do we then mark up speakers? Using <span  
>> class="narrator"></span> around each cue is very verbose. HTML isn't  
>> very good for marking up dialog, which is quite a limitation when  
>> dealing with subtitles...
> How often do captions distinguish two or more speakers in the same cue  
> by styling them differently? In my experience, translation subtitles for  
> TV, DVDs and theatrical movies virtually never do (but it's assumed that  
> the reader of the subtitles can work out who is talking from the sound  
> track, so I can see why this might not generalize to captioning for the  
> deaf).

In the same cue? I'm not sure what you mean, but regardless I've hardly  
ever seen different styles for different speakers. I remember once  
watching a DVD that used differently colored subtitles for different  
speakers and finding it very annoying. The one case where it's actually  
useful (IMO) is duet karaoke, as in  
<http://www.youtube.com/watch?v=5tOxfjHLK-A>.  
Incidentally, since it's not possible to have multiple voices per cue,  
WebSRT doesn't actually support this (lines sung together should be a  
third color).

Personally, I'm mostly interested in it for the metadata so that one can  
produce transcripts or search by speaker, but numerical voices don't help  
a great deal here. In the current state of the spec, I wouldn't miss  
voices much if they were removed.

>> Similarly, I think that the WebSRT parser should be designed to ignore  
>> things that it doesn't recognize,
> I agree. Reusing the HTML fragment parsing algorithm would provide this  
> for stuff within the cue text "for free".
> On Aug 10, 2010, at 12:49, Philip Jägenstedt wrote:
>> An HTML parser parsing <1> or <00:01:30> will produce text nodes "<1>"  
>> and "<00:01:30>". Without having read the HTML parsing algorithm I  
>> guess that elements need to begin with a letter or similar. So, it's  
>> not possible to (ab)use the HTML parser to handle inner timestamps or  
>> numerical voices, we'd have to replace those with something else,  
>> probably more verbose.
> Given that voices (according to this thread; I didn't check) are a Hixie  
> invention rather than an original feature of SRT, the <1> syntax doesn't  
> have to be that way for compat. Instead, something that works in HTML  
> without parser modifications could be used.

Yep, if we actually want to use an HTML parser.
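
A rough way to see the behaviour described in the quote above, assuming
Python's standard-library html.parser as a stand-in. It is a tolerant
tokenizer, not an implementation of the HTML fragment parsing algorithm,
so treat this only as a sketch. The names CueProbe and probe are made up
for the illustration:

```python
from html.parser import HTMLParser


class CueProbe(HTMLParser):
    """Records which start tags and which text the tokenizer reports."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []   # names of start tags seen, in order
        self.text = []   # chunks of character data, in order

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def probe(cue):
    """Return (start tag names, concatenated text) for one cue string."""
    parser = CueProbe()
    parser.feed(cue)
    parser.close()
    return parser.tags, "".join(parser.text)


if __name__ == "__main__":
    # "<1>" and "<00:01:30>" do not start with a letter after "<",
    # so they should come back as text rather than as tags.
    tags, text = probe("<1>Hello <i>world</i> <00:01:30>again")
    print(tags)
    print(text)
```

With this tokenizer only <i> is reported as a tag, while "<1>" and
"<00:01:30>" end up in the text, which matches the guess that a tag name
has to begin with a letter.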

> As for <00:01:30>, normal subtitles and, AFAIK, normal captions don't  
> need time-based revelation of parts of the cue. (I'm not considering  
> anime fansubbing "normal"; I mean TV, DVD and movie theater subtitles.  
> Also, continuous revelation of live captioning isn't relevant to the  
> <00:01:30> feature.) Since the <00:01:30> isn't essential for making the  
> feature set of HTML5 accessible, I think the <00:01:30> feature for  
> karaoke and anime fansubbing should be considered for removal if it  
> happens to have any undesirable properties--and not working without HTML  
> parser modifications is such an undesirable property.
> I'd be OK with not supporting karaoke or anime fansubbing at all  
> declaratively (requiring those use cases to be addressed in JavaScript)  
> or with using more verbose syntax like <t t=00:01:30>...</t>.

I'd also be fine with not having intra-cue timing, but adding new elements  
or attributes to HTML just for WebSRT seems odd.

I'm still unconvinced that using the HTML fragment parser is a good idea.  
Elsewhere in this thread I raised these concerns:

* Memory overhead by creating an HTML document per cue. This is just a  

* Needing to load external resources before the cue is displayed, or have  
very different results depending on network speed and cache.

To address the last point we could make it invalid to use any tags other  
than <i>, <b>, <ruby> and <rt> in the cues, but the benefit of using an  
HTML parser is then quite limited indeed.
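
As a minimal sketch of what such a restriction could look like in a
checking tool, again assuming Python's html.parser as a stand-in rather
than a real HTML fragment parser. ALLOWED_TAGS, CueTagChecker and
disallowed_tags are hypothetical names, and the example URL is a
placeholder:

```python
from html.parser import HTMLParser

# The four tags named above; everything else is reported as invalid.
ALLOWED_TAGS = {"i", "b", "ruby", "rt"}


class CueTagChecker(HTMLParser):
    """Collects start tags that fall outside the allowed set."""

    def __init__(self):
        super().__init__()
        self.disallowed = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if tag not in ALLOWED_TAGS:
            self.disallowed.append(tag)


def disallowed_tags(cue_text):
    """Return the disallowed start tag names found in a cue, in order."""
    checker = CueTagChecker()
    checker.feed(cue_text)
    checker.close()
    return checker.disallowed


if __name__ == "__main__":
    print(disallowed_tags("<i>Hello</i> <b>there</b>"))
    print(disallowed_tags('<img src="http://example.com/a.png"> <span>Hi</span>'))
```

A check like this only flags the markup. It says nothing about what a
browser would do with the disallowed tags, which is the part that makes
the external-resource concern awkward.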

Philip Jägenstedt
Core Developer
Opera Software
