[whatwg] Fwd: Discussing WebSRT and alternatives/improvements

Tue Aug 10 02:49:47 PDT 2010

On Tue, 10 Aug 2010 01:34:02 +0200, Silvia Pfeiffer  
<silviapfeiffer1 at gmail.com> wrote:

> On Tue, Aug 10, 2010 at 12:04 AM, Philip Jägenstedt  
> <philipj at opera.com> wrote:
>
>> On Sat, 07 Aug 2010 09:57:39 +0200, Silvia Pfeiffer <
>> silviapfeiffer1 at gmail.com> wrote:
>>
>>> Hi Philip,
>>>
>>> On Sat, Aug 7, 2010 at 1:50 AM, Philip Jägenstedt <philipj at opera.com>
>>> wrote:
>>>
>>>> I'm not sure of the best solution. I'd quite like the ability to use
>>>> arbitrary voices, e.g. to use the names/initials of the speaker rather
>>>> than a number, or to use e.g. <shouting> in combination with CSS
>>>> :before { content: 'Shouting: ' } or similar to adapt the display for
>>>> different audiences (accessibility, basically).
>>>>
>>>
>>>
>>>
>>> I agree. I think we can go back to using <span> and @class and @id and
>>> that would solve it all.
>>>
>>
>> I guess this is in support of Henri's proposal of parsing the cue using  
>> the
>> HTML fragment parser (same as innerHTML)? That would be easy to  
>> implement,
>> but how do we then mark up speakers? Using <span  
>> class="narrator"></span>
>> around each cue is very verbose. HTML isn't very good for marking up  
>> dialog,
>> which is quite a limitation when dealing with subtitles...
>
>
>
> I actually think that the <span @class> mechanism is much more flexible  
> than
> what we have in WebSRT right now. If we want multiple speakers to be  
> able to
> speak in the same subtitle, then that's not possible in WebSRT. It's a
> little more verbose in HTML, but not massively.
>
> We might be able to add a special markup similar to the <[timestamp]>  
> markup
> that Hixie introduced for Karaoke. This is beyond the innerHTML parser  
> and I
> am not sure if it breaks it. But if it doesn't, then maybe we can also
> introduce a <[voice]> marker to be used similarly?

An HTML parser parsing <1> or <00:01:30> will produce text nodes "<1>" and
"<00:01:30>". Without having read the HTML parsing algorithm, I guess that
element names need to begin with a letter or similar. So, it's not possible
to (ab)use the HTML parser to handle inner timestamps or numerical voices;
we'd have to replace those with something else, probably more verbose.
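
As a rough illustration of that guess, here is a sketch using Python's
html.parser. It is not the HTML5 fragment parser that innerHTML uses, but it
applies the same basic rule that a tag name has to start with a letter. The
TagLogger class and tokenize helper are names made up for this example.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records start tags and text separately (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def tokenize(cue):
    """Return (list of start tag names, concatenated text) for a cue string."""
    parser = TagLogger()
    parser.feed(cue)
    parser.close()
    return parser.tags, "".join(parser.text)


# Numeric voices and inner timestamps come out as plain text, not as tags.
print(tokenize("<1> Hi <00:01:30> there"))
# A marker that starts with a letter is treated as an element.
print(tokenize("<narrator> Hi"))
```

So a named voice such as <narrator> would survive as an element, while <1>
and <00:01:30> would need a different syntax.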

>>>>> * there is no version number on the format, thus it will be difficult
>>>>> to introduce future changes.
>>>>>
>>>> I think we shouldn't have a version number, for the same reason that  
>>>> CSS
>>>> and HTML don't really have versions. If we evolve the WebSRT spec, it
>>>> should
>>>> be in a backwards-compatible way.
>>>>
>>>
>>>
>>> CSS and HTML are structured formats where you ignore things that you
>>> cannot interpret. But the parsing is fixed and extensions play within
>>> this parsing framework. I have my doubts that this is possible with
>>> WebSRT. Already one extension that we are discussing here will break
>>> parsing: the introduction of structured headers. Because there is no
>>> structured way of extending WebSRT, I believe the best way to
>>> communicate whether it is backwards compatible is through a version
>>> number. We can change the minor version if the compatibility is not
>>> broken - it communicates though what features are being used - and we
>>> can change the major version if compatibility is broken.
>>>
>>
>> Similarly, I think that the WebSRT parser should be designed to ignore
>> things that it doesn't recognize, in particular unknown voices (if we  
>> keep
>> those). Requiring parsers to fail when the version number is increased
>
>
> Oh, you misunderstood me: I am not saying that parsers have to fail - it's
> good if they don't. But I am saying that if we make a change to the
> specification that is not backwards compatible with the previous one and
> will thus invariably break parsers, we have to notify parsers somehow such
> that if they get parse errors they can e.g. notify the user that this is a
> new version of the WebSRT format which their software doesn't support yet.

A browser won't bother their users by saying "hey, there was something in  
this page I didn't understand", as users won't know what to do to fix it.

> Think for example about the case where we had a requirement that a double
> newline starts a new cue, but now we want to introduce a means where the
> double newline is escaped and can be made part of a cue.
>
> Other formats keep track of their version, such as MS Word files. It is  
> to
> be hoped that most new features can be introduced without breaking  
> backwards
> compatibility and we can write the parsing requirements such that certain
> things will be ignored, but in and of itself, WebSRT doesn't provide for
> this extensibility. Right now, there is for example extensibility with  
> the
> "WebSRT settings parsing" (that's the stuff behind the timestamps) where
> further "setting:value" settings can be introduced. But for example the
> introduction of new "cue identifiers" (that's the <> marker at the start  
> of
> a cue) would be difficult without a version string, since anything that
> doesn't match the given list will just be parsed as cue-internal tag and
> thus end up as part of the cue text where plain text parsing is used.

The bug I filed suggested allowing arbitrary voices, to simplify the
parser and to make future extensions possible. For a web format I think
this is a better approach than versioning. I haven't done a full
review of the parser, but there are probably more places where it could be
more forgiving so as to allow future tweaking.
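
To make the "arbitrary voices" idea concrete, here is a minimal sketch of a
forgiving voice parser. It assumes the voice is an optional <...> marker at
the very start of the cue text, and the split_voice name is invented for this
example. It does not attempt to tell voices apart from inline tags such as
<b>, which a real parser would have to do.

```python
import re

# Assumed syntax: an optional "<voice>" marker at the very start of the cue.
# Any run of characters without "<", ">" or whitespace is accepted as a voice.
VOICE_RE = re.compile(r"^<([^<>\s]+)>\s*")


def split_voice(cue_text):
    """Return (voice, remaining_text); voice is None when no marker is found."""
    match = VOICE_RE.match(cue_text)
    if match is None:
        return None, cue_text
    return match.group(1), cue_text[match.end():]


# Numbers, initials and descriptive names all go through the same code path.
for cue in ["<1> Hello", "<PJ> Hello", "<shouting>Stop!", "No marker here"]:
    print(split_voice(cue))
```

Because nothing is checked against a fixed list, a voice introduced later
would not require existing parsers to change.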

>> makes it harder to introduce changes to the format, because you'll have  
>> to
>> either break all existing implementations or provide one subtitle file  
>> for
>> each version. (Having a version number but letting parsers ignore it is  
>> just
>> weird, quite like in HTML.)
>>
>> I filed a bug suggesting that voice is allowed to be an arbitrary  
>> string: <
>> http://www.w3.org/Bugs/Public/show_bug.cgi?id=10320> (From the point of
>> view of the parser, it still wouldn't be valid syntax.)
>
>
>
> As it stands, the voice marker is more of a "WebSRT setting" for the
> complete cue and should probably be moved up with the other "WebSRT
> settings", since it's not markup inside the cue like the others and  
> should
> not end up as a token during parsing.

Yeah, that would also be an option.

>>>>> 2. Break the SRT link.
>>>>>
>>>>> * the mime type of WebSRT resources should be a different mime type to
>>>>> SRT files, since they are so fundamentally different; e.g. text/websrt
>>>>>
>>>>> * the file extension of WebSRT resources should be different from SRT
>>>>> files, e.g. wsrt
>>>>>
>>>> I'm not sure if either of these would make a difference.
>>>>
>>>
>>>
>>> Really? How do you propose that a media player identifies that it  
>>> cannot
>>> parse a WebSRT file that has random metadata in it when it is called  
>>> .srt
>>> and provided under the same mime type as SRT files? Or a transcoding
>>> pipeline that relies on srt files just being plain old simple SRT. It
>>> breaks
>>> expectations with users, with developers and with software.
>>>
>>
>> I think it's unlikely that people will offer download links to SRT files
>> that aren't useful outside of the page, so random metadata isn't likely  
>> to
>> reach end users or applications by accident. Also, most media frameworks
>> rely mainly on sniffing, so even a file that uses lots of WebSRT-only
>> features is quite likely going to be detected as SRT anyway. At least in
>> GStreamer, the file extension is given quite little weight in guessing  
>> the
>> type and MIME isn't used at all (because the sniffing code doesn't know
>> anything about HTTP). Finally, seeing random metadata displayed on  
>> screen is
>> about as good an indication that the file is "broken" as the application
>> failing to recognize the file completely.
>>
>
> But very poor user experience and a "WTF: I thought this application
> supported SRT and now it doesn't". Transcoding pipelines will break in
> existing productions that expect the simplest SRT without much tolerance  
> for
> extra characters and they will have the extra markup of voice and ruby  
> etc
> in plain sight, since they are not built as WebSRT parsers. It will lead
> to many many headaches.

Yes, it cannot be denied that there will be some confusion.
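
For reference, the kind of content sniffing mentioned above could look
roughly like the sketch below. This is not GStreamer's actual typefinding
code; the looks_like_srt function, the regular expression and the 20-line
limit are all assumptions made for illustration.

```python
import re

# A timing line as commonly found in SRT files,
# e.g. "00:00:01,000 --> 00:00:04,000". Trailing text is tolerated.
TIMING_RE = re.compile(
    r"^\d{2,}:\d{2}:\d{2}[,.]\d{3}\s+-->\s+\d{2,}:\d{2}:\d{2}[,.]\d{3}"
)


def looks_like_srt(data, max_lines=20):
    """Guess from content alone: is there a timing line near the start?"""
    text = data.decode("utf-8", errors="replace")
    lines = text.splitlines()[:max_lines]
    return any(TIMING_RE.match(line.strip()) for line in lines)


plain_srt = b"1\n00:00:01,000 --> 00:00:04,000\nHello\n"
with_extras = b"1\n00:00:01,000 --> 00:00:04,000 A:start\n<narrator> Hello\n"
print(looks_like_srt(plain_srt), looks_like_srt(with_extras))
```

A sniffer of this kind never looks at the extension or MIME type, so a file
using WebSRT-only features is detected as SRT just the same.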

> We're even against serving a WebM resource as a .mkv
> video with video/x-matroska MIME type when WebM is really completely
> compatible with Matroska. So, why do it with WebSRT and SRT?

For the record, we at Opera argued against changing the EBML doctype to  
"webm" and artificially breaking compatibility with Matroska. However,  
that wasn't our decision to make and now it's better to only support one  
doctype/extension/mime type.

>> On the other hand, keeping the same extension and (unregistered) MIME  
>> type
>> as SRT has plenty of benefits, such as immediately being able to use
>> existing SRT files in browsers without changing their file extension or  
>> MIME
>> type.
>
>
> There is no harm for browsers to accept both MIME types if they are sure
> they can parse old srt as well as new websrt. But these two formats are
> different enough that they should be given a different extension and mime
> type. I do not see a single advantage in stealing the MIME type of an
> existing format for a new specification.

But there's no spec for the old SRT; the only thing one could do is parse
it with a WebSRT parser. That would make text/srt and text/websrt
synonymous, which is kind of pointless. The advantages of taking text/srt
are that all existing software to create SRT can be used to create WebSRT
and servers that already send text/srt don't need to be updated. In either
case I think we should support only one mime type.

>>>>> * there is no definition of the "canvas" dimensions that the cues are
>>>>> prepared for (width/height) and expected to work with other than saying
>>>>> it is the video dimensions - but these can change and the proportions
>>>>> should be changed with that
>>>>>
>>>> I'm not sure what you're saying here. Should the subtitle file be
>>>> hard-coded to a particular size? In the quite peculiar case where the
>>>> same
>>>> subtitles really don't work at two different resolutions, couldn't we
>>>> just
>>>> have two files? In what cases would this be needed?
>>>>
>>>
>>>
>>> Most subtitles will be created with a specific width and height in  
>>> mind.
>>> For
>>> example, the width in characters relies on the video canvas having at
>>> least
>>> that size and the number of lines used usually refers to a lower third  
>>> of
>>> a
>>> video - where that is too small, it might cover the whole video. So, my
>>> proposal is not to hard-code the subtitles to a particular size, but to
>>> put the minimum width and height that are being used for the creation of
>>> the subtitles into the file. Then, the file can be scaled below or above
>>> this size to adjust to the actual available space.
>>>
>>
>> In practice, does this mean scaling font-size by
>> width_actual/width_intended or similar? Personally, I prefer subtitles  
>> to be
>> something like 20 screen pixels regardless of video size, as that is
>> readable. Making them bigger hides more of the video, while making them
>> smaller makes them hard to read. But I guess we could let the CSS media
>> query min-width and similar be evaluated against the size of the  
>> containing
>> video element, to make it possible anyway.
>
>
>
> Have you ever tried to keep the small font size of subtitles on a 320x240
> video when going full-screen? They are almost unusable at that size.  
> YouTube
> doesn't do a good job at that, incidentally, so you can go check it out
> there - go full-screen and see how tiny the captions become then step  
> back
> from your screen to where you'd want to watch the video from and notice  
> how
> the captions are basically unreadable.
>
> When you scale the font-size with the video, you do not hide more of the
> video - you hide the exact same part of the video. Video and font get  
> larger
> in the same way. And that's exactly the need that we have.

Existing media players have basically two different ways of handling this.  
The kind you're describing is like MPlayer, where subtitles appear to  
actually be rendered on to the video frames and then scaled together with  
the video. The kind I've used more is like Totem, where subtitles are  
rendered in a separate layer at a fixed size in pixels, regardless of  
whether or not you're watching in fullscreen. This means that word  
wrapping will be different depending on screen size. I find both MPlayer's  
and Totem's behavior annoying in some situations, but personally prefer  
Totem most of the time.

Certainly you want a different size depending on whether you're going to  
watch from your desk or from the sofa, so I guess we'd want to make it  
easy to adjust the size.
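
The two behaviors can be compared with a bit of arithmetic. The sketch below
is only illustrative: the function names are made up, and the 20 pixel base
size and the 320x240 and 1280x960 dimensions are example numbers, not
anything from a spec.

```python
def scaled_font_px(base_px, intended_width, actual_width):
    """MPlayer-like: the font size follows the video width."""
    return base_px * actual_width / intended_width


def fixed_font_px(base_px, intended_width, actual_width):
    """Totem-like: the font size stays the same in screen pixels."""
    return base_px


def covered_fraction(font_px, video_height):
    """Fraction of the video height taken up by one line of text."""
    return font_px / video_height


# Subtitles authored for a 320x240 video at 20px, shown at 1280x960.
for policy in (scaled_font_px, fixed_font_px):
    size = policy(20, 320, 1280)
    print(policy.__name__, size, covered_fraction(size, 960))
```

With scaling, one line covers the same fraction of the video at both sizes;
with a fixed size, it covers a quarter as much when the video is four times
as large, which is where readability from a distance suffers.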

-- 
Philip Jägenstedt
Core Developer
Opera Software