[whatwg] Fwd: Discussing WebSRT and alternatives/improvements

Mon Aug 9 16:34:02 PDT 2010

On Tue, Aug 10, 2010 at 12:04 AM, Philip Jägenstedt <philipj at opera.com> wrote:

> On Sat, 07 Aug 2010 09:57:39 +0200, Silvia Pfeiffer <
> silviapfeiffer1 at gmail.com> wrote:
>
>  Hi Philip,
>>
>> On Sat, Aug 7, 2010 at 1:50 AM, Philip Jägenstedt <philipj at opera.com>
>> wrote:
>>
>>  * there is a possibility to provide script that just affects the
>>>
>>>> time-synchronized text resource
>>>>
>>>>
>>> I agree that some metadata would be useful, more on that below. I'm not
>>> sure why we would want to run scripts inside the text document, though,
>>> when
>>> that can be accomplished by using the TimedTrack API from the containing
>>> page.
>>>
>>
>>
>>
>> Scripts inside a timed text document would only be useful for applications
>> that use the track not in conjunction with a Web page.
>>
>
> Do you mean that media players could include a JavaScript engine just for
> supporting scripts in WebSRT? Not to say that it can't happen, but it seems
> a bit unlikely.

Yes, it's indeed an "out there" feature and I am not worried about having it
now. I just mentioned it as a simple possibility for extension.

>
>  2. There is a natural mapping of WebSRT into in-band text tracks.
>>>
>>>> Each cue naturally maps into an encoding page (just like a WMML cue does,
>>>> too). But in WebSRT, because the setup information is not brought in a
>>>> hierarchical element surrounding all cues, it is easier to just chuck
>>>> anything that comes before the first cue into an encoding header page.
>>>> For
>>>> WMML, this problem can be solved, but it is less natural.
>>>>
>>>>
>>> I really like the idea of letting everything before the first timestamp
>>> in
>>> WebSRT be interpreted as the header. I'd want to use it like this:
>>>
>>> # author: Fan Subber
>>> # voices: <1> Boy
>>> #         <2> Girl
>>>
>>> 01:23:45.678 --> 01:23:46.789
>>> <1> Hello
>>>
>>> 01:23:48.910 --> 01:23:49.101
>>> <2> Hello
>>>
>>> It's not critical that the format of the header be machine-readable, but
>>> we
>>> could of course make up a key-value syntax, use JSON, or something else.
>>>
>>
>>
>>
>> I disagree. I think it's absolutely necessary that the format of the
>> header
>> be machine-readable. Just like EXIF in images is machine readable or ID3
>> in
>> MP3 is machine-readable. It would be counter-productive not to have it
>> machine-readable, in particular useless to archiving and media management
>> solutions.
>>
>
> OK, so maybe key-values?
>
> Author: Fan Subber
> Voice: <1> Boy
> Voice: <2> Girl
>
>
> 01:23:45.678 --> 01:23:46.789
> <1> Hello
>
> This looks a bit like HTTP headers. (I'm not sure I'd actually want to
> allow multiple occurrences of the same key, in practice that seems to result
> in inconsistencies in how people mark up multiple authors.)

Yes, anything that can replicate the name-value possibilities of the <meta>
element should be fine.
Multiple occurrences make sense for some fields and not for others.
I wonder if we would need to make a defined list of what should go in here
or just define a general mechanism. HTML has a general mechanism (with
<meta>) while most subtitle formats have a defined set of fields, e.g.
http://en.wikipedia.org/wiki/LRC_%28file_format%29 (ID3 tags) or
http://www.matroska.org/technical/specs/subtitles/ssa.html (SSA headers).
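To illustrate the general name-value mechanism, here is a minimal sketch in Python that treats everything before the first timing line as the header and collects "Key: value" lines. The function name parse_header and the choice to keep repeated keys as a list are assumptions made for illustration only; nothing in the WebSRT draft defines this syntax.

```python
import re

# Timing line as in the examples above, e.g. "01:23:45.678 --> 01:23:46.789".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}")


def parse_header(text):
    """Collect "Key: value" lines that appear before the first timing line.

    Hypothetical helper: the header syntax is only a proposal in this thread.
    Repeated keys are kept in a list, so a field like "Voice" may occur
    more than once.
    """
    header = {}
    for line in text.splitlines():
        if TIMING.match(line):
            break  # the first cue starts here, so the header is over
        key, sep, value = line.partition(":")
        if sep and key.strip():
            header.setdefault(key.strip(), []).append(value.strip())
    return header


sample = """Author: Fan Subber
Voice: <1> Boy
Voice: <2> Girl

01:23:45.678 --> 01:23:46.789
<1> Hello
"""

print(parse_header(sample))
# {'Author': ['Fan Subber'], 'Voice': ['<1> Boy', '<2> Girl']}
```

A format that defined a fixed list of fields could instead reject unknown keys, or reject repeats for fields where only one value makes sense.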

>
>  I'm not sure of the best solution. I'd quite like the ability to use
>>> arbitrary voices, e.g. to use the names/initials of the speaker rather
>>> than
>>> a number, or to use e.g. <shouting> in combination with CSS :before {
>>> content: 'Shouting: ' } or similar to adapt the display for different
>>> audiences (accessibility, basically).
>>>
>>
>>
>>
>> I agree. I think we can go back to using <span> and @class and @id and that
>> would solve it all.
>>
>
> I guess this is in support of Henri's proposal of parsing the cue using the
> HTML fragment parser (same as innerHTML)? That would be easy to implement,
> but how do we then mark up speakers? Using <span class="narrator"></span>
> around each cue is very verbose. HTML isn't very good for marking up dialog,
> which is quite a limitation when dealing with subtitles...

I actually think that the <span @class> mechanism is much more flexible than
what we have in WebSRT right now. If we want multiple speakers to be able to
speak in the same subtitle, then that's not possible in WebSRT. It's a
little more verbose in HTML, but not massively.

We might be able to add a special markup similar to the <[timestamp]> markup
that Hixie introduced for Karaoke. This is beyond the innerHTML parser and I
am not sure if it breaks it. But if it doesn't, then maybe we can also
introduce a <[voice]> marker to be used similarly?

>
>   * there is no means to identify which parser is required in the cues (is
>>>
>>>> it
>>>> "plain text", "minimal markup", or "anything"?) and therefore it is not
>>>> possible for an application to know how it should parse the cues.
>>>>
>>>>
>>> All the types that are actually for visual rendering are parsed in the
>>> same
>>> way, aren't they? Of course there's no way for non-browsers to know that
>>> metadata tracks aren't interesting to look at as subtitles, but I think
>>> showing the user the garbage is a quicker way to communicate that the file
>>> isn't
>>> for direct viewing than hiding the text or similar.
>>>
>>
>>
>>
>> The spec says that files of kind "descriptions" and "metadata" are not
>> displayed. It seems though that the parsing section will try two
>> interfaces:
>> HTML and plain. I think there is a disconnect there. If we already know
>> that
>> it's not parsable in HTML, why even try?
>>
>
> I was confused. The parsing algorithm does the same thing regardless of
> what kind of text track it is dealing with. I guess what you're saying is
> that non-browser applications also need to know that something is e.g.
> chapter markers, so that it can display it appropriately?
>
> I don't have a strong opinion, but repeating the same information both in
> the containing document and in the subtitle file means that one of them will
> be ignored by browsers. People will copy-paste the ignored one and it will
> end up being wrong a lot of the time.

I don't see a problem with repeating this information. There will be files
and other file formats that do not have the "kind" inside the file - maybe
because the files are always only used for captions/subtitles or only for
lyrics/karaoke and thus don't need an extra specification. But for WebSRT
files, which provide a platform for time-synchronized text, this is
important information to have inside the file - or assume a default of
"captions" or so. Thus, for files that do not have a "kind", the
specification in HTML is necessary. For those that do, it provides the
author with an opportunity to take that hint or even to override it. An
authoring application could even alert a Web developer if they are
referencing a "chapters" file with a "subtitles" @kind attribute. But
obviously what is stated in the HTML page will be what matters.
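As a rough sketch of that precedence, the following Python shows how an authoring tool might resolve the kind and produce a warning on a mismatch. The function resolve_kind, its parameters, and the fallback to "captions" when neither side declares a kind are assumptions for illustration; no in-file "kind" field exists in the current draft.

```python
def resolve_kind(html_kind=None, file_kind=None, default="captions"):
    """Return (kind, warning) for a text track.

    Hypothetical helper. html_kind is the @kind stated in the HTML page,
    file_kind is a kind declared inside the file, and either may be None.
    The page always wins; the file's value is only a hint.
    """
    warning = None
    if html_kind is not None:
        if file_kind is not None and file_kind != html_kind:
            warning = f'file declares "{file_kind}" but the page says "{html_kind}"'
        return html_kind, warning
    # The page says nothing: take the file's hint, else the assumed default.
    return (file_kind or default), warning


print(resolve_kind("subtitles", "chapters"))
print(resolve_kind(None, "chapters"))
print(resolve_kind())
```

The first call returns the page's kind together with a warning, which is the point where an authoring application could alert the Web developer.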

>
>   * there is no version number on the format, thus it will be difficult to
>>>
>>>> introduce future changes.
>>>>
>>>>
>>> I think we shouldn't have a version number, for the same reason that CSS
>>> and HTML don't really have versions. If we evolve the WebSRT spec, it
>>> should
>>> be in a backwards-compatible way.
>>>
>>
>>
>> CSS and HTML are structured formats where you ignore things that you
>> cannot
>> interpret. But the parsing is fixed and extensions play within this
>> parsing
>> framework. I have my doubts that is possible with WebSRT. Already one
>> extension that we are discussing here will break parsing: the introduction
>> of structured headers. Because there is no structured way of extending
>> WebSRT, I believe the best way to communicate whether it is backwards
>> compatible is through a version number. We can change the minor versions
>> if
>> the compatibility is not broken - it communicates though what features are
>> being used - and we can change the major version if compatibility is
>> broken.
>>
>
> Similarly, I think that the WebSRT parser should be designed to ignore
> things that it doesn't recognize, in particular unknown voices (if we keep
> those). Requiring parsers to fail when the version number is increased

Oh, you misunderstood me: I am not saying that parsers have to fail - it's
good if they don't. But I am saying that if we make a change to the
specification that is not backwards compatible with the previous one and
will thus invariably break parsers, we have to notify parsers somehow such
that if they get parse errors they can e.g. notify the user that this is a
new version of the WebSRT format which their software doesn't support yet.
Think for example about the case where we had a requirement that a double
newline starts a new cue, but now we want to introduce a means where the
double newline is escaped and can be made part of a cue.

Other formats keep track of their version, such as MS Word files. It is to
be hoped that most new features can be introduced without breaking backwards
compatibility and we can write the parsing requirements such that certain
things will be ignored, but in and of itself, WebSRT doesn't provide for
this extensibility. Right now, there is for example extensibility with the
"WebSRT settings parsing" (that's the stuff behind the timestamps) where
further "setting:value" settings can be introduced. But for example the
introduction of new "cue identifiers" (that's the <> marker at the start of
a cue) would be difficult without a version string, since anything that
doesn't match the given list will just be parsed as a cue-internal tag and
thus end up as part of the cue text where plain text parsing is used.
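A minimal sketch of how a parser could use such a version string, in Python. The "major.minor" string format, the name check_version, and the assumption that a parser also reads files with an older major version are all hypothetical; WebSRT defines no version field today.

```python
SUPPORTED = (1, 0)  # hypothetical: the version this parser was written for


def check_version(declared, supported=SUPPORTED):
    """Classify a declared "major.minor" version against what the parser supports.

    Returns "unsupported" when the major version is newer (compatibility is
    broken, so the user should be told), "newer-features" when only the minor
    version is newer (parse on and ignore what is unknown), and "ok" otherwise.
    """
    major, minor = (int(part) for part in declared.split("."))
    if major > supported[0]:
        return "unsupported"
    if major == supported[0] and minor > supported[1]:
        return "newer-features"
    return "ok"


for version in ("1.0", "1.3", "2.0"):
    print(version, check_version(version))
```

The parser does not have to fail in the "unsupported" case. It can still try to parse, and use the result only to explain any parse errors to the user.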

> makes it harder to introduce changes to the format, because you'll have to
> either break all existing implementations or provide one subtitle file for
> each version. (Having a version number but letting parsers ignore it is just
> weird, quite like in HTML.)
>
> I filed a bug suggesting that voice is allowed to be an arbitrary string: <
> http://www.w3.org/Bugs/Public/show_bug.cgi?id=10320> (From the point of
> view of the parser, it still wouldn't be valid syntax.)

As it stands, the voice marker is more of a "WebSRT setting" for the
complete cue and should probably be moved up with the other "WebSRT
settings", since it's not markup inside the cue like the others and should
not end up as a token during parsing.

>
>   2. Break the SRT link.
>>>
>>>>
>>>>
>>>  * the mime type of WebSRT resources should be a different mime type to
>>> SRT
>>>
>>>> files, since they are so fundamentally different; e.g. text/websrt
>>>>
>>>> * the file extension of WebSRT resources should be different from SRT
>>>> files,
>>>> e.g. wsrt
>>>>
>>>>
>>> I'm not sure if either of these would make a difference.
>>>
>>
>>
>> Really? How do you propose that a media player identifies that it cannot
>> parse a WebSRT file that has random metadata in it when it is called .srt
>> and provided under the same mime type as SRT files? Or a transcoding
>> pipeline that relies on srt files just being plain old simple SRT. It
>> breaks
>> expectations with users, with developers and with software.
>>
>
> I think it's unlikely that people will offer download links to SRT files
> that aren't useful outside of the page, so random metadata isn't likely to
> reach end users or applications by accident. Also, most media frameworks
> rely mainly on sniffing, so even a file that uses lots of WebSRT-only
> features is quite likely going to be detected as SRT anyway. At least in
> GStreamer, the file extension is given quite little weight in guessing the
> type and MIME isn't used at all (because the sniffing code doesn't know
> anything about HTTP). Finally, seeing random metadata displayed on screen is
> about as good an indication that the file is "broken" as the application
> failing to recognize the file completely.
>

But it is a very poor user experience and a "WTF: I thought this application
supported SRT and now it doesn't". Transcoding pipelines will break in
existing productions that expect the simplest SRT without much tolerance for
extra characters and they will have the extra markup of voice and ruby etc
in plain sight, since they are not built as WebSRT parsers. It will lead to
many many headaches. We're even against serving a WebM resource as a .mkv
video with video/x-matroska MIME type when WebM is really completely
compatible with Matroska. So, why do it with WebSRT and SRT?

>
> On the other hand, keeping the same extension and (unregistered) MIME type
> as SRT has plenty of benefits, such as immediately being able to use
> existing SRT files in browsers without changing their file extension or MIME
> type.

There is no harm in browsers accepting both MIME types if they are sure
they can parse old srt as well as new websrt. But these two formats are
different enough that they should be given a different extension and mime
type. I do not see a single advantage in stealing the MIME type of an
existing format for a new specification.

>
>   * there is no definition of the "canvas" dimensions that the cues are
>>>
>>>> prepared for (width/height) and expected to work with other than saying
>>>> it
>>>> is the video dimensions - but these can change and the proportions
>>>> should
>>>> be
>>>> changed with that
>>>>
>>>>
>>> I'm not sure what you're saying here. Should the subtitle file be
>>> hard-coded to a particular size? In the quite peculiar case where the
>>> same
>>> subtitles really don't work at two different resolutions, couldn't we
>>> just
>>> have two files? In what cases would this be needed?
>>>
>>
>>
>> Most subtitles will be created with a specific width and height in mind.
>> For
>> example, the width in characters relies on the video canvas having at
>> least
>> that size and the number of lines used usually refers to a lower third of
>> a
>> video - where that is too small, it might cover the whole video. So, my
>> proposal is not to hard-code the subtitles to a particular size, but to
>> put
>> the minimum width and height that are being used for the creation of the
>> subtitles into the file. Then, the file can be scaled below or above this
>> size to adjust to the actual available space.
>>
>
> In practice, does this mean scaling font-size by
> width_actual/width_intended or similar? Personally, I prefer subtitles to be
> something like 20 screen pixels regardless of video size, as that is
> readable. Making them bigger hides more of the video, while making them
> smaller makes them hard to read. But I guess we could let the CSS media
> query min-width and similar be evaluated against the size of the containing
> video element, to make it possible anyway.

Have you ever tried to keep the small font size of subtitles on a 320x240
video when going full-screen? They are almost unusable at that size. YouTube
doesn't do a good job at that, incidentally, so you can go check it out
there - go full-screen and see how tiny the captions become, then step back
from your screen to where you'd want to watch the video from and notice how
the captions are basically unreadable.

When you scale the font-size with the video, you do not hide more of the
video - you hide the exact same part of the video. Video and font get larger
in the same way. And that's exactly the need that we have.
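The arithmetic can be sketched in a few lines of Python, using the width_actual/width_intended ratio mentioned above. The function name and the 12 px starting size are assumptions chosen for illustration.

```python
def scaled_font_size(base_px, intended_width, actual_width):
    """Scale a font size by the same factor as the video width."""
    return base_px * actual_width / intended_width


# Hypothetical numbers: 12 px captions authored for a 320x240 video,
# shown full-screen at 1280x960.
small = 12
large = scaled_font_size(small, intended_width=320, actual_width=1280)

print(large)        # 48.0
print(small / 240)  # 0.05 of the video height before scaling
print(large / 960)  # 0.05 of the video height after scaling
```

The fraction of the video height taken by a line of text is the same in both cases, so the scaled captions cover the same part of the picture.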

Cheers,
Silvia.