[whatwg] Fwd: Discussing WebSRT and alternatives/improvements

Silvia Pfeiffer silviapfeiffer1 at gmail.com
Tue Aug 10 16:43:01 PDT 2010

On Tue, Aug 10, 2010 at 7:49 PM, Philip Jägenstedt <philipj at opera.com> wrote:

> On Tue, 10 Aug 2010 01:34:02 +0200, Silvia Pfeiffer <
> silviapfeiffer1 at gmail.com> wrote:
>  On Tue, Aug 10, 2010 at 12:04 AM, Philip Jägenstedt <philipj at opera.com
>> >wrote:
>>  On Sat, 07 Aug 2010 09:57:39 +0200, Silvia Pfeiffer <
>>> silviapfeiffer1 at gmail.com> wrote:
>>>> I guess this is in support of Henri's proposal of parsing the cue using
>>> the
>>> HTML fragment parser (same as innerHTML)? That would be easy to
>>> implement,
>>> but how do we then mark up speakers? Using <span class="narrator"></span>
>>> around each cue is very verbose. HTML isn't very good for marking up
>>> dialog,
>>> which is quite a limitation when dealing with subtitles...
>> I actually think that the <span @class> mechanism is much more flexible
>> than
>> what we have in WebSRT right now. If we want multiple speakers to be able
>> to
>> speak in the same subtitle, then that's not possible in WebSRT. It's a
>> little more verbose in HTML, but not massively.
>> We might be able to add a special markup similar to the <[timestamp]>
>> markup
>> that Hixie introduced for Karaoke. This is beyond the innerHTML parser and
>> I
>> am not sure if it breaks it. But if it doesn't, then maybe we can also
>> introduce a <[voice]> marker to be used similarly?
> An HTML parser parsing <1> or <00:01:30> will produce text nodes "<1>" and
> "<00:01:30>". Without having read the HTML parsing algorithm I guess that
> elements need to begin with a letter or similar. So, it's not possible to
> (ab)use the HTML parser to handle inner timestamps or numerical voices, we'd
> have to replace those with something else, probably more verbose.

I have checked the parsing spec, and
http://www.whatwg.org/specs/web-apps/current-work/#tag-open-state indeed
implies that a tag starting with a digit is a parse error. Both the
timestamps and the voice markers are thus a problem when going with an
innerHTML parser. Is there a way to resolve this? I'd quite happily drop
the voice markers in favour of <span @class>, but I am not sure what to do
about the timestamps. We could do what I did in WMML and introduce a <t>
element with the timestamp in an @at attribute, but that is again more
verbose. We could also introduce an @at attribute on <span>, which would
then at least end up in the DOM and could be dealt with specially.
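
To make the tag-open issue concrete, here is a minimal Python sketch. It is
not the HTML tokenizer, only a rough mirror of the decision made for the
character that follows "<". The function names, the timestamp pattern, and
the rewrite into <span at="..."> are assumptions for illustration; the @at
attribute is just the idea floated above, not anything specified.

```python
import re


def classify_tag_open(next_char: str) -> str:
    """Roughly mirror the tokenizer's choice for the character after '<'."""
    if next_char == "!":
        return "markup declaration"
    if next_char == "/":
        return "end tag"
    if next_char.isascii() and next_char.isalpha():
        return "start tag"
    if next_char == "?":
        return "bogus comment (parse error)"
    # Digits and anything else: parse error, '<' is emitted as plain text.
    return "text (parse error)"


# Inner timestamps such as <00:01:30> or <00:01:30.500> (assumed syntax).
_TIMESTAMP = re.compile(r"<(\d{2}:\d{2}:\d{2}(?:\.\d{3})?)>")


def timestamps_to_spans(cue_text: str) -> str:
    """Rewrite inner timestamps into hypothetical <span at="..."> elements."""
    # With one capture group, split() yields: text, stamp, text, stamp, text...
    parts = _TIMESTAMP.split(cue_text)
    out = [parts[0]]
    for i in range(1, len(parts), 2):
        out.append(f'<span at="{parts[i]}">{parts[i + 1]}</span>')
    return "".join(out)
```

With this, classify_tag_open("0") reports plain text while
classify_tag_open("s") reports a start tag, which is why a pre-processing
step like timestamps_to_spans() would be needed before handing cue text to
an innerHTML-style parser.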

For those who think inner timestamps are just a fancy karaoke feature and
not really required: they are also useful for captions, in particular when
recording live captions, which are usually "paint-on". Requirement CC-14
on http://www.w3.org/WAI/PF/HTML/wiki/Media_Accessibility_Requirements also
refers to this need, and 608/708 captions provide this functionality, too.
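
As a rough illustration of "paint-on" behaviour, the following Python sketch
computes how much of a cue would be visible at a given playback time. The
<hh:mm:ss> inner timestamp syntax and the function name are assumptions for
this example only, not a description of any existing parser.

```python
import re

# Inner timestamps of the assumed form <hh:mm:ss>.
_TS = re.compile(r"<(\d{2}):(\d{2}):(\d{2})>")


def visible_text(cue_text: str, cue_start: float, now: float) -> str:
    """Return the part of the cue painted on by time `now` (in seconds)."""
    # With three capture groups, split() yields:
    # text, hh, mm, ss, text, hh, mm, ss, text, ...
    parts = _TS.split(cue_text)
    shown = []
    if now >= cue_start:
        shown.append(parts[0])
    for i in range(1, len(parts), 4):
        hh, mm, ss = (int(p) for p in parts[i:i + 3])
        if now >= hh * 3600 + mm * 60 + ss:
            shown.append(parts[i + 3])
    return "".join(shown)
```

For a cue "Never <00:01:30>gonna <00:01:31>give" that starts at 89 seconds,
the text grows from "Never " to "Never gonna " to the full line as playback
passes each timestamp.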

>>> Similarly, I think that the WebSRT parser should be designed to ignore
>>> things that it doesn't recognize, in particular unknown voices (if we
>>> keep
>>> those). Requiring parsers to fail when the version number is increased
>> oh, you misunderstood me: I am not saying that parsers have to fail - it's
>> good if they don't. But I am saying that if we make a change to the
>> specification that is not backwards compatible with the previous one and
>> will thus invariably break parsers, we have to notify parsers somehow such
>> that if they get parse errors they can e.g. notify the user that this is a
>> new version of the WebSRT format which their software doesn't support yet.
> A browser won't bother their users by saying "hey, there was something in
> this page I didn't understand", as users won't know what to do to fix it.

I'm not overly worried about browsers. They will just display the wrong
text. They are not normally an authoring or transcoding application. I am
more worried about non-browser applications here, in particular those where
interpreting the text the wrong way will lead to disaster, such as the wrong
data in an archive etc.

>  Think for example about the case where we had a requirement that a double
>> newline starts a new cue, but now we want to introduce a means where the
>> double newline is escaped and can be made part of a cue.
>> Other formats keep track of their version, such as MS Word files. It is to
>> be hoped that most new features can be introduced without breaking
>> backwards
>> compatibility and we can write the parsing requirements such that certain
>> things will be ignored, but in and of itself, WebSRT doesn't provide for
>> this extensibility. Right now, there is for example extensibility with the
>> "WebSRT settings parsing" (that's the stuff behind the timestamps) where
>> further "setting:value" settings can be introduced. But for example the
>> introduction of new "cue identifiers" (that's the <> marker at the start
>> of
>> a cue) would be difficult without a version string, since anything that
>> doesn't match the given list will just be parsed as cue-internal tag and
>> thus end up as part of the cue text where plain text parsing is used.
> The bug I filed suggested allowing arbitrary voices, to simplify the parser
> and to make future extensions possible. For a web format I think this is a
> better approach than versioning. I haven't done a full review of the
> parser, but there are probably more places where it could be more forgiving
> so as to allow future tweaking.

That's a good approach and will reduce the need for breaking backwards
compatibility. In an XML-based format that need is zero, while in a text
format with an ad-hoc structure it can never be reduced to zero. That's
what I am concerned about and that's why I think we need a version
identifier. If we end up never changing the version identifier, so much the
better. But I'd much rather we have it now and can identify which
specification a file adheres to than be unable to do so later.
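
A small Python sketch of how the two ideas could coexist: settings parsing
that ignores what it does not recognize, plus a version line that lets a
non-browser tool warn about files newer than it supports. Everything named
here is hypothetical: WebSRT has no "WEBSRT <major>.<minor>" header, and the
setting names "align" and "size" are placeholders.

```python
SUPPORTED_VERSION = (1, 0)          # hypothetical version this tool implements
KNOWN_SETTINGS = {"align", "size"}  # placeholder setting names


def check_version(header_line: str):
    """Parse a hypothetical 'WEBSRT <major>.<minor>' header line."""
    fields = header_line.strip().split()
    if len(fields) != 2 or fields[0] != "WEBSRT":
        return None  # no version header: treat the file as unversioned
    major, _, minor = fields[1].partition(".")
    if not (major.isdigit() and minor.isdigit()):
        return None
    return (int(major), int(minor))


def needs_warning(version) -> bool:
    """True if the file declares a newer major version than we support."""
    return version is not None and version[0] > SUPPORTED_VERSION[0]


def parse_settings(settings: str):
    """Split 'name:value' tokens, keeping known ones and skipping the rest."""
    known, ignored = {}, []
    for token in settings.split():
        name, sep, value = token.partition(":")
        if sep and name in KNOWN_SETTINGS:
            known[name] = value
        else:
            ignored.append(token)  # unknown settings do not abort parsing
    return known, ignored
```

An archiving or transcoding tool could call needs_warning() once per file and
still parse forgivingly, so an unknown setting is skipped but a format change
it cannot interpret is at least reported.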

>  On the other hand, keeping the same extension and (unregistered) MIME type
>>> as SRT has plenty of benefits, such as immediately being able to use
>>> existing SRT files in browsers without changing their file extension or
>>> MIME
>>> type.
>> There is no harm for browsers to accept both MIME types if they are sure
>> they can parse old srt as well as new websrt. But these two formats are
>> different enough that they should be given a different extension and mime
>> type. I do not see a single advantage in stealing the MIME type of an
>> existing format for a new specification.
> But there's no spec for the old SRT, the only thing one could do is parse
> it with a WebSRT parser.

I can write that spec in an afternoon and register the MIME type with IANA.
That really isn't a problem. People have managed to write correct SRT files
without having a spec, because it's so trivial. Creating a spec is just a
formality. For now, the Wikipedia page really is sufficient.

> That would make text/srt and text/websrt synonymous, which is kind of
> pointless.

No, it's only pointless if you are a browser vendor. For everyone else it is
a huge advantage to be able to choose between a guaranteed simple format and
a complex format with all the bells and whistles.

> The advantages of taking text/srt is that all existing software to create
> SRT can be used to create WebSRT

That's not strictly true. If they load a WebSRT file that was created by
some other software for further editing and that WebSRT file uses advanced
WebSRT functionality, the authoring software will break.

> and servers that already send text/srt don't need to be updated. In either
> case I think we should support only one mime type.

What's the harm in supporting two mime types but using the same parser to
parse them?
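
In code, the idea amounts to a lookup table with two keys pointing at the
same function. This Python sketch assumes the two type names used in this
thread (neither is registered) and uses a placeholder parser that only
splits on blank lines; it is not a real WebSRT parser.

```python
def parse_websrt(data: str) -> list:
    """Placeholder parser: split the file into raw cue blocks."""
    normalized = data.replace("\r\n", "\n")
    return [block for block in normalized.split("\n\n") if block.strip()]


# Both type names map to the same parser (names as used in this thread).
PARSERS = {
    "text/srt": parse_websrt,
    "text/websrt": parse_websrt,
}


def parse_by_mime(mime_type: str, data: str) -> list:
    """Pick a parser from the MIME type, ignoring any parameters."""
    # Drop parameters such as '; charset=utf-8' before the lookup.
    essence = mime_type.split(";", 1)[0].strip().lower()
    if essence not in PARSERS:
        raise ValueError(f"unsupported MIME type: {essence}")
    return PARSERS[essence](data)
```

Applications that only want the simple format can still key their own
behaviour on the type they were served, while a browser treats both alike.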

>   * there is no definition of the "canvas" dimensions that the cues are
>>>>>  prepared for (width/height) and expected to work with other than
>>>>>> saying
>>>>>> it
>>>>>> is the video dimensions - but these can change and the proportions
>>>>>> should
>>>>>> be
>>>>>> changed with that
>>>>>>  I'm not sure what you're saying here. Should the subtitle file be
>>>>> hard-coded to a particular size? In the quite peculiar case where the
>>>>> same
>>>>> subtitles really don't work at two different resolutions, couldn't we
>>>>> just
>>>>> have two files? In what cases would this be needed?
>>>> Most subtitles will be created with a specific width and height in mind.
>>>> For
>>>> example, the width in characters relies on the video canvas having at
>>>> least
>>>> that size and the number of lines used usually refers to a lower third
>>>> of
>>>> a
>>>> video - where that is too small, it might cover the whole video. So, my
>>>> proposal is not to hard-code the subtitles to a particular size, but to
>>>> put
>>>> the minimum width and height that are being used for the creation of the
>>>> subtitles into the file. Then, the file can be scaled below or above
>>>> this
>>>> size to adjust to the actual available space.
>>> In practice, does this mean scaling font-size by
>>> width_actual/width_intended or similar? Personally, I prefer subtitles to
>>> be
>>> something like 20 screen pixels regardless of video size, as that is
>>> readable. Making them bigger hides more of the video, while making them
>>> smaller makes them hard to read. But I guess we could let the CSS media
>>> query min-width and similar be evaluated against the size of the
>>> containing
>>> video element, to make it possible anyway.
>> Have you ever tried to keep the small font size of subtitles on a 320x240
>> video when going full-screen? They are almost unusable at that size.
>> YouTube
>> doesn't do a good job at that, incidentally, so you can go check it out
>> there - go full-screen and see how tiny the captions become then step back
>> from your screen to where you'd want to watch the video from and notice
>> how
>> the captions are basically unreadable.
>> When you scale the font-size with the video, you do not hide more of the
>> video - you hide the exact same part of the video. Video and font get
>> larger
>> in the same way. And that's exactly the need that we have.
> Existing media players have basically two different ways of handling this.
> The kind you're describing is like MPlayer, where subtitles appear to
> actually be rendered on to the video frames and then scaled together with
> the video. The kind I've used more is like Totem, where subtitles are
> rendered in a separate layer at a fixed size in pixels, regardless of
> whether or not you're watching in fullscreen. This means that word wrapping
> will be different depending on screen size.

In the Totem case, does the font size increase with a change in screen size?

My suggestion is to have them in different layers, but with knowledge
about the intended anchoring, i.e. where the text is supposed to appear on
the video screen. Then keep that anchoring intact no matter what the video

> I find both MPlayer's and Totem's behavior annoying in some situations, but
> personally prefer Totem most of the time.

Do you find MPlayer's behavior annoying because by rescaling already
rendered text, the text loses resolution and becomes less readable? This is
definitely not the behaviour I am after.
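
What I have in mind can be sketched in Python as follows: the subtitle file
carries the dimensions it was authored for, and the renderer derives a font
size and anchor position for the actual video size, then draws the text
fresh in its own layer at that size instead of rescaling rendered pixels.
The function, its parameters, and the fractional anchor representation are
all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Layout:
    font_px: float  # font size to render text at, in screen pixels
    x: float        # anchor position in screen pixels
    y: float


def layout_cue(intended_w: float, intended_h: float,
               actual_w: float, actual_h: float,
               base_font_px: float,
               anchor_x_frac: float, anchor_y_frac: float) -> Layout:
    """Scale font size with the video and keep the anchor at the same spot."""
    # Use the smaller ratio so text never outgrows either dimension.
    scale = min(actual_w / intended_w, actual_h / intended_h)
    return Layout(
        font_px=base_font_px * scale,
        x=anchor_x_frac * actual_w,
        y=anchor_y_frac * actual_h,
    )
```

For subtitles authored for 320x240 with a 12 pixel font, going to 1280x960
gives a 48 pixel font anchored at the same relative position, so the text
covers the same share of the picture but is rendered at full resolution.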
