[whatwg] Discussing WebSRT and alternatives/improvements

Tue Aug 24 07:21:28 PDT 2010

On Aug 5, 2010, at 18:01, Silvia Pfeiffer wrote:

> I developed WMML as a xml-based caption format that will not have the problems that have been pointed out for DFXP/TTML, namely: there are no namespaces, it doesn't use XSL-FO but instead fully reuses CSS, and it supports innerHTML markup in the cues instead of inventing its own markup. Check out the examples at https://wiki.mozilla.org/Accessibility/Video_Text_Format .

The wiki page says it's *not* an XML format.

The wiki page says the format reuses the HTML fragment parsing algorithm. It also asserts that a "WMML parser will only consist of a small amount of new parsing code", but the document fails to explain what code would tokenize (including complicated stuff like attributes) or otherwise handle the parts that wouldn't use the HTML fragment parsing algorithm, or which parts those are.

The proposal can't be properly evaluated without a precise description of the processing model. (I would prefer not to jump to evaluating the potential processing models that we've discussed off-list to avoid the appearance of me constructing straw men.)

> * a @profile attribute which specifies the format used in the cues and thus the parser that should be chosen, including "plainText", "minimalMarkup", "innerHTML", "JSON", "any" (other formats can be developed)

That looks like excessive complexity without solidly documented need or processing model.

> WMML doesn't have a <body> element, but instead has a <cuelist> element. It was important not to reuse <body> in order to allow only <cue> elements inside the main part of the WMML resource.

Why couldn't the <cuelist> element be called <body> (and the <wmml> element <html>) with a conformance rule that <body> is only permitted to contain <cue> elements (that could be spelled <div>...)? Then you could use the HTML parsing algorithm on the whole file (if only the standalone case is supported--not muxing into a media file). Even if you had an element called <cuelist>, you'd need to define error handling for the case where there are non-<cue> children.

If you made those respellings, we'd be back to Timed Divs, but see http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2010-July/027283.html for my argument against that proposal (and logically any proposal that's effectively the same with trivial renaming).

> This makes it a document that can also easily be encapsulated in binary media resources such as WebM, Ogg or MPEG-4 because each cue is essentially a "codec data page" associated with a given timeline, while anything in the root and head element are "codec headers". In this way, the hierarchical document structure is easily flattened.

As discussed in http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2010-July/027283.html , muxing into a media resource places constraints on the processing model. For sanity, the rendering-time processing model should be the same for the muxed case and the standalone case. The logical consequences of the constraints imposed by muxing would make the necessary processing model violate the author expectations (that a markup document is available as a full document tree for selector matching) suggested by the WMML syntax in the standalone case.

> 2. There is a natural mapping of WebSRT into in-band text tracks.
> Each cue naturally maps into a encoding page (just like a WMML cue does, too). But in WebSRT, because the setup information is not brought in a hierarchical element surrounding all cues, it is easier to just chuck anything that comes before the first cue into an encoding header page. For WMML, this problem can be solved, but it is less natural.

Worse, the effects of the "less natural" part violate reasonable author expectations of what the tree that participates in selector matching is.

> 3. I am not too sure, but the "voice" markup may be useful.
> At this point I do wonder whether it has any further use than a @class attribute has in normal markup, but the idea of providing some semantic information about the content in cues is interesting.

I'm rather unconvinced by the voice markup as well. As far as I can tell, the voice markup is syntactic sugar for class for practical purposes. (I don't give much value to arguments that voices are more semantic than classes if the practical purpose is to achieve visual effects for caption rendering.) Common translation subtitling works just fine without voice identification and (based on information in this thread) the original .srt doesn't have voices.

If voices are really needed for captioning use cases, I think it makes sense to balance the rarity of that need within the captioning sphere with the complexity of introducing syntactic sugar over the class attribute and the class selector. Do all captions use voice identification? Many? Only some? If only some captions (compared to the set of all captions--not to the set of captions plus subtitles) use voice identification, perhaps class is good enough. If nearly all captions use voice identification, sugaring might have merit.

> * there is no possibility to add file-wide metadata to WebSRT; things about authoring and usage rights as well as information about the media resource that the file relates to should be kept within the file. Almost all subtitle and caption format have the possibility for such metadata and we know from image, music and video resources how important it is to have the ability to keep such metadata inside the resource.

Generic metadata mechanisms are a slippery slope into a rathole, so it would be good not to go there. (One minute you have something like Dublin Core that looks simple enough and the next minute you have RDF, ontologies, etc., etc.)

> * there is no means to identify which parser is required in the cues (is it "plain text", "minimal markup", or "anything"?) and therefore it is not possible for an application to know how it should parse the cues.

I think it would be reasonable to always use the HTML fragment parsing algorithm for cues and require authors who just want something plain text-ish to escape < and &.
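
To illustrate how small the authoring burden is, here is a minimal sketch of the escaping a plain-text-ish authoring tool would need to do before the cue text goes through the HTML fragment parsing algorithm. The function name escape_cue_text is made up for this example and is not part of any proposal.

```python
def escape_cue_text(text: str) -> str:
    """Escape '&' and '<' so plain text survives HTML parsing unchanged.

    Hypothetical helper for illustration only.
    """
    # Replace '&' first; otherwise the '&' in the '&lt;' produced for '<'
    # would itself be escaped a second time.
    return text.replace("&", "&amp;").replace("<", "&lt;")


print(escape_cue_text("Tom & Jerry: 1 < 2"))
```

Nothing else in the cue text needs touching for it to round-trip as text.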

> * there is no version number on the format, thus it will be difficult to introduce future changes.

Version indicators in Web formats are an anti-pattern. Others have already pointed to the HTML WG versioning poll, etc.

> In fact, the subtitling community itself has already expressed their objections to building an extension of SRT, see http://forum.doom9.org/showthread.php?p=1396576 , so we shouldn't try to enforce something that those for whom it was done don't want. A clean slate will be better for all.

That's how we got RSS 2.0 *and* Atom. In retrospect, I think the feed community would have been better off if the group that did Atom (I was part of it) had extended RSS 2.0 over the objections of its original caretaker instead of creating yet another format.

Since SRT isn't currently deployed to the class of consumers (browsers) that WebSRT is intended for, the situation isn't really analogous with RSS/Atom, but I'm still rather unsympathetic to the above argument as a reason why SRT couldn't be used as the basis.

On Aug 9, 2010, at 17:04, Philip Jägenstedt wrote:

> I guess this is in support of Henri's proposal of parsing the cue using the HTML fragment parser (same as innerHTML)? That would be easy to implement, but how do we then mark up speakers? Using <span class="narrator"></span> around each cue is very verbose. HTML isn't very good for marking up dialog, which is quite a limitation when dealing with subtitles...

How often do captions distinguish two or more speakers in the same cue by styling them differently? In my experience, translation subtitles for TV, DVDs and theatrical movies virtually never do (but it's assumed that the reader of the subtitles can work out who is talking from the sound track, so I can see why this might not generalize to captioning for the deaf).

> Similarly, I think that the WebSRT parser should be designed to ignore things that it doesn't recognize,

I agree. Reusing the HTML fragment parsing algorithm would provide this for stuff within the cue text "for free".

On Aug 10, 2010, at 12:49, Philip Jägenstedt wrote:

> An HTML parser parsing <1> or <00:01:30> will produce text nodes "<1>" and "<00:01:30>". Without having read the HTML parsing algorithm I guess that elements need to begin with a letter or similar. So, it's not possible to (ab)use the HTML parser to handle inner timestamps of numerical voices, we'd have to replace those with something else, probably more verbose.

Given that voices (according to this thread; I didn't check) are a Hixie invention rather than an original feature of SRT, the <1> syntax doesn't have to be that way for compat. Instead, something that works in HTML without parser modifications could be used.
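
For reference, the relevant rule in the HTML tokenizer is that "<" only starts a tag when it is followed by an ASCII letter (or by "/" and then an ASCII letter for an end tag). The sketch below approximates just that one check; it deliberately ignores comments, doctypes and everything else the tokenizer does, and the name opens_tag is made up for this example.

```python
import re

# "<" followed by an ASCII letter opens a start tag; "</" followed by an
# ASCII letter opens an end tag. This is a simplification: "<!" and "<?"
# are handled differently by a real HTML tokenizer and are not modeled here.
_TAG_OPEN = re.compile(r"</?[A-Za-z]")


def opens_tag(s: str) -> bool:
    """Return True if s begins with something tokenized as a tag."""
    return _TAG_OPEN.match(s) is not None


for candidate in ["<1>", "<00:01:30>", "<span>", "</span>"]:
    print(candidate, opens_tag(candidate))
```

So any replacement syntax for voices only needs to start with a letter to work with an unmodified parser.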

As for <00:01:30>, normal subtitles and, AFAIK, normal captions don't need time-based revelation of parts of the cue. (I'm not considering anime fansubbing "normal"; I mean TV, DVD and movie theater subtitles. Also, continuous revelation of live captioning isn't relevant to the <00:01:30> feature.) Since the <00:01:30> isn't essential for making the feature set of HTML5 accessible, I think the <00:01:30> feature for karaoke and anime fansubbing should be considered for removal if it happens to have any undesirable properties--and not working without HTML parser modifications is such an undesirable property.

I'd be OK with not supporting karaoke or anime fansubbing at all declaratively (requiring those use cases to be addressed in JavaScript) or with using more verbose syntax like <t t=00:01:30>...</t>.
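
As a rough sketch of what the more verbose syntax would mean for existing cue text, the following rewrites inline <00:01:30>-style timestamps into <t t=...>...</t> wrappers. It assumes timestamps are exactly HH:MM:SS as written in this thread and that each timestamp applies to the text up to the next timestamp or the end of the cue; both are assumptions of mine, and to_verbose is a made-up name.

```python
import re

# Assumed inline timestamp shape: <HH:MM:SS>, as in the examples in this thread.
_INLINE_TS = re.compile(r"<(\d{2}:\d{2}:\d{2})>")


def to_verbose(cue: str) -> str:
    """Wrap the text following each inline timestamp in <t t=...>...</t>."""
    # With one capture group, split() yields:
    # [text_before, ts1, text1, ts2, text2, ...]
    parts = _INLINE_TS.split(cue)
    out = [parts[0]]
    for ts, text in zip(parts[1::2], parts[2::2]):
        out.append(f"<t t={ts}>{text}</t>")
    return "".join(out)


print(to_verbose("La <00:01:30>la <00:01:31>la"))
```

The result uses only an element name that starts with a letter plus an unquoted attribute, so it parses without HTML parser modifications.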

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/