[whatwg] default audio upload format (was Fwd: The Media Capture API Working Draft)

Roger Hågensen rescator at emsai.net
Fri Sep 3 21:59:27 PDT 2010

  On 2010-09-04 01:55, James Salsman wrote:
> Most of the MIME types that support multiple channels and sample rates
> have registered parameters for selecting those.  Using a PCM format
> such as audio/L16 (CD/Red Book audio) as a default would waste a huge
> amount of network bandwidth, which translates directly into money for
> some users.
> On Fri, Sep 3, 2010 at 2:19 PM, David Singer<singer at apple.com>  wrote:
>> I agree that if the server says it accepts something, then it should cover at least the obvious bases, and transcoding at the server side is not very hard.  However, I do think that there needs to be some way to protect the server (and user, in fact) from mistakes etc.  If the server was hoping for up to 10 seconds of 8 kHz mono voice to use as a security voice-print, and the UA doesn't cut off at 10 seconds, records at 48 kHz stereo, and the user forgets to hit 'stop', quite a few systems might be surprised by (and maybe charge for) the size of the resulting file.
>> It's also a pain at the server to have to sample-rate convert, downsample to mono, and so on, if the terminal could do it.

Here's an idea. Almost all codecs currently use a quality system, where 
quality is indicated by a range from 0.0 to 1.0 (a few might go from 
-1.0 to 1.0; a tuned Ogg Vorbis has a small negative range).
Anyway: 1.0 would indicate max quality (lossless or lossy) and 0.5 
would indicate 50% quality.
This is similar to what most of the encoders support (usually via a -q 
switch).

So if the server asks for, let's say, FLAC at quality 0.0, that would 
mean "compress the hell out of it", vs 1.0 which would mean a fast encoding.
For a lossy codec like Ogg Vorbis, a quality of 1.0 would mean retain as 
much of the original audio as possible, while 0.0 would mean toss away 
as much as possible.
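For instance, mapping the normalized scale onto a real encoder dial could 
look like this sketch (the linear mapping and function name are my own 
illustrative assumptions; only Vorbis's roughly -1.0 to 10.0 -q range is real):

```python
def vorbis_q(quality: float) -> float:
    """Map a normalized 0.0-1.0 quality onto Ogg Vorbis's -q scale.

    Vorbis encoders accept roughly -1.0 (smallest file) to 10.0 (best
    quality); the linear mapping here is an illustrative assumption,
    not a spec-defined conversion.
    """
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality must be within 0.0-1.0")
    return -1.0 + quality * 11.0
```

So 1.0 lands on the encoder's maximum and 0.0 on its minimum, whatever 
the codec's native range happens to be.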

Combine this with min and max bitrate values etc. and a browser could 
be told that the server wants:
"Give me audio in format xyz with medium quality (and medium CPU use as 
well, I guess), between 100 kbit and 200 kbit, in stereo at 48 kHz, 
between 10 seconds and 2 minutes long."

Obviously with lossless formats the bitrate and quality mean nothing, 
but a low quality value could indicate using the highest compression 
level.
I guess, additionally, a browser could present a UI if no max duration 
was indicated and ask the user to choose a sensible one. (Maybe the 
standard could define a max length if none was negotiated, as an extra 
safety net?)
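A browser-side safety net along those lines might be as simple as this 
sketch (the 600-second cap is an invented placeholder, not a value this 
post proposes):

```python
def effective_max_duration(negotiated_max: int) -> int:
    """Apply a fallback cap when the server negotiated no maximum.

    Per the convention used later in this post, 0 means "no maximum";
    the 600-second default is a placeholder, not a proposed value.
    """
    SAFETY_NET_SECONDS = 600
    return SAFETY_NET_SECONDS if negotiated_max == 0 else negotiated_max
```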

Oh, and for a lossless codec like FLAC there is usually a compression 
level: the higher it is, the more CPU/resources are used to compress 
more tightly.
So a quality indicator only makes sense for lossy codecs, while both 
lossy and lossless should be mappable to a compression level indicator.
But I think that having both quality and compression indicators might be 
best, as many lossy codecs allow setting quality and compression level 
(plus bitrate range).
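A sketch of how the two indicators could feed an encoder configuration, 
assuming quality applies only to lossy codecs and compression level to 
both (the function, the codec list, and the 0-8 level range modeled on 
FLAC are all my assumptions):

```python
def encoder_settings(codec: str, quality: float, compression: float) -> dict:
    """Turn normalized quality/compression (0.0-1.0) into encoder settings.

    Lossless codecs ignore `quality` (their output is bit-identical
    regardless); both kinds get an effort level, here mapped onto a
    FLAC-style 0-8 scale where 0 is fastest and 8 is tightest.
    """
    lossless = codec in {"flac", "alac", "wavpack"}
    settings = {"codec": codec,
                "level": round(compression * 8)}
    if not lossless:
        settings["quality"] = quality
    return settings
```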

Hmm, has anything similar been discussed for video and image capture as well?
If not, then I think it's best to make sure that audio/image/video 
capture uses the exact same indicators to avoid confusion:

Bits/s: Min/max bitrate is applicable to (mostly lossy, rarely lossless) 
audio, video, video (w/audio), images.
%: Compression level is applicable to (lossy and lossless) audio, 
video, video (w/audio), images.
Seconds: Min/max duration is applicable to (lossy and lossless) audio, 
video, video (w/audio).
Hz: Frequency is applicable to (lossy and lossless) audio, 
video (w/audio).
Bits: Color or sample depth is applicable to (lossy and lossless) audio, 
video, video (w/audio), images.
Chn: Channels are applicable to (lossy and lossless) audio, video (w/audio).
WxH: Width/height is applicable to (lossy and lossless) video, video 
(w/audio), images.
FPS: Framerate is applicable to (lossy and lossless) video, video (w/audio).
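The full set of indicators could be collected into one record per media 
kind, along these lines (all field names are illustrative; 0 or an empty 
list means "no preference", matching the conventions used in this post):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class CaptureConstraints:
    """One record mirroring the indicator list above (names are mine)."""
    bitrate: Tuple[int, int] = (0, 0)        # Bits/s min/max, 0 = unbounded
    compression: Tuple[int, int] = (0, 100)  # % range
    duration: Tuple[int, int] = (0, 0)       # seconds min/max, 0 = unbounded
    hz: List[int] = field(default_factory=list)        # accepted sample rates
    bits: List[int] = field(default_factory=list)      # color/sample depths
    channels: List[int] = field(default_factory=list)
    resolution: List[Tuple[int, int]] = field(default_factory=list)  # WxH
    fps: List[int] = field(default_factory=list)
```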

Bits/s = 0-??????? where 0 indicates no minimum for the Min value and no 
maximum for the Max value; otherwise the value indicates the desired 
bitrate in bits per second.
% = 0-100 where 100% is max compression if lossless or least quality if 
lossy, and 0% is no compression if lossless or max quality if lossy.
Seconds: 0-??????? where 0 indicates no minimum duration for the Min 
value and no maximum for the Max value; otherwise it's a number 
indicating the Min and Max range the server allows/expects.
Hz: 0-??????? where 0 indicates that anything is acceptable; otherwise 
the frequency expected.
Bits: 0-??????? where 0 indicates no preference; otherwise the desired 
bit depth (color depth for image/video, sample depth for audio).
Chn: 0-??????? where 0 indicates no preference; otherwise the desired 
number of channels.
WxH: 0-??????? where 0 indicates no preference; otherwise the desired 
width and height.
FPS: 0-??????? where 0 indicates no preference; otherwise the desired 
framerate.
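The value grammar these definitions imply (space-separated entries, "a-b" 
ranges, bare numbers as preferred values, 0 as the "no preference" 
sentinel) could be parsed along these lines for the scalar attributes 
like hz, chn, bits, and fps; WxH pairs would need an extra split on "x". 
The function name and return shape are my own:

```python
def parse_spec(spec: str):
    """Parse a constraint value like "2 1-2" or "24 50 60 10-60".

    Returns (preferred, ranges): bare numbers are accepted values in
    order of preference; "a-b" tokens are accepted (min, max) ranges;
    a bare 0 means "no preference at all". Grammar is inferred from
    the examples in this post; scalar attributes only.
    """
    preferred, ranges = [], []
    for token in spec.split():
        if "-" in token:
            lo, hi = token.split("-", 1)
            ranges.append((int(lo), int(hi)))
        elif token == "0":
            return [], []  # 0 sentinel: everything is acceptable
        else:
            preferred.append(int(token))
    return preferred, ranges
```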

I believe that covers most of them?

Here's an example (of values):
Video (w/audio, and both lossy)
hz="48000 44100"
chn="2 1-2"
bits="16 24 32"
wxh="1920x1080 1280x720 854x480  320x240-1024x768"
fps="24 50 60 10-60"

This means that the stream must be between 500 kbit and 1 mbit, 
video+audio combined;
compression must be between 25% and 75% (thus averaging 50% quality, maybe?);
there is no minimum length, but it must not be longer than 3 minutes;
the frequency must be either 44.1 kHz or 48 kHz;
only mono or stereo is allowed, and stereo is desired if possible;
16-, 24- or 32-bit audio (lossy codecs like MP3 work in floating point, 
so in that case the bit depth really does not matter that much);
any resolution from 320x240 up to 1024x768 is accepted, but if 
possible 480, 720 or 1080 is desired (widescreen implied by the 
explicit ratios);
24-bit color is desired if possible;
any framerate from 10 to 60 fps is accepted, but if possible 24, 50 or 
60 fps is desired.

This should give the browser enough info to pass on to the video and 
audio encoders, or enough info to calculate the details the encoders need.
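For instance, choosing a framerate the way the fps example implies — the 
first listed value the device supports wins, otherwise anything inside an 
accepted range — might look like this sketch (function and parameter 
names are mine, not from any spec):

```python
def pick(supported, preferred, ranges):
    """Pick an encoder setting from a server constraint.

    `preferred` lists explicitly desired values in order of preference;
    `ranges` lists (min, max) pairs of merely acceptable values.
    Returns the first preferred value the device supports, else any
    supported value inside an accepted range, else None.
    """
    available = set(supported)
    for value in preferred:
        if value in available:
            return value
    for value in supported:
        if any(lo <= value <= hi for lo, hi in ranges):
            return value
    return None
```

With fps="24 50 60 10-60", a device that can do 30 or 60 fps would get 
60 (the first preferred value it supports), and a 30-fps-only device 
would fall back to 30 via the 10-60 range.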

A few more examples:
Audio (lossless)
hz="22050 8192 11025 44100 48000"
chn="1 1-2"

Image (lossy)
depth="24 0-32"

Audio (lossy)
hz="48000 44100"
chn="1 2"

Hopefully it's all self-explanatory, but let me point out that in the 
last audio example the compression indicates max quality, while the 
stream would be constrained to within 128-192 kbit per second.
Also, if you look at the two audio examples, "1 2" and "1 1-2" 
essentially mean the same thing: 1 to 2 channels accepted, but 1 channel 
preferred.
I think that something similar to the existing HTTP Accept header could 
be used as the basis for this.
The worst part, I guess, is coming up with short but sensible value 
names, and deciding what types of values would be acceptable, the order 
of preference, the default behavior if missing, etc.
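To make that concrete, an Accept-style serialization of the lossy audio 
example above might look something like this (the header name and the 
parameter syntax are entirely hypothetical; nothing here is registered 
or specified anywhere):

```
Accept-Capture: audio/ogg; hz="48000 44100"; chn="1 2";
                bits-s="128000-192000"; seconds="0-180"
```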

One thing is for sure: video, video (w/audio), audio, and images all 
need a unified way to keep people from going insane.
Heck, I just ate up an hour and a half thinking up the stuff above and 
writing this post, so you can see my thoughts kind of changing on some 
things.
(The beginning of this post is actually an hour and a half old, so this 
is me descending into madness in the middle of the night.) *rubs eyes*

Heh! Hopefully this all made sense to you all though... And that you all 
understand that if this is to be done, it really needs to be done 
properly "now" rather than as a really nasty patchwork later on.

Who knows, maybe some day the HTTP Accept header ends up adopting 
some of the ideas above for improved content negotiation with the server 
when fetching media
(an iPod and a gaming PC would have different abilities and support for 
media due to screen sizes, bandwidth and CPU power).


Roger "Rescator" Hågensen.
Freelancer - http://EmSai.net/
