[whatwg] Access to live/raw audio and video stream data from both local and remote sources
roBman at mob-labs.com
Tue Aug 23 23:58:12 PDT 2011
Thanks for your reply and for following up... this all sounds good,
especially the alignment with the Audio WG. I'd love to see some
resolution on the Web Audio vs Audio Data proposals.
By "raw" I was just clumsily referring to being able to access the data
in the tracks of a stream programmatically.
Here's a rough example I created that shows the type of access I mean.
Obviously it uses the non-standard Mozilla Audio Data API.
At the moment you can do something similar with video by pumping it
into a canvas element and then using getImageData... but that's a
clumsy and inefficient pipeline.
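To make that concrete, here's a minimal sketch of that canvas workaround (the function names and the grayscale transform are just placeholders for whatever per-frame processing you actually want to do):

```javascript
// Pure per-frame transform: convert RGBA pixel data (as returned by
// getImageData) to grayscale in place. It only needs an array of bytes,
// so the processing step itself can run and be tested outside a browser.
function toGrayscale(pixels) {
  for (var i = 0; i < pixels.length; i += 4) {
    var lum = Math.round(
      0.299 * pixels[i] + 0.587 * pixels[i + 1] + 0.114 * pixels[i + 2]
    );
    pixels[i] = pixels[i + 1] = pixels[i + 2] = lum; // leave alpha alone
  }
  return pixels;
}

// Browser-only glue: copy each video frame into a canvas purely to get at
// its pixels, transform them, and paint them back. (requestAnimationFrame
// is vendor-prefixed in current browsers.)
function processFrames(video, canvas) {
  var ctx = canvas.getContext("2d");
  function step() {
    ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
    var frame = ctx.getImageData(0, 0, canvas.width, canvas.height);
    toGrayscale(frame.data);
    ctx.putImageData(frame, 0, 0);
    requestAnimationFrame(step);
  }
  step();
}
```

Note the extra copy into and back out of the canvas on every single frame; that's the inefficiency I'm complaining about.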
Ideally we could programmatically reach into the stream data of any of
the video and audio tracks to process it in any number of ways,
including signal analysis, image recognition, dynamic interface
adjustment and so on.
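For example, once you can actually see the samples, a live level meter for the signal-analysis case is only a few lines. This sketch assumes the Audio Data API's "MozAudioAvailable" event (Firefox-only and non-standard); the rmsLevel and onLevel names are just mine:

```javascript
// Pure signal-analysis step: root-mean-square level of a buffer of
// float samples in the range [-1, 1].
function rmsLevel(samples) {
  var sum = 0;
  for (var i = 0; i < samples.length; i++) {
    sum += samples[i] * samples[i];
  }
  return samples.length ? Math.sqrt(sum / samples.length) : 0;
}

// Firefox-only glue: a playing <audio> element fires "MozAudioAvailable"
// events whose frameBuffer property is a Float32Array of raw samples.
function meterAudio(audioElement, onLevel) {
  audioElement.addEventListener("MozAudioAvailable", function (event) {
    onLevel(rmsLevel(event.frameBuffer));
  }, false);
}
```

The point is that rmsLevel itself shouldn't care whether the samples came from a local microphone or a remote stream; that consistency is exactly what I'm asking for.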
In fact, I'm presenting at a conference on Friday on just that topic.
I think the Khronos StreamInput work has this type of thing in mind too,
so I wonder if that's also been discussed within the RTC/DAP/Audio WGs?
PS: I am really concerned about being labelled a "cross-posting spam
king", so I apologise for creating any of these cross-posting storms.
However, it really seems like the best way to make sure that this
cross-group discussion is shared with all the relevant groups.
On Wed, 2011-08-24 at 08:39 +0200, Stefan Håkansson LK wrote:
> I'm sorry for the late answer. The W3C DAP and WebRTC chairs have
> discussed this, and come to the following:
> - The WebRTC WG deals with access to live (audio and video) streams, and
> also currently has support for local recording of them in the API
> proposal.
> - DAP has a note about the <device> element in the HTML Media Capture
> draft, but the <device> element has been replaced by "getUserMedia".
> - In the WebRTC charter there are references to DAP regarding device
> exploration and media capturing as that was deemed as in DAP scope at
> the time of writing the WebRTC charter. This has however since been
> resolved, for media streams this will be handled by WebRTC.
> - WebRTC is planning coordination with the Audio WG to ensure alignment
> regarding media streams.
> A question: what do you mean by "raw" audio and video stream data? The
> MediaStreams discussed in WebRTC are more of logical references (which
> you can attach to audio/video elements for rendering, to a
> PeerConnection for streaming to a peer and so on).
> Stefan (for the DAP and WebRTC chairs).
>  http://dev.w3.org/2011/webrtc/editor/webrtc.html
> On 2011-07-27 02:56, Rob Manson wrote:
> > Hi,
> > sorry for posting across multiple groups, but I hope you'll see from my
> > comments below that this is really needed.
> > This is definitely not intended as criticism of any of the work going
> > on. It's intended as constructive feedback that hopefully provides
> > clarification on a key use case and its supporting requirements.
> > "Access to live/raw audio and video stream data from both local
> > and remote sources in a consistent way"
> > I've spent quite a bit of time trying to follow a clear thread of
> > requirements/solutions that provide API access to raw stream data (e.g.
> > audio, video, etc.). But I'm a bit concerned this is falling in the gap
> > between the DAP and RTC WGs. If this is not the case then please point
> > me to the relevant docs and I'll happily get back in my box 8)
> > Here's how the thread seems to flow at the moment based on public
> > documents.
> > On the DAP page the mission states:
> > "the Device APIs and Policy Working Group is to create
> > client-side APIs that enable the development of Web Applications
> > and Web Widgets that interact with device services such as
> > Calendar, Contacts, Camera, etc"
> > So it seems clear that this is the place to start. Further down that
> > page the "HTML Media Capture" and "Media Capture" APIs are listed.
> > HTML Media Capture (camera/microphone interactions through HTML forms)
> > initially seems like a good candidate; however, the intro in the latest
> > PWD clearly states:
> > "Providing streaming access to these capabilities is outside of
> > the scope of this specification."
> > Followed by a NOTE that states:
> > "The Working Group is investigating the opportunity to specify
> > streaming access via the proposed <device> element."
> > The link on the "proposed <device> element" leads to a "no longer
> > maintained" document that then redirects to the top level of the whatwg
> > "current work" page. On that page the most relevant link is the
> > video conferencing and peer-to-peer communication section. More
> > about that further below.
> > So back to the DAP page to explore the other Media Capture API
> > (programmatic access to camera/microphone) and its latest PWD.
> > The abstract states:
> > "This specification defines an Application Programming Interface
> > (API) that provides access to the audio, image and video capture
> > capabilities of the device."
> > And the introduction states:
> > "The Capture API defines a high-level interface for accessing
> > the microphone and camera of a hosting device. It completes the
> > HTML Form Based Media Capturing specification [HTMLMEDIACAPTURE]
> > with a programmatic access to start a parametrized capture
> > process."
> > So it seems clear that this is not related to streams in any way either.
> > The Notes column for this API on the DAP page also states:
> > "Programmatic API that completes the form based approach
> > Need to check if still interest in this
> > How does it relate with the Web RTC Working Group?"
> > Is there an updated position on this?
> > So if you then head over to the WebRTC WG's charter, it states:
> > "...to define client-side APIs to enable Real-Time
> > Communications in Web browsers.
> > These APIs should enable building applications that can be run
> > inside a browser, requiring no extra downloads or plugins, that
> > allow communication between parties using audio, video and
> > supplementary real-time communication, without having to use
> > intervening servers..."
> > So this is clearly focused upon peer-to-peer communication "between"
> > systems and the stream related access is naturally just treated as an
> > ancillary requirement. The scope section then states:
> > "Enabling real-time communications between Web browsers require
> > the following client-side technologies to be available:
> > - API functions to explore device capabilities, e.g. camera,
> > microphone, speakers (currently in scope for the Device APIs &
> > Policy Working Group)
> > - API functions to capture media from local devices (camera and
> > microphone) (currently in scope for the Device APIs & Policy
> > Working Group)
> > - API functions for encoding and other processing of those media
> > streams,
> > - API functions for establishing direct peer-to-peer
> > connections, including firewall/NAT traversal
> > - API functions for decoding and processing (including echo
> > cancelling, stream synchronization and a number of other
> > functions) of those streams at the incoming end,
> > - Delivery to the user of those media streams via local screens
> > and audio output devices (partially covered with HTML5)"
> > So this is where I really start to feel the gap growing. The DAP is
> > pointing to the RTC, unsure whether its Camera/Microphone APIs are
> > being superseded by the work there... and the RTC then points back
> > to say it will be relying on work in the DAP. However, the RTC's
> > Recommended Track Deliverables list does include:
> > "Media Stream Functions, Audio Stream Functions and Video Stream
> > Functions"
> > So then it's back to the whatwg MediaStream and LocalMediaStream current
> > work. Following this through you end up back at the <audio> and
> > <video> media elements with some brief discussion about media data.
> > Currently the only API that I'm aware of that allows live access to the
> > audio data through the <audio> tag is the relatively proprietary Mozilla
> > Audio Data API.
> > And while the video stream data can be accessed by rendering each frame
> > into a canvas 2d graphics context and then using getImageData to extract
> > and manipulate it from there, this seems more like a workaround
> > than an elegantly designed solution.
> > As I said above, this is not intended as a criticism of the work that
> > the DAP WG, WebRTC WG or WHATWG are doing. It's intended as
> > constructive feedback to highlight that the important use case of
> > "Access to live/raw audio and video stream data from both local and
> > remote sources" appears to be falling in the gaps between the groups.
> > From my perspective this is a critical use case for many advanced web
> > apps that will help bring them in line with what's possible in the
> > native single-vendor stack-based apps at the moment (e.g. iPhone &
> > Android). And it's also critical for the advancement of
> > web-standards-based AR applications and other computer vision,
> > hearing and signal processing applications.
> > I understand that a lot of these specifications I've covered are in very
> > formative stages and that requirements and PWDs are just being drafted
> > as I write. And that's exactly why I'm raising this as a single,
> > consolidated perspective that spans all these groups. I hope this goes
> > some way towards "Access to live/raw audio and video stream data from
> > both local and remote sources" being treated as an essential and core
> > use case that binds together the work of all these groups. With a clear
> > vision for this and a little consolidated work I think this will then
> > also open up a wide range of other app opportunities that we haven't
> > even thought of yet. But at the moment it really feels like this is
> > being treated as an assumed requirement, and could end up as a
> > poorly-formed, second-class bundle of semi-related API hooks.
> > For this use case I'd really like these clear requirements to be
> > supported:
> > - access the raw stream data for both audio and video in similar ways
> > - access the raw stream data from both remote and local streams in
> > similar ways
> > - ability to inject new data, or the transformed original data, back
> > into streams and into the presented audio/video elements in a
> > consistent way
> > - all of this be optimised for performance to meet the demands of live
> > signal processing
> > roBman
> > PS: I've also cc'd in the mozilla dev list as I think this directly
> > relates to the current "booting to the web" thread 
> >  http://www.w3.org/2009/dap/
> >  http://www.w3.org/TR/2011/WD-html-media-capture-20110414/#introduction
> >  http://dev.w3.org/html5/html-device/
> >  http://www.whatwg.org/specs/web-apps/current-work/complete/#devices
> >  http://www.whatwg.org/specs/web-apps/current-work/complete/#auto-toc-9
> >  http://www.w3.org/TR/2010/WD-media-capture-api-20100928/
> >  http://www.w3.org/2011/04/webrtc-charter.html
> >  http://www.whatwg.org/specs/web-apps/current-work/complete/video-conferencing-and-peer-to-peer-communication.html#mediastream
> >  http://www.whatwg.org/specs/web-apps/current-work/complete/the-iframe-element.html#media-data
> >  https://wiki.mozilla.org/Audio_Data_API
> >  https://developer.mozilla.org/En/Manipulating_video_using_canvas
> >  http://groups.google.com/group/mozilla.dev.platform/browse_thread/thread/7668a9d46a43e482#