
Programming video in a web world

Mosami provides an API for operating on video streams and connecting them to the browser. So where is the “application” in our programming model?

In the websites you visit every day, you interact with buttons on a web page (“buy”), the buttons send messages to server-side applications, and that server-side application updates a database (quantity[item]--) that synchronizes information across users.

In a real-time communications application like Mosami, that backend isn’t a database anymore, but the structure remains similar: the user sees a webpage from your webserver that combines their video display with any non-video interactions they need (like buttons); those interaction events are processed by a server-side application on your webserver; and that application speaks to a backend when it wants to make changes to other resources. For video, that backend is the Mosami API.

A few things change in the programming style because the video has its own independent channel outside the webpage:

  • Interaction events can come from the video, not just the user’s browser. Here, analyzer modules – DetectMotion, DetectFace, DetectSpeech, and others – watch a video stream and send messages to the server-side application. Your application processes the events and determines what action to take.
  • Changes to the video happen in the cloud instead of the browser. Processing (multiple) video streams is hard work, and out of reach for many client devices. Instead, the browser is a “thin client” that only sends the webcam video up to the cloud and displays the video stream it receives back from the cloud in the page. To make changes to the video, your server application sends messages via the Mosami API to filter and mixer elements that modify and combine streams in different ways (a compressed sketch of this pattern follows the list).
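To make that pattern concrete, here is a compressed sketch of the loop using the Python bindings that appear in the posts below (jlaunch to create pipelines, msg to update them, rx_msg_callback to receive analyzer events). Everything specific in it (the stream names, the threshold, and the exact DetectFace parameters) is a placeholder rather than a definitive implementation; the next section walks through a working version of the same idea.

# Sketch only: turn events from a cloud-side analyzer into changes to the
# mixed video the viewer sees. 'webcam0', 'display0', and the threshold are
# placeholders; DetectFace is assumed to take src/rx_msg_callback like the
# DetectSpeech example later on this page. 'mo' is the Mosami API handle.
FACE_WIDTH_THRESHOLD = 200   # placeholder face size, in pixels

class NearFarSketch:
    def __init__(self, mo):
        self.mo = mo
        # Mixer running in the cloud; the browser just displays its output.
        # (The 'near' and 'far' layouts would be preconfigured here.)
        self.mix = self.mo.jlaunch('LayoutMix', dst='display0')
        # Analyzer watching the webcam stream; its messages become events here
        self.faces = self.mo.jlaunch('DetectFace', src='webcam0',
                                     rx_msg_callback=self.face_cb)

    def face_cb(self, msg, user_params=None):
        # An event from the video itself, not from the user's browser:
        # respond by asking the mixer to change what the viewer sees.
        # (Message-type filtering is omitted in this sketch.)
        layout = 'near' if msg.get('width', 0) > FACE_WIDTH_THRESHOLD else 'far'
        self.mix.msg('switch_layout', layout=layout)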

Interactive displays

Let’s look at how to use Mosami to build an interactive display triggered by the viewer’s actions – a video output of a presenter+slide with two different layouts. When the viewer is far away, it will emphasize the presenter over the slides, but swap focus to the slides for better readability when the viewer leans in for a closer look.

What’s under the hood? Let’s take a look.

Code for this example is from our examples library at https://launchpad.net/mosami/+download in the file nearfar.py

The main thing is that we want our application to respond to the viewer’s position. To do that, we use the DetectFace module, which outputs a stream of messages with the bounding box (x, y, w, h) of the detected face(s), and we attach a callback function to handle those messages. Whenever a new DetectFace message arrives, the callback checks whether the (single) face is big or small compared to a threshold. To make it less jumpy, we accumulate those results and add a slow drift toward “far” if no face is detected.
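The launch itself doesn’t appear in the snippet below, but it presumably looks something like this; the DetectFace parameter names are an assumption, modelled on the jlaunch / rx_msg_callback pattern used for DetectSpeech in the follow-the-speaker post further down this page, and self.state['src'] stands in for the viewer’s webcam stream:

# Assumed launch call: watch the viewer's webcam stream and deliver
# bounding-box messages to the callback (src is a placeholder stream name)
self.state['tracker'] = 0.0
self.state['layout'] = 'far'
self.faces = self.mo.jlaunch('DetectFace', src=self.state['src'],
                             rx_msg_callback=self.face_cb)

Inside the callback itself, the accumulation looks like this: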

if msg['num'] == 1:
    # Exactly one face detected: compare its size to the threshold and
    # nudge the tracker toward "near" (bigger) or "far" (smaller)
    if msg['width'] > self.defaults['face_width_threshold']:
        self.state['tracker'] += 1.0
    else:
        self.state['tracker'] -= 1.0
else:
    # No single face detected: drift slowly toward "far"
    self.state['tracker'] -= 0.05

After that, the callback thresholds the recent activity and chooses a “near” or “far” layout accordingly

# Hysteresis: only switch layouts after sustained evidence, and pull the
# tracker back toward the middle so the layout does not flip straight back
if self.state['tracker'] > 20:
    self.state['layout'] = 'near'
    self.state['tracker'] = 10
elif self.state['tracker'] < -20:
    self.state['layout'] = 'far'
    self.state['tracker'] = -10
self.mix.msg('switch_layout', layout=self.state['layout'])

Finally, we want that ‘switch_layout’ message to change the display. First, we need the basic display elements. For the presenter, we remove the background with CutoutPerson. For the slides, we use ViewPDF to convert the slides into a video stream. Then, we composite them together using LayoutMix, which allows one to preconfigure multiple layouts.
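The wiring for that isn’t reproduced here, but a sketch of it might look like the following. The CutoutPerson and ViewPDF parameter names, the stream names, the PDF path, and the exact DIV geometry are all assumptions, modelled on the add_div calls in the follow-the-speaker post below:

# Sketch: build the display elements and preconfigure two layouts.
# Stream names and the PDF path are placeholders.
self.person = self.mo.jlaunch('CutoutPerson', src=self.state['src'], dst='presenter_video')
self.slides = self.mo.jlaunch('ViewPDF', src='slides.pdf', dst='slides_video')
self.mix = self.mo.jlaunch('LayoutMix', dst=self.state['dst'])

# One DIV per element in each layout: 'far' gives most of the frame to the
# presenter, 'near' gives it to the slides.
for layout, big_div, small_div in [('far', 'presenter', 'slides'),
                                   ('near', 'slides', 'presenter')]:
    self.mix.msg('add_div', layout=layout, div=big_div,
                 placement_algorithm='coords_1xn_grid',
                 xpos=0, ypos=0, width=self.state['width'],
                 height=3*self.state['height']/4, zorder=10)
    self.mix.msg('add_div', layout=layout, div=small_div,
                 placement_algorithm='coords_1xn_grid',
                 xpos=0, ypos=3*self.state['height']/4,
                 width=self.state['width'],
                 height=self.state['height']/4, zorder=10)

# Attach each element's output stream to its DIV
self.mix.msg('add', stream='presenter_video', div='presenter')
self.mix.msg('add', stream='slides_video', div='slides')

With both layouts preconfigured in LayoutMix, the switch_layout message sent from the callback above is all it takes to flip the whole display between the two views.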

What are “video applications”?

Mosami enables you to build interactive video applications and deploy them to your users. What do we mean by that? We mean live (real-time) video communications like in video chat, but instead of the webcam video feed being the whole experience, it’s just one element that can be modified and mashed together with other things right inside the video stream the user sees.

To illustrate that with a few examples:

Remote support – Let a support agent highlight and zoom in on things in the customer’s video to help fix a problem

Social gaming – See live video of the players (with background removed) mixed with game elements like the cards or board in a single integrated view

Livestreaming – Incorporate multiple video feeds (presenter, content, audience) in different layouts. Change layouts based on positions/actions detected inside the video stream

In each case, what makes these experiences unique is the way in which the video streams are opened up, analyzed, and mixed back together on-the-fly. That mix-and-match media processing is precisely what our API makes easy.

First app: Follow the speaker

As I said in the previous post, we’ll use this blog to help show you how to get started building cool new video applications using the Mosami API.  But to help get you started, I’m going to start with a familiar example – mixing multiple video streams with a focus on the active speaker.  This demo uses our Python language bindings, and is based on “convdir_position.py” from our Example Gallery available for download from our Launchpad site https://launchpad.net/mosami.

What the application does is divide the screen into two regions, “top” and “bottom”.  When you speak (the presenter), you move to the top.  When you’re silent (the audience), you move to the bottom.  Simple.  To do that, we make use of two Mosami API functions:  LayoutMix (to create the mixed output) and DetectSpeech (to trigger changes to the layout when someone speaks).  When finished, it looks something like this

How does it work?

In the init method of the demo, we launch a LayoutMix pipeline (which extends the basic Mix functionality to deal with defining regions, resizing on new videos, and more).  The Python binding jlaunch wraps the POST calls necessary to create new pipelines.

self.mix = self.mo.jlaunch('LayoutMix', dst=self.state['dst'], volume=1)

Then we define two regions (called DIVs) in the layout.  The Python binding msg wraps the PUT calls necessary to update an existing pipeline.

# Top half of the frame: the active speaker goes here
self.mix.msg('add_div', layout='intro', div='top', placement_algorithm='coords_1xn_grid', xpos=0, ypos=0, width=self.state['width'], height=self.state['height']/2, zorder=10)
# Bottom half of the frame: everyone who is not speaking
self.mix.msg('add_div', layout='intro', div='bottom', placement_algorithm='coords_1xn_grid', xpos=0, ypos=self.state['height']/2, width=self.state['width'], height=self.state['height']/2, zorder=10)

Each time a video stream is added, we need to do two things.

First, we attach it as an input to the LayoutMix pipeline and place it at the bottom

self.mix.msg('add', stream=stream, div='bottom')

Second, for each input, we launch a DetectSpeech pipeline (sometimes referred to as VAD – Voice Activity Detection)

self.vads[stream] = self.mo.jlaunch('DetectSpeech', src=stream, interval=1000000000, run_length=2*1000000000, rx_msg_callback=self.vad_cb, user_params=stream)

There’s one step still missing – when someone starts/stops speaking, we need to change the layout appropriately.  This is handled with the callback function rx_msg_callback=self.vad_cb that will be invoked whenever the DetectSpeech pipeline creates an event.  Here, we use one callback function for all pipelines, with user_params=stream passed to the callback function to distinguish them.

With that setup, the callback function itself is quite simple – filter on the correct message type, then send a message to the LayoutMix pipeline to move the given stream to the top or bottom DIV.

    def vad_cb(self, msg, user_params=None):
        # user_params carries the stream name this DetectSpeech pipeline watches
        stream = user_params
        if 'moMessageType' in msg or 'type' in msg:
            msg_type = msg['type'] if 'type' in msg else msg['moMessageType']
            if msg_type == 'com.mosami.analyzer.vad.activity':
                # 'above' is true while speech activity is above the threshold
                if msg['above']:
                    self.mix.msg('add', stream=stream, div='top')
                else:
                    self.mix.msg('add', stream=stream, div='bottom')

That’s all there is to it! This is a simple example, but it also illustrates how Mosami gives you direct access to the application behavior you want.

Welcome to Mosami

Welcome to the Mosami blog, where we’ll demo new features, walk through some code examples for using the system, and share other information to help you make exciting video applications leveraging the platform.

To start it off, a little background on what we do and why:  As you can see from our site, we’ve built technology for doing media processing in the cloud that we’re delivering via our API to help people create real-time, interactive video applications.  So why this?  Why now?

When we started looking at real-time video communications a few years ago, what we noticed was a shift underway.  The traditional problem of basic video communications – getting video from point A to point B – was being solved with the appearance of tons of video chat applications and  new tools for connecting real-time video directly to the browser.  

That shift is now opening up opportunities not just to transport the video, but to shape it in real time, the way that TV and movie producers and desktop video editing programs have long done for offline video.  It also opens the door to experiences where multiple users can come together and really interact with the video, by automating those “manual” interactions through algorithms that process the video live.

As we tried to do that ourselves, we realized that chaining those algorithms together on the fly was hard work – getting the right software installed on the right machines, linking them to the video sources, etc – and probably out of reach of all but a few specialists.  At the same time, the handful of us can’t develop all the new applications waiting to be created.

So that’s where we are.  Focused on making it as easy as possible for people to connect their real-time video to sophisticated media processing, so all of you can create the next wave of interactive video applications.