SSI: An Open Source Platform for Social Signal Interpretation

Authors: Johannes Wagner, Florian Lingenfelser, Elisabeth André

Affiliation: Lab for Human Centered Multimedia (HCM), University of Augsburg, Germany

URL: http://hcm-lab.de/

Introduction

Automatic detection and interpretation of social signals carried by voice, gestures, facial expressions, etc. will play a key role for next-generation interfaces, as it paves the way towards more intuitive and natural human-computer interaction. In this article we introduce Social Signal Interpretation (SSI), a framework for the real-time recognition of social signals. SSI supports a large range of sensor devices, filter and feature algorithms, as well as machine learning and pattern recognition tools. It encourages developers to add new components using SSI’s C++ API, but also addresses front-end users by offering an XML interface to build pipelines with a text editor. SSI is freely available under GPL at openssi.net.

Key Features

The Social Signal Interpretation (SSI) framework offers tools to record, analyse and recognize human behaviour in real-time, such as gestures, facial expressions, head nods, and emotional speech. Following a patch-based design, pipelines are set up from autonomous components and allow the parallel and synchronized processing of sensor data from multiple input devices. In particular, SSI supports the machine learning pipeline in its full length and offers a graphical interface that assists users in collecting their own training corpora and obtaining personalized models. In addition to a large set of built-in components, SSI also encourages developers to extend the available tools with new functions. For inexperienced users an easy-to-use XML editor is available to draft and run pipelines without special programming skills. SSI is written in C++ and optimized to run on computer systems with multiple CPUs.

The key features of SSI include:

  • Synchronized reading from multiple sensor devices
  • General filter and feature algorithms, such as image processing, signal filtering, frequency analysis and statistical measurements in real-time
  • Event-based signal processing to combine and interpret high level information, such as gestures, keywords, or emotional user states
  • Pattern recognition and machine learning tools for on-line and off-line processing, including various algorithms for feature selection, clustering and classification
  • Patch-based pipeline design (C++-API or easy-to-use XML editor) and a plug-in system to integrate new components

SSI also includes wrappers for many popular sensor devices and signal processing libraries, such as the e-Health Sensor Shield, the IOM biofeedback system (Wild Divine), Microsoft Kinect, The Eye Tribe, Wii Remote Control, ARTKplus, FFMpeg, OpenCV, WEKA, Torch, DSPFilters, Fubi, Praat, OpenSmile, LibSox, and EmoVoice. To get SSI, please visit our download page.

 

Figure 1: Sketch summarizing the various tasks covered by SSI.

Framework Overview

Social Signal Interpretation (SSI) is an open source project meant to support the development of recognition systems using the live input of multiple sensors [1]. To this end, it offers support for a large variety of filter and feature algorithms to process captured signals, as well as tools to accomplish the full machine learning pipeline. Two types of users are addressed: developers are provided with a C++ API that encourages them to write new components, while front-end users can define recognition pipelines in XML from the available components.

Since social cues are expressed through a variety of channels, such as face, voice, posture, etc., multiple kinds of sensors are required to obtain a complete picture of the interaction. In order to combine information generated by different devices, the raw signal streams need to be synchronized and handled in a coherent way. SSI therefore establishes an architecture that treats diverse signals uniformly, no matter whether the input is a waveform, a heart beat signal, or a video image.

 

Figure 2: Examples of sensor devices SSI supports.

Sensor devices deliver raw signals, which need to undergo a number of processing steps in order to carve out the relevant information and separate it from noisy or irrelevant parts. To this end, SSI comes with a large repertoire of filter and feature algorithms to treat audiovisual and physiological signals. By putting processing blocks in series, developers can quickly build complex processing pipelines without having to care much about implementation details such as buffering and synchronization, which are handled automatically by the framework. Since processing blocks are allocated to separate threads, individual window sizes can be chosen for each processing step.

 

Figure 3: Streams are processed in parallel using tailored window sizes.

Since human communication does not follow the precise mechanisms of a machine, but is tainted with a high amount of variability, uncertainty and ambiguity, robust recognizers have to be built that use probabilistic models to recognize and interpret the observed behaviour. To this end, SSI assembles all tasks of a machine learning pipeline, including pre-processing, feature extraction, and online classification/fusion in real-time. Feature extraction converts a signal chunk into a set of compact features, keeping only the essential information necessary to classify the observed behaviour. Classification, finally, accomplishes a mapping of observed feature vectors onto a set of discrete states or continuous values. Depending on whether the chunks are reduced to a single feature vector or remain a series of variable length, a statistical or a dynamic classification scheme is applied. Examples of both types are included in the SSI framework.
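To make the distinction concrete, the two schemes can be written schematically as follows (an illustrative formalization, not notation taken from the SSI documentation): feature extraction reduces a chunk of N samples with d dimensions to k features, a statistical classifier maps a single feature vector to one of M classes, and a dynamic classifier maps a variable-length series of such vectors to a class.

\[
f : \mathbb{R}^{N \times d} \to \mathbb{R}^{k}, \qquad
h_{\mathrm{stat}} : \mathbb{R}^{k} \to \{c_1,\dots,c_M\}, \qquad
h_{\mathrm{dyn}} : \bigl(\mathbb{R}^{k}\bigr)^{*} \to \{c_1,\dots,c_M\}
\]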

 

Figure 4: Support for statistical and dynamic classification schemes.

To resolve ambiguity in human interaction, information extracted from diverse channels needs to be combined. In SSI, information can be fused at various levels: at data level, e.g. when depth information is enhanced with colour information; at feature level, when the features of two or more channels are put together into a single feature vector; or at decision level, when the probabilities of different recognizers are combined. In the latter cases, the fused information should represent the same moment in time. If this is not possible due to temporal offsets (e.g. a gesture followed by a verbal instruction), fusion has to take place at event level. The preferred level depends on the type of information that is fused.
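As a rough formalization (illustrative only, not notation taken from the SSI documentation), feature-level fusion concatenates the feature vectors of, say, an audio and a video channel, whereas decision-level fusion combines the class probabilities delivered by the individual recognizers, e.g. by a weighted sum:

\[
\vec{x}_{\mathrm{fused}} = [\,\vec{x}_{A};\ \vec{x}_{V}\,], \qquad
P_{\mathrm{fused}}(c) = w_A\,P_A(c) + w_V\,P_V(c), \quad w_A + w_V = 1
\]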

 

Figure 5: Fusion on feature, decision and event level.

XML Pipeline

In this article we will focus on SSI’s XML interface, which allows the definition of pipelines as plain text files. No particular programming skills or development environments are required. To assemble a pipeline, any text editor can be used. However, SSI also ships with an XML editor, which offers special functions and simplifies the task of writing pipelines, e.g. by listing the options and descriptions of a selected component.

 

Figure 6: SSI’s XML editor offers convenient access to available components (left panel). A component’s options can be directly accessed and edited in a special sheet (right panel).

To illustrate how XML pipelines are built in SSI, we will start off with a simple unimodal example. Let’s assume we wish to build an application that converts a sound into a spectrum of frequencies (a so-called spectrogram). This is a typical task in audio processing, as many properties of speech are best studied in the frequency domain. The following pipeline captures sound from a microphone and transforms it into a spectrogram. Both the raw and the transformed signal are finally visualized.

<?xml version="1.0" ?>
<pipeline ssi-v="1">

        <register>  
                <load name="ssiaudio.dll"/>
                <load name="ssisignal.dll"/>
                <load name="ssigraphic.dll" />
        </register>

        <!-- SENSOR -->
        <sensor create="ssi_sensor_Audio" option="audio" scale="true">
                <provider channel="audio" pin="audio"/>
        </sensor>

        <!-- PROCESSING -->
        <transformer create="ssi_feature_Spectrogram" minfreq="100" maxfreq="5100" nbanks="50">
                <input pin="audio" frame="0.01s" delta="0.015s"/>
                <output pin="spect"/>
        </transformer>

        <!-- VISUALIZATION -->
        <consumer create="ssi_consumer_SignalPainter" name="audio" size="10" type="2">
                <input pin="audio" frame="0.02s"/>
        </consumer>
        <consumer create="ssi_consumer_SignalPainter" name="spectrogram" size="10" type="1">
                <input pin="spect" frame="1"/>
        </consumer>

</pipeline>

 

Since SSI uses a plugin system, components are loaded dynamically at runtime. Therefore, we load the required components by including the corresponding DLL files, grouped within the <register> element. In our case, three DLLs will be loaded, namely ssiaudio.dll, which includes components to read from an audio source, ssisignal.dll, which includes generic signal processing algorithms, and ssigraphic.dll, which includes tools for visualizing the raw and processed signals. The remaining part of the pipeline defines which components will be created and in which way they connect with each other.

Sensors, introduced with the keyword <sensor>, define the source of a pipeline. Here, we place a component with the name ssi_sensor_Audio, which encapsulates an audio input. Generally, components offer options, which allow us to alter their behaviour. For instance, setting scale=true tells the audio sensor to deliver the wave signal as floating point values in the range [-1..1]; otherwise we would receive a stream of integers. By setting option=audio we further instruct the component to save its final configuration to a file audio.option.

Next, we have to define which sources of a sensor we want to tap. A sensor offers at least one such channel. To connect a channel we use the <provider> statement, which has two attributes: channel is the unique identifier of the channel and pin defines a freely selectable identifier, which is later used to refer to the corresponding signal stream. Internally, SSI will now create a buffer and constantly write incoming audio samples to it. By default, it will keep the last 10 seconds of the stream. Connected components that read from the buffer will receive a copy of the requested samples.

Components that connect to a stream, apply some processing, and output the result as a new stream are tagged with the keyword <transformer>. The new stream may differ from the input stream in sample rate and type, e.g. a video image can be mapped to a single position value. Or, as in our case, a one-dimensional audio stream at 16 kHz is mapped onto a multidimensional spectral representation with a lower sample rate, a so-called spectrogram. A spectrogram is created by calculating the energy in different frequency bins. The component in SSI that applies this kind of transformation is called ssi_feature_Spectrogram. The options minfreq and maxfreq set the frequency range in Hz ([minfreq…maxfreq]), while nbanks sets the number of bins. Options not set in the pipeline, like the number of coefficients (nfft) and the type of window (wintype) used for the Fourier transformation, will be initialized with their default values (unless indirectly overwritten in an option file loaded via option, as in the case of the audio sensor).
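For illustration, the same transformer with these extra options spelled out explicitly might look as follows; note that the values chosen here for nfft and wintype are made-up placeholders, not SSI’s actual defaults:

<!-- nfft and wintype set explicitly; the values are placeholders, not SSI's defaults -->
<transformer create="ssi_feature_Spectrogram" minfreq="100" maxfreq="5100" nbanks="50" nfft="512" wintype="1">
        <input pin="audio" frame="0.01s" delta="0.015s"/>
        <output pin="spect"/>
</transformer>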

With the tag <input> we specify the input stream for the transformer. Since we want to read raw audio samples, we put audio, i.e. the pin we selected for the audio channel. Now, we need to decide on the size of the blocks in which the input stream is processed. We do this through the attribute frame, which defines the frame hop, i.e. the number of samples the window is shifted after a read operation. Optionally, this window can be extended by a certain number of samples given by the attribute delta. In our case, we choose 0.01s and 0.015s, respectively, i.e. at each loop a block of length 0.025 seconds is retrieved and then shifted by 0.01 seconds. If we assume a sample rate of 16 kHz (the default rate of an audio stream), this converts to a block length of 400 samples (16 kHz * [0.01 s + 0.015 s]) and a frame shift of 160 samples (16 kHz * 0.01 s). In other words, at each calculation step, 400 samples are copied from the buffer and afterwards the read position is increased by 160 samples. Since the output is a single sample with a dimension equal to the number of bins in the spectrogram, the sample rate of the output stream becomes 100 Hz (1/0.01 s). For the new, transformed stream, SSI creates another buffer, to which we assign a new pin spect that we wrap in the <output> element.
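The same arithmetic in compact form, for a sample rate of $f_s = 16\,\mathrm{kHz}$:

\[
\text{window} = f_s \cdot (\text{frame} + \text{delta}) = 16000 \cdot (0.01 + 0.015) = 400\ \text{samples}, \qquad
\text{shift} = f_s \cdot \text{frame} = 16000 \cdot 0.01 = 160\ \text{samples}
\]
\[
f_{\text{out}} = \frac{1}{\text{frame}} = \frac{1}{0.01\,\mathrm{s}} = 100\ \mathrm{Hz}
\]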

Finally, we want to visualize the streams. Components that read from a buffer but do not write back a new stream are tagged with the keyword <consumer>. To draw the current content of a buffer in a graph, we use an instance of ssi_consumer_SignalPainter, which we connect to a stream pin within the <input> tag. To draw both the raw and the transformed stream, we add two of them and connect one to audio and one to spect. The option type allows us to choose an appropriate kind of visualization. In the case of the raw audio we set the frame length to 0.02s, i.e. the plot is updated every 20 milliseconds. In the case of the spectrogram, we set 1 (no trailing s), which sets the update rate to a single frame, i.e. the plot is refreshed with every new sample of the transformed stream.

Now, we are ready to run the pipeline by typing xmlpipe <filename> on the console (or hitting F5 if you use SSI’s XML editor). When running for the first time, a pop-up shows up, so we can select an input source. The choice will be remembered and stored in the file audio.option. The output should be something like:

 

Figure 7: Left top: plot of the raw audio signal. Right bottom: spectrogram, with low-energy bins shown in blue and high-energy bins in red. The console window provides information on the current state of the pipeline.

Sometimes, when pipelines become long, it is clearer to move important options out into a separate file. In the pipeline we mark those parts with $(<key>) and create a new file, which includes statements of the form <key> = <value>. For instance, we could alter the spectrogram to:

<transformer create="ssi_feature_Spectrogram" minfreq="$(minfreq)" maxfreq="$(maxfreq)" nbanks="$(nbanks)">
        <input pin="audio" frame="0.01s" delta="0.015s"/>
        <output pin="spect"/>
</transformer>

and set the actual values in another file (while pipelines should end in .pipeline, config files should end in .pipeline-config):

minfreq = 100 # minimum frequency
maxfreq = 5100 # maximum frequency
nbanks = 100 # $(select{5,10,25,50,100}) number of bins

For convenience, SSI offers a small GUI named xmlpipeui.exe, which lists available options in a table and automatically parses new keys from a pipeline:

 

Figure 8: GUI to manage options and run pipelines with different configurations.

In SSI, the counterpart to streams are events. Unlike streams, which have a continuous nature, events may occur asynchronously and have a definite onset and offset. To demonstrate this feature, we will extend our previous example and add an activity detector to drive the feature extraction, i.e. the spectrogram will be displayed only during times when there is activity in the audio.

To do so, we add two more components: another transformer (AudioActivity), which calculates loudness (method=0) and sets values below some threshold to zero (threshold=0.1), and another consumer (ZeroEventSender), which picks up the result and looks for parts in the signal that are non-zero. If such a part is detected and if it is longer than a second (mindur=1.0), an event is fired. To identify them, events are equipped with an address composed of an event name and a sender name: <event>@<sender>. In our example, the options ename and sname are applied to set the address to activity@audio.

<!-- ACTIVITY DETECTION -->
<transformer create="ssi_feature_AudioActivity" method="0" threshold="0.1">
        <input pin="audio" frame="0.01s" delta="0.015s"/>
        <output pin="activity"/>
</transformer>
<consumer create="ssi_consumer_ZeroEventSender" mindur="1.0" maxdur="5.0" sname="audio" ename="activity">
        <input pin="activity" frame="0.1s"/>
</consumer>

 

We can now change the visualization of the spectrogram from continuous to event-triggered. To do so, we replace the attribute frame by listen=activity@audio. We also set the length of the graph to zero (size=0), which allows it to adapt its length dynamically to the duration of the event.

<consumer create="ssi_consumer_SignalPainter" name="spectrogram" size="0" type="1">
        <input pin="spect" listen="activity@audio" />
</consumer>

 

To display a list of current events in the framework, we also include an instance of the component called EventMonitor. Since it only reacts to events and is neither a consumer nor a transformer, it is wrapped with the keyword <object>. With the <listen> tag we determine which events we want to receive. By setting address=@ and span=10000 we configure the monitor to display any event from the last 10 seconds.

<object create="ssi_listener_EventMonitor" mpos="400,300,400,300">
        <listen address="@" span="10000"/>
</object>

The output of the new pipeline is shown below. Again, it contains continuous plots of the raw audio and the activity signal (left top). In the graph below, the triggered spectrogram is displayed, showing the result for the latest activity event (18.2s – 19.4s). This corresponds to the top entry in the monitor (right bottom), which also lists three previous events. Since activity events do not contain additional meta data, they have a size of 0 bytes. In the case of a classification event, for example, class names with probabilities are attached.

 

Figure 9: In this example activity detection has been added to drive the spectrogram (see graph between raw audio and spectrogram). After a period of activity, an event is fired, which triggers the visualization of the spectrogram. Past events are listed in the window below the console.

Multi-modal Enjoyment Detection

We will now move to a more complex application: multi-modal enjoyment detection. The system we want to focus on has been developed as part of the European FP7 project ILHAIRE (Incorporating Laughter into Human Avatar Interactions: Research and Experiments, see http://www.ilhaire.eu/). It combines the input of two sensors, a microphone and a camera, to predict in real-time the level of enjoyment of a user. In this context, we define enjoyment as an episode of positive emotion, indicated by visual and auditory cues of enjoyment, such as smiles and voiced laughter. The level of enjoyment is determined on the basis of the frequency and intensity of these cues.

 

Figure 10: The more cues of enjoyment a user displays, the higher the output of the system will be.

Training data for tuning the detection models was recorded in several sessions of three to four users having a funny conversation; each session lasted about 1.5 hours. During the recordings each user was equipped with a headset and filmed with a Kinect and an HD camera. To allow the simultaneous recording of four users, the setup included several PCs synchronized over the network. The possibility of keeping pipelines that are distributed over several machines in a network in sync makes it possible to create large multi-modal corpora with multiple users. In this particular case, the raw data captured by SSI summed up to about 4.78 GB per minute, including audio, Kinect body and face tracking, as well as HD video streams.

The following pipeline snippet connects to an audio and a Kinect sensor and stores the captured signals on disk. Note that the audio stream and the Kinect RGB video are muxed into a single file. To this end, the audio is passed as an additional input source wrapped in an <xinput> element. An additional line is added at the top of the file to configure the <framework> to wait for a synchronization signal on port 1234.

<!-- SYNCHRONIZATION -->
<framework sync="true" sport="1234" slisten="true"/>

<!-- AUDIO SENSOR -->
<sensor create="ssi_sensor_Audio" option="audio" scale="true">
        <provider channel="audio" pin="audio"/>
</sensor>

<!-- KINECT SENSOR -->
<sensor create="ssi_sensor_MicrosoftKinect">
        <provider channel="rgb" pin="kinect_rgb"/>
        <provider channel="au" pin="kinect_au"/>
        <provider channel="face" pin="kinect_face"/>
</sensor>
<!-- STORAGE -->
<consumer create="ssi_consumer_FFMPEGWriter" url="rgb.mp4">
        <input pin="kinect_rgb" frame="1"/>
        <xinput size="1">
                <input pin="audio"/>
        </xinput>
</consumer>
<consumer create="ssi_consumer_FileWriter" path="au">
        <input pin="kinect_au" frame="5"/>
</consumer>
<consumer create="ssi_consumer_FileWriter" path="face">
        <input pin="kinect_face" frame="5"/>
</consumer>

 

Based on the audiovisual content, raters were asked to annotate audible and visual cues of laughter in the recordings. Afterwards, features are extracted from the raw signals and each feature vector is labelled according to the annotation tracks. The labelled feature vectors serve as input for a learning phase, during which a separation of the feature space is sought that allows a good segregation of the class labels. For example, a feature measuring the extension of the lip corners may correlate with smiling and hence be picked as an indicator of enjoyment. Since in complex recognition tasks no definite mapping exists, numerous approaches have been proposed to solve this task. SSI includes several well-established learning algorithms, such as K-Nearest Neighbours, Gaussian Mixture Models, or Support Vector Machines. These algorithms are part of SSI’s machine learning library, which also provides tools to simulate (parts of) pipelines in a best-effort manner and to evaluate models in terms of expected recognition accuracy.

 

 

Figure 11: Manual annotations of the enjoyment cues are used to train detection models for each modality.

The following C++ code snippet gives an impression of how learning is accomplished using SSI’s machine learning library. First, a raw audio file (“user1.wav”) and an annotation file (“user1.anno”) are loaded. Next, the audio stream is converted into a list of samples to which feature extraction is applied. Finally, a model is trained on those samples. To see how well the model performs on the training data, an additional evaluation step is added.

// read audio
ssi_stream_t stream;
WavTools::ReadWavFile ("user1.wav", stream);

// read annotation
Annotation anno;
ModelTools::LoadAnnotation (anno, "user1.anno");

// create samples
SampleList samples;
ModelTools::LoadSampleList (samples, stream, anno, "user1");

// extract features
SampleList samples_t;
EmoVoiceFeat *ev = ssi_create (EmoVoiceFeat, "ev", true);
ModelTools::TransformSampleList (samples, samples_t, *ev);

// create model
IModel *svm = ssi_create (SVM, "svm", true);
Trainer trainer (svm);

// train and save
trainer.train (samples_t);
trainer.save ("user1.model"); // save the trained model to disk (file name chosen for illustration)

// evaluation
Evaluation eval;
eval.evalKFold (trainer, samples_t, 10);
eval.print ();

 

After the learning phase the pre-trained classification models are ready to be plugged into a pipeline. To accomplish the pipeline at hand, two models were trained: one to detect audible laughter cues (e.g. laughter bursts) in the voice, and one to detect visual cues (e.g. smiles) in the face. These are now connected to the corresponding feature extraction components. Activity detection is applied to decide whether a frame contains sufficient information to be included in the classification process. For example, if no face is detected in the current image or if the energy of an audio chunk is too low, the frame is discarded. Otherwise, classification is applied and the result is forwarded to the fusion component. The first steps in the pipeline are stream-based, i.e. signals are continuously processed over a fixed-length window. Later components responsible for cue detection and fusion are event-based processes, applied only where signals carry information relevant to the recognition process. To derive a final decision, both types of cues are finally combined.

The pipeline is basically an extension of the recording pipeline described earlier, which includes additional processing steps. To process the audio stream, the following snippet is added:

<!-- VOCAL ACTIVITY DETECTION -->
<transformer create="ssi_feature_AudioActivity" threshold="0.025">
        <input pin="audio" frame="19200" delta="28800"/>
        <output pin="voice_activity"/>
</transformer>      

<!-- VOCAL FEATURE EXTRACTION -->
<transformer create="ssi_feature_EmoVoiceFeat">
        <input pin="audio" frame="19200" delta="28800"/>
        <output pin="audio_feat"/>
</transformer>

<!-- VOCAL LAUGHTER CLASSIFICATION -->
<consumer create="ssi_consumer_Classifier" trainer="models\voice" sname="laughter" ename="voice">
        <input pin="audio_feat" frame="1" delta="0" trigger="voice_activity"></input>
</consumer>

 

While video processing is accomplished by:

<!-- FACIAL ACTIVITY DETECTION -->
<transformer create="ssi_feature_MicrosoftKinectFAD" minfaceframes="10">
        <input pin="kinect_face" frame="10" delta="15"/>
        <output pin="face_activity"/>
</transformer>      

<!-- FACIAL FEATURE EXTRACTION -->
<transformer create="ssi_feature_MicrosoftKinectAUFeat">
        <input pin="kinect_au" frame="10" delta="15"/>
        <output pin="kinect_au_feat"/>
</transformer>

<!-- FACIAL LAUGHTER CLASSIFICATION -->
<consumer create="ssi_consumer_Classifier" trainer="models\face" sname="laughter" ename="face">
        <input pin="kinect_au_feat" frame="1" delta="0" trigger="face_activity"></input>         
</consumer>

 

Obviously, both snippets share a very similar structure, though different components are loaded and frame/delta sizes are adjusted to fit the sample rates. Note that this time the trigger stream (voice_activity/face_activity) is directly applied via the keyword trigger in the <input> section of the classifier. The pre-trained models for detecting cues in the vocal and facial feature streams are loaded from file via the trainer option. The cue probabilities are finally combined via vector fusion:

<object create="ssi_listener_VectorFusionModality" ename="enjoyment" sname="fusion"
        update_ms="400" fusionspeed="1.0f" gradient="0.5f" threshold="0.1f" >
        <listen address="laughter@voice,face"/>
</object>

 

The core idea of vector-based fusion is to handle detected events (laughter cues in our case) as independent vectors in a single or multidimensional event space and to derive a final decision by aggregating the vectors while taking temporal relationships into account (the influence of an event is reduced over time) [2]. In contrast to standard segment-based fusion approaches, which force a decision in all modalities at each fusion step, it is decided individually if and when a modality contributes. The following animation illustrates this, with green dots representing detected cues, whereas red dots mean that no cue was detected. Note that the final fusion decision, represented by the green bar on the right, grows as cues are detected and afterwards shrinks again when no new input is added:
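A rough formalization of this idea (an illustrative sketch of the principle, not the exact model described in [2]): every detected cue contributes an event vector whose influence decays over time, and the fused decision at time t is read off the aggregated vector

\[
\vec{d}(t) \;=\; \sum_{i:\,t_i \le t} \max\!\bigl(0,\; 1 - g\,(t - t_i)\bigr)\,\vec{v}_i ,
\]

where $\vec{v}_i$ is the vector of the cue detected at time $t_i$ and $g$ plays the role of a decay parameter (compare the gradient option above). Modalities that currently deliver no cues simply add nothing to the sum, so the fused output shrinks again over time.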

 

Figure 12: The pre-trained models are applied to detect enjoyment cues at run-time (green dots). The more cues are detected across modalities, the higher is the output of the fusion algorithm (green bar).

The following video clip demonstrates the detection pipeline in action. The input streams are visualized on the left (top: video stream with face tracking, bottom: raw audio stream and activity plot). Laughter cues detected in the two modalities are shown in two separate bar plots on top of each other. The result of the final multi-modal laughter detection is shown in the bar plot on the far right.

Conclusion

In this article we have introduced Social Signal Interpretation (SSI), a multi-modal signal processing framework. We have presented the basic concepts of SSI and demonstrated, by means of two examples, how SSI allows users to quickly set up processing pipelines in XML. While in this article we have focused on how to build applications from available components, a simple plugin system offers developers the possibility to extend the pool of available modules with new components. By sharing these within the multimedia community, everyone is encouraged to enrich the functionality of SSI in the future. Readers who would like to learn more about SSI or get free access to the source code are invited to visit http://openssi.net.

Future Work

So far, applications developed with SSI target desktop machines, possibly distributed over several such machines within a network. Although wireless sensor devices are becoming more and more popular and offer some amount of mobility, it is not yet possible to monitor subjects outside a certain radius unless they take a desktop computer with them. Smartphones and similar pocket computers can help to overcome this limitation. In a pilot project, a plugin for SSI has been developed to stream audiovisual content and other sensor data in real-time over a wireless LAN connection from a mobile device running Android to an SSI server. The data is analysed on the fly on the server and the result is sent back to the mobile device. Such scenarios give ample scope for new applications that make it possible to follow a user “in the wild”. In the CARE project (a sentient Context-Aware Recommender System for the Elderly), a recommender system is currently being developed that assists elderly people living alone in their home environment by recommending, depending on the situation, physical, mental and social activities aimed at helping senior citizens to become self-confident, independent and active in everyday life again.

Acknowledgements

The work described in this article is funded by the European Union under the research grants CEEDs (FP7-ICT-2009-5) and TARDIS (FP7-ICT-2011-7), and by ILHAIRE, a Seventh Framework Programme project (FP7/2007-2013) under grant agreement n°270780.
