2013 | Book

Multimodal Pattern Recognition of Social Signals in Human-Computer-Interaction

First IAPR TC3 Workshop, MPRSS 2012, Tsukuba, Japan, November 11, 2012, Revised Selected Papers

Editors: Friedhelm Schwenker, Stefan Scherer, Louis-Philippe Morency

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science

About this book

This book constitutes the thoroughly refereed post-workshop proceedings of the First IAPR TC3 Workshop on Multimodal Pattern Recognition of Social Signals in Human-Computer-Interaction (MPRSS 2012), held in Tsukuba, Japan, in November 2012, as a satellite event of the 21st International Conference on Pattern Recognition (ICPR 2012). The 21 revised papers presented at the workshop cover facial expression recognition, audiovisual emotion recognition, multimodal information fusion architectures, learning from unlabeled and partially labeled data, learning of time series, companion technologies, and robotics.

Table of Contents

Frontmatter

Modelling Social Signals

Generative Modelling of Dyadic Conversations: Characterization of Pragmatic Skills During Development Age
Abstract
This work investigates the effect of children's age on pragmatic skills, i.e., on the way children participate in conversations, in particular with respect to turn management (who talks when and how much) and the use of silences and pauses. The proposed approach combines the extraction of “Steady Conversational Periods” (time intervals during which the structure of a conversation is stable) with Observed Influence Models, generative score spaces, and feature selection strategies. The experiments involve 76 children split into two age groups: “pre-school” (3-4 years) and “school” (6-8 years). The proposed statistical approach predicts the group each child belongs to with up to 85% precision. Furthermore, it identifies the pragmatic skills that best account for the difference between the two groups.
Anna Pesarin, Monja Tait, Alessandro Vinciarelli, Cristina Segalin, Giovanni Bilancia, Marco Cristani
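
The abstract characterizes “Steady Conversational Periods” only informally, as intervals during which the conversation's structure is stable. A minimal sketch of how such a segmentation could be computed, assuming two binary voice-activity tracks on a common time grid (the function and its inputs are illustrative, not the authors' implementation):

```python
import numpy as np

def steady_conversational_periods(vad_a, vad_b):
    """Segment two binary voice-activity tracks into maximal runs during
    which the joint speaking state (who is talking) stays constant.

    vad_a, vad_b: 0/1 sequences sampled on a common time grid.
    Returns (start, end, state) triples in sample indices, where state
    is a (speaker_a_active, speaker_b_active) pair.
    """
    joint = np.stack([np.asarray(vad_a), np.asarray(vad_b)], axis=1)
    change = np.flatnonzero((joint[1:] != joint[:-1]).any(axis=1)) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(joint)]))
    return [(int(s), int(e), tuple(int(v) for v in joint[s]))
            for s, e in zip(starts, ends)]

# Toy example: A talks, a shared pause, then B talks.
print(steady_conversational_periods([1, 1, 0, 0, 0], [0, 0, 0, 1, 1]))
# [(0, 2, (1, 0)), (2, 3, (0, 0)), (3, 5, (0, 1))]
```

Statistics over such periods (durations, who holds the floor, pause lengths) could then feed turn-management features of the kind the paper describes.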
Social Coordination Assessment: Distinguishing between Shape and Timing
Abstract
In this paper, we propose a new framework to assess temporal coordination (synchrony) and content coordination (behavior matching) in dyadic interaction. The synchrony module identifies the time lag and possible rhythm between partners. The imitation module assesses the distance between two gestures, based on one-class SVM models. These measures significantly discriminate conditions in which synchrony or behavior matching occurs from conditions in which these phenomena are absent. Moreover, the measures are unsupervised and could be implemented online.
Emilie Delaherche, Sofiane Boucenna, Koby Karp, Stéphane Michelet, Catherine Achard, Mohamed Chetouani
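
The synchrony module's task, identifying the time lag between partners, is stated without detail; a plain normalized cross-correlation over candidate lags is one standard way to do it (a sketch, not the authors' method):

```python
import numpy as np

def estimate_lag(a, b, max_lag):
    """Return the lag (in samples) at which b best follows a, found by
    maximizing the normalized cross-correlation over [-max_lag, max_lag]."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)

    def corr(lag):
        x, y = (a[:len(a) - lag], b[lag:]) if lag >= 0 else (a[-lag:], b[:lag])
        return float(np.dot(x, y)) / len(x)

    lags = list(range(-max_lag, max_lag + 1))
    scores = [corr(l) for l in lags]
    best = int(np.argmax(scores))
    return lags[best], scores[best]

# Toy check: b is a copy of a shifted by 3 samples, so the lag should be 3.
rng = np.random.default_rng(0)
a = rng.standard_normal(500)
print(estimate_lag(a, np.roll(a, 3), max_lag=10))
```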

Social Signals in Facial Expressions

A Novel LDA and HMM-Based Technique for Emotion Recognition from Facial Expressions
Abstract
Over the last few years, researchers have devoted considerable effort to emotion recognition from facial expressions using image processing and computer vision techniques. In this paper we explore the application of Latent Dirichlet Allocation (LDA), a technique conventionally used in natural language processing, in combination with a Hidden Markov Model (HMM), to this task. Classification is done at the image-sequence level. Each frame of an image sequence is represented by a feature vector, which is mapped to one of the words of a dictionary generated using K-means. Latent Dirichlet Allocation then models each image sequence as a set of topics. The order of topics in a sequence follows from the order of words, and we use it for classification in the next step by training a Hidden Markov Model for each emotion. The emotions dealt with are the six basic emotions (happy, fear, sad, surprise, angry, disgust) plus contempt. We compare our results with another technique in which the HMM learns facial expression dynamics from the sequence of words rather than topics. Results are presented on the CK+ dataset [2]. The proposed technique achieves an accuracy of 80.77%; the use of word sequences is found to give better results in general.
Akhil Bansal, Santanu Chaudhary, Sumantra Dutta Roy
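
As a rough sketch of the pipeline described above (K-means dictionary, LDA topics, one HMM per emotion), here is a toy version using scikit-learn and hmmlearn (assuming hmmlearn >= 0.3 for CategoricalHMM; the random data, dictionary size, and state counts are placeholders, not the paper's settings):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from hmmlearn import hmm  # assumption: hmmlearn >= 0.3

N_WORDS, N_TOPICS, N_STATES = 50, 8, 4

def frames_to_words(sequences, kmeans):
    """Map each frame's feature vector to a visual 'word' (cluster id)."""
    return [kmeans.predict(seq) for seq in sequences]

def words_to_topics(word_seqs, lda):
    """Assign each word its most probable topic, preserving frame order."""
    topic_of_word = np.argmax(lda.components_, axis=0)
    return [topic_of_word[w] for w in word_seqs]

# Toy random data standing in for per-frame facial features.
rng = np.random.default_rng(0)
train = {emo: [rng.standard_normal((30, 16)) for _ in range(5)]
         for emo in ["happy", "sad"]}

all_frames = np.vstack([f for seqs in train.values() for f in seqs])
kmeans = KMeans(n_clusters=N_WORDS, n_init=4, random_state=0).fit(all_frames)

# LDA sees each image sequence as a bag of words over the dictionary.
word_seqs = {e: frames_to_words(s, kmeans) for e, s in train.items()}
docs = np.array([np.bincount(w, minlength=N_WORDS)
                 for ws in word_seqs.values() for w in ws])
lda = LatentDirichletAllocation(n_components=N_TOPICS, random_state=0).fit(docs)

# One HMM per emotion, trained on that emotion's topic sequences.
models = {}
for emo, ws in word_seqs.items():
    topic_seqs = words_to_topics(ws, lda)
    X = np.concatenate(topic_seqs).reshape(-1, 1)
    m = hmm.CategoricalHMM(n_components=N_STATES, n_features=N_TOPICS,
                           n_iter=25, random_state=0)
    models[emo] = m.fit(X, [len(s) for s in topic_seqs])

def classify(seq):
    """Label a new image sequence by the highest-likelihood emotion HMM."""
    topics = words_to_topics(frames_to_words([seq], kmeans), lda)[0]
    obs = np.asarray(topics).reshape(-1, 1)
    return max(models, key=lambda e: models[e].score(obs))

print(classify(rng.standard_normal((30, 16))))
```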
Generation of Facial Expression for Communication Using Elfoid with Projector
Abstract
We propose a method for generating facial expressions with a mobile projector built into a cellphone-type tele-operated android called Elfoid. Elfoid is designed to transmit the presence of a speaker to a communication partner in a remote place using a camera and a microphone, and has a soft exterior that provides the look and feel of human skin. To transmit the speaker's presence, Elfoid sends not only voice but also facial expressions and emotion information captured by the camera and microphone. Elfoid cannot, however, display facial motions because of its compactness and the lack of sufficiently small actuator motors. We therefore use a mobile projector and generate projection patterns to represent the facial expressions estimated with the camera.
Maiya Hori, Hideki Takakura, Hiroki Yoshimura, Yoshio Iwai
Eye Localization from Infrared Thermal Images
Abstract
Using knowledge of facial structure and temperature distribution, this paper proposes an automatic eye localization method for infrared thermal images. A facial structure consisting of 15 sub-regions is proposed for extracting Haar-like features. Eight classifiers are learned from features selected by the AdaBoost algorithm for the left and right eyes, respectively. A voting strategy is used to find the most likely eye positions. Experimental results on the NVIE and Equinox databases show the effectiveness of our approach.
Shangfei Wang, Peijia Shen, Zhilei Liu
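
A hedged sketch of the vote over per-region boosted classifiers, with scikit-learn's AdaBoostClassifier (>= 1.2 for the `estimator` keyword) standing in for the paper's Haar-feature learners; all data here is synthetic:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def train_region_classifiers(features_per_region, labels):
    """Train one boosted classifier per facial sub-region.
    features_per_region[i] holds Haar-like feature vectors for region i."""
    clfs = []
    for region_feats in features_per_region:
        clf = AdaBoostClassifier(
            estimator=DecisionTreeClassifier(max_depth=1),  # decision stumps
            n_estimators=50, random_state=0)
        clfs.append(clf.fit(region_feats, labels))
    return clfs

def vote_eye_candidate(clfs, candidate_features):
    """Each classifier votes on candidate eye locations; the candidate
    with the most positive votes wins."""
    votes = sum(clf.predict(candidate_features) for clf in clfs)
    return int(np.argmax(votes))

# Toy demo: 8 regions, 100 samples of 20-D Haar-like features each.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100)
feats = [rng.standard_normal((100, 20)) + y[:, None] * 0.5 for _ in range(8)]
clfs = train_region_classifiers(feats, y)
candidates = rng.standard_normal((10, 20))
print("best candidate index:", vote_eye_candidate(clfs, candidates))
```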

Analysis of Speech and Physiological Signals

The Effect of Fuzzy Training Targets on Voice Quality Classification
Abstract
The dynamic use of voice qualities in spoken language can reveal useful information about a speaker's attitude, mood, and affective state. This information may be desirable for a range of speech technology applications. However, annotations of voice quality are frequently inconsistent across raters. But whom should one trust, or is the truth somewhere in between? The current study first describes a voice quality feature set suitable for differentiating voice qualities on a tense-to-breathy dimension. These features are used as inputs to a fuzzy-input fuzzy-output support vector machine (F2SVM) algorithm that automatically classifies the voice qualities. The F2SVM is compared to standard approaches and shows promising results: performances of around 90% are achieved in cross-validation, leave-one-speaker-out, and cross-corpus experiments.
Stefan Scherer, John Kane, Christer Gobl, Friedhelm Schwenker
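
Scikit-learn has no F2SVM, so the following sketch only approximates the idea: crisp labels are taken from the dominant rater membership, and inter-rater agreement is fed in as per-sample weights, so unanimous samples shape the boundary more than contested ones. This is a stand-in, not the paper's algorithm:

```python
import numpy as np
from sklearn.svm import SVC

# Fuzzy targets: per-sample degree of membership in 'tense' (vs. 'breathy'),
# e.g. the fraction of raters who chose 'tense'. Data here is synthetic.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 12))
membership = rng.uniform(0, 1, 200)
X[:, 0] += 2 * (membership - 0.5)  # make the toy problem learnable

# Crisp labels from the dominant membership; rater agreement as weights,
# so contested samples pull the decision boundary less.
y = (membership > 0.5).astype(int)
agreement = np.abs(membership - 0.5) * 2  # 0 = split raters, 1 = unanimous

clf = SVC(kernel="rbf", probability=True)
clf.fit(X, y, sample_weight=agreement)

# Fuzzy *outputs*: read graded class memberships from the probabilities.
print(clf.predict_proba(X[:3]).round(2))
```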
Physiological Effects of Delayed System Response Time on Skin Conductance
Abstract
Research on the psychological effects of delayed system response time (SRT) has not lost its topicality, since uncertainty in providing immediate system responses remains even after decades of stunning advances in computer science. When delays occur, the user's expectation about the temporal course of an interaction is not fulfilled, which the user may find irritating. The current study investigates the physiological effects on skin conductance (SC) and its particular patterns in two experimental scenarios. In the first scenario, unexpected delays of 0.5, 1, and 2 seconds occur while subjects perform a two-choice auditory categorization task, expecting the system to respond immediately after their input. The second scenario is a Wizard-of-Oz (WOz) setting in which the user plays the game 'Concentration', which is manipulated in order to induce various emotional states; during the 'negative' sequences, delays of 6 seconds are triggered. The patterns of the mean SC curves during the delays are analyzed.
David Hrabal, Christin Kohrs, André Brechmann, Jun-Wen Tan, Stefanie Rukavina, Harald C. Traue
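
A minimal sketch of the kind of analysis the abstract describes: cutting baseline-corrected skin-conductance epochs around delay onsets and averaging them into a mean SC curve (the sampling rate, window lengths, and recording below are made up):

```python
import numpy as np

def mean_sc_curve(sc, onsets, fs, pre=1.0, post=6.0):
    """Average baseline-corrected skin-conductance epochs around delay
    onsets. sc: 1-D SC signal; onsets: onset times in seconds; fs: Hz."""
    n_pre, n_post = int(pre * fs), int(post * fs)
    epochs = []
    for t in onsets:
        i = int(t * fs)
        if i - n_pre < 0 or i + n_post > len(sc):
            continue  # skip epochs cut off by the recording edges
        epoch = sc[i - n_pre:i + n_post]
        epochs.append(epoch - epoch[:n_pre].mean())  # baseline-correct
    return np.mean(epochs, axis=0)

# Toy recording: 60 s of drifting SC at 32 Hz, with a small response
# injected ~2 s after each hypothetical delay onset.
fs = 32
rng = np.random.default_rng(0)
t = np.arange(60 * fs) / fs
sc = 5.0 + 0.01 * t + 0.02 * rng.standard_normal(t.size)
onsets = [10.0, 25.0, 40.0]
for o in onsets:
    sc[int((o + 2) * fs):int((o + 3) * fs)] += 0.3
curve = mean_sc_curve(sc, onsets, fs)
print("mean-curve peak %.2f s after onset" % ((np.argmax(curve) - fs) / fs))
```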
A Non-invasive Multi-sensor Capturing System for Human Physiological and Behavioral Responses Analysis
Abstract
We present a new non-invasive multi-sensor capturing system for recording video, sound, and motion data. Its distinguishing characteristics are hardware-level synchronization of all sensors with accuracy on the order of 1 ms, and the automatic extraction of a variety of ground-truth annotations from the data. The proposed system enables analysis of the correlations among a variety of psychophysiological modalities, such as facial expressions, body temperature changes, and gaze. Following benchmark-driven framework principles, the data captured by our system is used to establish benchmarks for evaluating the algorithms involved in automatic emotion recognition.
Senya Polikovsky, Maria Alejandra Quiros-Ramirez, Takehisa Onisawa, Yoshinari Kameda, Yuichi Ohta

Motion Analysis and Activity Recognition

3D Motion Estimation of Human Body from Video with Dynamic Camera Work
Abstract
Occlusion and camera settings produce a high degree of ambiguity when estimating human body motion from monocular video sequences. Good human motion models are an important means of addressing this problem. In this work, we propose a hierarchical motion model, together with an estimation method for it, that recovers human motion without camera calibration and under free camera operation. The model generates particles in multiple spaces and can thus estimate the camera view and the human motion at the same time. We show the feasibility of 3D motion estimation for simple movements such as walking, without camera calibration and with dynamic camera work.
Matsumoto Ayumi, Wu Xiaojun, Kawamura Harumi, Kojima Akira
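
A generic predict-weight-resample step for a particle filter whose state vector stacks camera and pose parameters, one plausible reading of "particles in multi-spaces"; the dynamics and observation model below are toy stand-ins, not the paper's hierarchical model:

```python
import numpy as np

def pf_step(particles, weights, transition, likelihood, obs, rng):
    """One predict-weight-resample step of a particle filter whose state
    vector stacks camera parameters and pose parameters, so both are
    estimated jointly."""
    particles = transition(particles, rng)           # predict
    weights = weights * likelihood(particles, obs)   # reweight by observation
    weights /= weights.sum()
    n = len(particles)                               # systematic resampling
    u = (rng.random() + np.arange(n)) / n
    idx = np.minimum(np.searchsorted(np.cumsum(weights), u), n - 1)
    return particles[idx], np.full(n, 1.0 / n)

# Toy model: state = [camera_angle, joint_angle], observed with noise.
rng = np.random.default_rng(0)
truth = np.array([0.4, 1.1])
particles = rng.uniform(0.0, 2.0, size=(1000, 2))
weights = np.full(1000, 1.0 / 1000)
transition = lambda p, r: p + 0.02 * r.standard_normal(p.shape)
likelihood = lambda p, z: np.exp(-0.5 * np.sum(((p - z) / 0.2) ** 2, axis=1))

for _ in range(20):
    obs = truth + 0.05 * rng.standard_normal(2)
    particles, weights = pf_step(particles, weights, transition,
                                 likelihood, obs, rng)
print("estimate:", particles.mean(axis=0).round(2), "truth:", truth)
```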
Motion History of Skeletal Volumes and Temporal Change in Bounding Volume Fusion for Human Action Recognition
Abstract
Human action recognition is an important area of research in computer vision, with applications including surveillance systems, patient monitoring, and human-computer interaction, to name a few. Numerous techniques have been developed to solve this problem in 2D and 3D spaces, and 3D imaging has attracted considerable interest in recent years. In this paper we propose a novel view-independent action recognition algorithm based on the fusion of a global feature and a graph-based feature. The first feature is the motion history of skeletal volumes: we compute a skeleton for each volume and a motion history for each action, align them using a cylindrical-coordinates-based Fourier transform to form a feature vector, reduce dimensionality with PCA, and classify actions using the Mahalanobis distance and Linear Discriminant Analysis. The second feature is the temporal change in bounding volume: volumes are aligned using PCA and each is divided into sub-volumes; the temporal change in volume is then calculated and classified using Logistic Model Trees. Fusion is done by majority vote. The proposed technique is evaluated on the benchmark IXMAS and i3DPost datasets, where the fusion results are compared against each feature used individually. The results demonstrate that fusion improves recognition accuracy over the individual features and can recognize human actions independently of viewpoint and scale.
Abubakrelsedik Karali, Mohammed ElHelw
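
The final fusion step, a majority vote over the per-feature classifier decisions, is simple enough to state directly; a sketch with a tie-break rule of our own choosing (the abstract does not specify one):

```python
from collections import Counter

def majority_vote(*label_lists):
    """Fuse per-sample predictions from several classifiers by majority
    vote; ties fall back to the first classifier's prediction."""
    fused = []
    for preds in zip(*label_lists):
        top = Counter(preds).most_common()
        if len(top) > 1 and top[0][1] == top[1][1]:
            fused.append(preds[0])  # tie: trust the first feature stream
        else:
            fused.append(top[0][0])
    return fused

# Toy demo with three hypothetical per-feature classifiers.
skeleton_preds = ["walk", "kick", "punch", "walk"]
bounding_preds = ["walk", "kick", "walk", "sit"]
third_preds    = ["run",  "kick", "walk", "sit"]
print(majority_vote(skeleton_preds, bounding_preds, third_preds))
# ['walk', 'kick', 'walk', 'sit']
```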
Multi-view Multi-modal Gait Based Human Identity Recognition from Surveillance Videos
Abstract
In this paper we propose a novel human identification scheme based on long-range gait profiles in surveillance videos. We investigate the role of multi-view gait images acquired from multiple cameras, the importance of infrared and visible-range images in ascertaining identity, the impact of multimodal fusion, efficient subspace features and classifier methods, and the role of a soft/secondary biometric (walking style) in enhancing the accuracy and robustness of the identification system. Experimental evaluation of several subspace-based gait feature extraction approaches (PCA/LDA) and learning classifiers (NB/MLP/SVM/SMO) on different datasets from the publicly available CASIA gait database shows significant improvements in recognition accuracy with multimodal fusion of multi-view gait images from visible and infrared cameras acquired in video surveillance scenarios.
Emdad Hossain, Girija Chetty, Roland Goecke
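
A sketch of the subspace-plus-classifier chain the abstract enumerates (PCA/LDA features, SVM classification), with feature-level fusion of two modalities by concatenation; the data and dimensions are synthetic placeholders, not CASIA:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Toy stand-ins for flattened gait images from visible and infrared
# cameras, fused at the feature level by concatenation.
rng = np.random.default_rng(0)
n, n_ids = 120, 6
y = np.repeat(np.arange(n_ids), n // n_ids)
visible = rng.standard_normal((n, 400)) + y[:, None] * 0.15
infrared = rng.standard_normal((n, 400)) + y[:, None] * 0.15
X = np.hstack([visible, infrared])  # multimodal feature-level fusion

pipe = Pipeline([
    ("pca", PCA(n_components=40)),            # subspace projection
    ("lda", LinearDiscriminantAnalysis()),    # discriminative subspace
    ("svm", SVC(kernel="linear")),            # identity classifier
])
print("CV accuracy: %.2f" % cross_val_score(pipe, X, y, cv=5).mean())
```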

Multimodal Fusion

Using the Transferable Belief Model for Multimodal Input Fusion in Companion Systems
Abstract
Systems with multimodal interaction capabilities have gained a lot of attention in recent years. Especially so-called companion systems, which offer an adaptive, multimodal user interface, show great promise for natural human-computer interaction. While more and more sophisticated sensors become available, current systems capable of accepting multimodal inputs (e.g., speech and gesture) still lack the robustness of input interpretation needed for companion systems. We demonstrate how evidential reasoning can be applied in the domain of graphical user interfaces to provide the reliability and robustness expected by users. For this purpose, an existing approach from the robotics domain based on the Transferable Belief Model is adapted and extended.
Felix Schüssel, Frank Honold, Michael Weber
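
The Transferable Belief Model's core operation is the unnormalized conjunctive combination of mass functions, which, unlike Dempster's normalized rule, keeps the conflict as mass on the empty set. A self-contained sketch (the button-selection example is ours, not the paper's):

```python
from itertools import product

def tbm_combine(m1, m2):
    """Unnormalized conjunctive combination of two mass functions.
    Masses are dicts mapping frozensets of hypotheses to belief mass;
    mass landing on the empty set records the conflict between sources."""
    out = {}
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        c = a & b  # may be empty: that mass measures the conflict
        out[c] = out.get(c, 0.0) + wa * wb
    return out

# Toy demo: did the user mean button A or B? Speech vs. pointing gesture.
speech  = {frozenset({"A"}): 0.7, frozenset({"A", "B"}): 0.3}
gesture = {frozenset({"B"}): 0.6, frozenset({"A", "B"}): 0.4}
for focal, mass in tbm_combine(speech, gesture).items():
    print(set(focal) or "empty set (conflict)", round(mass, 2))
```

On this input the combination assigns 0.28 to {A}, 0.18 to {B}, 0.12 to {A, B}, and 0.42 to the empty set, making the disagreement between the two modalities explicit rather than renormalizing it away.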
Fusion of Fragmentary Classifier Decisions for Affective State Recognition
Abstract
Real human-computer interaction systems based on different modalities face the problem that not all information channels are available at every time step. Nevertheless, an estimate of the current user state is required at any time, so that the system can react instantaneously based on the modalities that are available. A novel approach to the decision fusion of such fragmentary classifications is therefore proposed and empirically evaluated on the audio and video signals of a corpus of non-acted user behavior. It is shown that visual and prosodic analysis successfully complement each other, leading to an outstanding performance of the fusion architecture.
Gerald Krell, Michael Glodek, Axel Panning, Ingo Siegert, Bernd Michaelis, Andreas Wendemuth, Friedhelm Schwenker
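
One simple way to realize fusion over fragmentary decisions, not necessarily the authors' architecture, is to combine whichever class posteriors are present in a time window, weighted by per-channel reliability:

```python
import numpy as np

def fuse_available(posteriors, reliabilities):
    """Fuse class posteriors from whichever modalities delivered a
    decision in the current time window; missing channels are None.

    posteriors: list of per-channel class-probability vectors or None.
    reliabilities: static per-channel weights (e.g. validation accuracy).
    """
    acc, total = None, 0.0
    for p, w in zip(posteriors, reliabilities):
        if p is None:
            continue  # channel delivered nothing this window
        acc = w * np.asarray(p) if acc is None else acc + w * np.asarray(p)
        total += w
    if acc is None:
        raise ValueError("no modality available in this window")
    return acc / total

# Toy demo: video sees the face, but prosody is silent this window.
video = [0.6, 0.3, 0.1]
audio = None
print(fuse_available([video, audio], reliabilities=[0.8, 0.7]))
# falls back to the video estimate alone
```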
Backmatter
Metadata

Title: Multimodal Pattern Recognition of Social Signals in Human-Computer-Interaction
Editors: Friedhelm Schwenker, Stefan Scherer, Louis-Philippe Morency
Copyright Year: 2013
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-642-37081-6
Print ISBN: 978-3-642-37080-9
DOI: https://doi.org/10.1007/978-3-642-37081-6
