
2017 | Book

Multimodal Pattern Recognition of Social Signals in Human-Computer-Interaction

4th IAPR TC 9 Workshop, MPRSS 2016, Cancun, Mexico, December 4, 2016, Revised Selected Papers


About this book

This book constitutes the thoroughly refereed post-workshop proceedings of the Fourth IAPR TC 9 Workshop on Multimodal Pattern Recognition of Social Signals in Human-Computer-Interaction, MPRSS 2016, held in Cancun, Mexico, in December 2016.

The 13 revised papers presented focus on pattern recognition, machine learning and information fusion methods with applications in social signal processing, including multimodal emotion recognition, user identification, and recognition of human activities.

Table of Contents

Frontmatter
Active Shape Model vs. Deep Learning for Facial Emotion Recognition in Security
Abstract
As facial emotion recognition becomes more important every day, a research experiment was conducted to find the best approach to the task. Deep Learning (DL) and the Active Shape Model (ASM) were tested. Researchers have applied both Deep Learning and the Active Shape Model to facial emotion recognition in the past, seeking to determine which approach is better suited to this kind of technology. Both methods were tested on two different datasets, and our findings were consistent: the Active Shape Model performed better than Deep Learning. However, Deep Learning was faster and easier to implement, which suggests that, with better Deep Learning software, Deep Learning will become the stronger choice for recognizing and classifying facial emotions. In this experiment, Deep Learning achieved 60% accuracy on the CAFE dataset, whereas the Active Shape Model achieved 93%. Likewise, on the JAFFE dataset, Deep Learning achieved 63% accuracy and the Active Shape Model achieved 83%.
Monica Bebawy, Suzan Anwar, Mariofanna Milanova
Bimodal Recognition of Cognitive Load Based on Speech and Physiological Changes
Abstract
An essential component of human interaction is reacting, through emotional intelligence, to the emotional states of one's counterpart and responding appropriately; this is what makes interpersonal communication successful. The first step towards this goal within HCI is the identification of these emotional states.

This paper deals with the development of procedures and an automated classification system for recognizing mental overload and mental underload from speech and physiological signals. Mental load states are induced through easy and tedious tasks for mental underload and through complex and hard tasks for mental overload. We show how to select suitable features and build unimodal classifiers, which are then combined into a bimodal mental load estimate by means of early and late fusion. Additionally, the impact of speech artifacts on physiological data is investigated.
Dennis Held, Sascha Meudt, Friedhelm Schwenker
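As a rough illustration of the early/late fusion distinction described above (not the authors' pipeline; features, classifier, and dimensions are synthetic placeholders):

```python
# Minimal sketch of early vs. late fusion for a bimodal classifier.
# Synthetic stand-ins for speech and physiological feature matrices;
# the paper's actual features and classifiers are not reproduced here.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
X_speech = rng.normal(size=(n, 20))      # hypothetical speech features
X_physio = rng.normal(size=(n, 8))       # hypothetical physiological features
y = rng.integers(0, 2, size=n)           # 0 = underload, 1 = overload

idx_train, idx_test = train_test_split(np.arange(n), random_state=0)

# Early fusion: concatenate feature vectors, train one classifier.
X_early = np.hstack([X_speech, X_physio])
clf_early = SVC(probability=True).fit(X_early[idx_train], y[idx_train])

# Late fusion: train one classifier per modality, average the posteriors.
clf_s = SVC(probability=True).fit(X_speech[idx_train], y[idx_train])
clf_p = SVC(probability=True).fit(X_physio[idx_train], y[idx_train])
p_late = 0.5 * (clf_s.predict_proba(X_speech[idx_test])
                + clf_p.predict_proba(X_physio[idx_test]))
y_late = p_late.argmax(axis=1)           # fused decision per test sample
```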
Human Mobility-Pattern Discovery and Next-Place Prediction from GPS Data
Abstract
We provide a novel algorithm for the discovery of mobility patterns and the prediction of users' destination locations, both in terms of geographic coordinates and semantic meaning. We did not use any semantic data voluntarily provided by a user, and there was no sharing of data among users. An advantage of our algorithm is that it allows a trade-off between prediction accuracy and information. Experimental validation was conducted on a GPS dataset collected from 168 users over a period of more than five years in the Microsoft Research Asia GeoLife project.
Faina Khoroshevsky, Boaz Lerner
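The abstract does not spell out the algorithm itself; as a much simplified, hypothetical baseline for the next-place task, a first-order Markov predictor over grid-discretized GPS fixes might look like this:

```python
# A first-order Markov next-place predictor over discretized locations:
# a crude illustrative baseline, not the paper's algorithm.
from collections import Counter, defaultdict

def discretize(lat, lon, cell=0.01):
    """Map a GPS fix to a coarse grid cell (~1 km); a crude 'place'."""
    return (round(lat / cell), round(lon / cell))

def fit_transitions(trajectory):
    """trajectory: list of (lat, lon) fixes in temporal order."""
    counts = defaultdict(Counter)
    places = [discretize(lat, lon) for lat, lon in trajectory]
    for a, b in zip(places, places[1:]):
        if a != b:                       # count only actual moves
            counts[a][b] += 1
    return counts

def predict_next(counts, lat, lon):
    """Return the most frequently observed successor place, if any."""
    here = discretize(lat, lon)
    nxt = counts.get(here)
    return nxt.most_common(1)[0][0] if nxt else None
```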
Fusion Architectures for Multimodal Cognitive Load Recognition
Abstract
Knowledge about the user's emotional state is important for achieving human-like, natural Human Computer Interaction (HCI) in modern technical systems. When communicating, humans rely on implicit signals such as body gestures and posture, vocal changes (e.g. pitch), and facial expressions. We investigate the relation between these signals and human emotion, specifically when completing easy or difficult tasks. Additionally, we include physiological data, which also varies with changes in cognitive load. We focus on discriminating between mental overload and mental underload, which can be useful, e.g., in an e-tutorial system. Mental underload is a new term used to describe the state a person is in when completing a dull or boring task. We show how to select suitable features and build unimodal classifiers, which are then combined into a multimodal mental load estimate by means of Markov Fusion Networks (MFN) and Kalman Filter Fusion (KFF).
Daniel Kindsvater, Sascha Meudt, Friedhelm Schwenker
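A minimal sketch of the Kalman-filter-fusion idea, treating each unimodal classifier output as a noisy measurement of a latent scalar load level; an assumption-laden stand-in, not the authors' KFF combiner:

```python
# One-dimensional Kalman-filter fusion of per-modality load estimates.
# Each unimodal output is treated as a noisy measurement of the latent
# cognitive-load level; missing modalities (NaN) are simply skipped.
import numpy as np

def kalman_fuse(measurements, meas_var, process_var=0.01):
    """measurements: (T, M) array, one column per modality (NaN = missing);
    meas_var: length-M measurement variances, one per modality."""
    x, p = 0.0, 1.0                      # state estimate and its variance
    fused = []
    for z_t in measurements:
        p += process_var                 # predict: load may drift over time
        for m, z in enumerate(z_t):      # update with each available modality
            if np.isnan(z):
                continue
            k = p / (p + meas_var[m])    # Kalman gain
            x += k * (z - x)
            p *= (1.0 - k)
        fused.append(x)
    return np.array(fused)
```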
Face Recognition in Home Security System Using Tensor Decomposition Based on Radix-(2 × 2) Hierarchical SVD
Abstract
This paper presents research on improving a real-time face recognition system using the new Radix-(2 × 2) Hierarchical Singular Value Decomposition (HSVD) for 3rd-order tensors. Scientific interest in processing image sequences represented as tensors has increased significantly in recent years. Current home security solutions can be cost-prohibitive, prone to false alarms, and fail to alert the user of a break-in while they are away from home. Because of the advancements in facial detection and recognition techniques made in the past decade, we propose a home security system that takes advantage of this technology. Creating such a system at low cost requires algorithms that are powerful enough to detect users in various environmental conditions and fast enough to process real-time video on weaker hardware. Experiments compare the efficiency of two different decomposition techniques applied to face recognition in real time.
Roumen Kountchev, Suzan Anwar, Roumiana Kountcheva, Mariofanna Milanova
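For orientation, the classical truncated higher-order SVD (HOSVD) of a 3rd-order image tensor via its mode unfoldings is sketched below; the authors' Radix-(2 × 2) HSVD is a different, hierarchical construction:

```python
# Illustrative truncated HOSVD of a 3rd-order tensor of face images
# (height x width x frames), via SVD of the mode unfoldings. This is the
# classical construction, not the authors' Radix-(2 x 2) HSVD.
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: mode's axis becomes the rows of a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T, ranks):
    # Leading left singular vectors of each unfolding give the factors.
    U = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
         for m, r in enumerate(ranks)]
    core = T
    for m, Um in enumerate(U):           # project onto each factor basis
        core = np.moveaxis(
            np.tensordot(Um.T, np.moveaxis(core, m, 0), axes=1), 0, m)
    return core, U

faces = np.random.rand(64, 64, 30)       # synthetic image stack
core, factors = hosvd(faces, ranks=(16, 16, 8))
```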
Performance Analysis of Gesture Recognition Classifiers for Building a Human Robot Interface
Abstract
In this paper we present a natural human computer interface based on gesture recognition. The principal aim is to study how different personalized gestures, defined by users, can be represented in terms of features and modelled by classification approaches in order to obtain the best gesture recognition performance. Ten different gestures involving movement of the left arm are performed by different users. Different classification methodologies (SVM, HMM, NN, and DTW) are compared, and their performances and limitations are discussed. An ensemble of classifiers is proposed to produce more favorable results than those of a single-classifier system. The problems of differing gesture execution lengths, variability in gesture representation, and the generalization ability of the classifiers are analyzed, and valuable insight into possible recommendations is provided.
Tiziana D’Orazio, Nicola Mosca, Roberto Marani, Grazia Cicirelli
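Of the compared classifiers, DTW directly addresses the varying-length problem mentioned above; a minimal distance computation (standard textbook DTW, not the authors' configuration):

```python
# Minimal dynamic time warping (DTW) distance between two gesture
# feature sequences of possibly different lengths.
import numpy as np

def dtw(a, b):
    """a, b: sequences of feature vectors, shapes (n, d) and (m, d)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)  # cumulative cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]
```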
On Automatic Question Answering Using Efficient Primal-Dual Models
Abstract
Automatic question answering has been a major problem in natural language processing since the early days of research in the field. Given a large dataset of question-answer pairs, the problem can be tackled using text matching in two steps: find a set of questions similar to a given query in the dataset, and then answer the query by evaluating the answers stored in the dataset for those questions. In this paper, we treat the text matching problem as an instance of the inexact graph matching problem and propose an efficient approximate matching scheme. We utilize the well-known quadratic optimization problem of metric labeling as the framework for graph matching. To solve the text matching problem, we first embed the sentences given in natural language into a weighted directed graph. Next, we present a primal-dual approximation algorithm for the linear programming relaxation of the metric labeling problem to match text graphs. We demonstrate the utility of our approach on a question answering task over a large dataset which involves matching of questions as well as plain text.
Yusuf Osmanlıoğlu, Ali Shokoufandeh
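To make the metric labeling objective concrete, the sketch below states the cost being minimized and applies a simple iterated-conditional-modes heuristic; the paper's primal-dual approximation algorithm is not reproduced here:

```python
# Metric labeling: minimize per-node assignment costs plus edge weights
# times a metric between the endpoints' labels. ICM is a simple local
# heuristic shown only to make the objective concrete.
import numpy as np

def labeling_cost(labels, assign_cost, edges, metric):
    """assign_cost: (n, k); edges: (u, v, weight) triples; metric: (k, k)."""
    cost = assign_cost[np.arange(len(labels)), labels].sum()
    cost += sum(wt * metric[labels[u], labels[v]] for u, v, wt in edges)
    return cost

def icm(assign_cost, edges, metric, iters=10):
    n, k = assign_cost.shape
    labels = assign_cost.argmin(axis=1)          # start from unary optimum
    for _ in range(iters):
        for node in range(n):                    # relabel one node at a time
            cand = assign_cost[node].astype(float).copy()
            for u, v, wt in edges:               # add pairwise costs for node
                if u == node:
                    cand += wt * metric[:, labels[v]]
                elif v == node:
                    cand += wt * metric[labels[u], :]
            labels[node] = cand.argmin()
    return labels
```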
Hierarchical Bayesian Multiple Kernel Learning Based Feature Fusion for Action Recognition
Abstract
Human action recognition is an area of increasing significance that has attracted much research attention in recent years. Fusing multiple features is intuitively an appropriate way to better recognize actions in videos, as a single type of feature cannot capture the visual characteristics sufficiently. However, most of the existing fusion methods used for action recognition fail to measure the contributions of the different features and may not guarantee a performance improvement over the individual features. In this paper, we propose a new Hierarchical Bayesian Multiple Kernel Learning (HB-MKL) model to effectively fuse diverse types of features for action recognition. The model is able to adaptively evaluate the optimal weights of the base kernels constructed from different features to form a composite kernel. We evaluate the effectiveness of our method with complementary features capturing both appearance and motion information from the videos on challenging human action datasets, and the experimental results demonstrate the potential of HB-MKL for action recognition.
Wen Sun, Chunfeng Yuan, Pei Wang, Shuang Yang, Weiming Hu, Zhaoquan Cai
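The heart of multiple kernel learning is the composite kernel, a weighted sum of per-feature base Gram matrices. A sketch with fixed illustrative weights follows; HB-MKL's hierarchical Bayesian inference of those weights is not shown:

```python
# Composite kernel = weighted sum of base kernels, one per feature type.
# Weights are fixed placeholders here; HB-MKL infers them from data.
import numpy as np
from sklearn.svm import SVC

def rbf_gram(X, gamma=0.1):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def composite_kernel(kernels, weights):
    """kernels: list of (n, n) base Gram matrices, one per feature type."""
    return sum(w * K for w, K in zip(weights, kernels))

rng = np.random.default_rng(0)
X_app = rng.normal(size=(100, 30))       # hypothetical appearance features
X_mot = rng.normal(size=(100, 10))       # hypothetical motion features
y = rng.integers(0, 3, size=100)         # hypothetical action labels
K = composite_kernel([rbf_gram(X_app), rbf_gram(X_mot)], [0.6, 0.4])
clf = SVC(kernel="precomputed").fit(K, y)  # classify with the fused kernel
```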
Audio Visual Speech Recognition Using Deep Recurrent Neural Networks
Abstract
In this work, we propose a training algorithm for an audio-visual automatic speech recognition (AV-ASR) system using a deep recurrent neural network (RNN). First, we train a deep RNN acoustic model with a Connectionist Temporal Classification (CTC) objective function. The frame labels obtained from the acoustic model are then used to perform a non-linear dimensionality reduction of the visual features using a deep bottleneck network. Audio and visual features are fused and used to train a fusion RNN. The use of bottleneck features for the visual modality helps the model converge properly during training. Our system is evaluated on the GRID corpus. Our results show that the presence of the visual modality gives a significant improvement in character error rate (CER) at various noise levels, even when the model is trained without noisy data. We also provide a comparison of two fusion methods: feature fusion and decision fusion.
Abhinav Thanda, Shankar M. Venkatesan
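A sketch of the supervised bottleneck idea for the visual stream: a network trained on the acoustic model's frame labels whose narrow hidden layer then supplies low-dimensional visual features for fusion (dimensions are illustrative, not the paper's):

```python
# Supervised bottleneck network: train a frame-label classifier with a
# narrow hidden layer, then use that layer's activations as reduced
# visual features. Dimensions here are placeholders.
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    def __init__(self, in_dim=1024, bottleneck=40, n_labels=30):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, bottleneck))          # narrow bottleneck layer
        self.head = nn.Linear(bottleneck, n_labels)

    def forward(self, x):                        # logits over frame labels
        return self.head(self.encoder(x))

net = BottleneckNet()                            # train on frame labels first
video_frames = torch.randn(16, 1024)             # hypothetical visual features
with torch.no_grad():
    visual_feats = net.encoder(video_frames)     # 40-dim features for fusion
```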
Audio-Visual Recognition of Pain Intensity
Abstract
In this work, a multi-modal pain intensity recognition system based on both audio and video channels is presented. The system is assessed on a newly recorded dataset consisting of several individuals, each subjected to three gradually increasing levels of painful heat stimuli under controlled conditions. The assessment of the dataset consists of the extraction of a multitude of features from each modality, followed by an evaluation of the discriminative power of each extracted feature set. Finally, several fusion architectures, involving early and late fusion, are assessed. The temporal availability of the audio channel is taken into consideration during the assessment of the fusion architectures.
Patrick Thiam, Viktor Kessler, Steffen Walter, Günther Palm, Friedhelm Schwenker
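One way to honor the audio channel's limited temporal availability in late fusion is to let the audio posterior contribute only in windows where it is active; a hypothetical sketch, not the authors' architecture:

```python
# Availability-aware late fusion: the audio posterior contributes only in
# windows where the audio channel is usable. Weights are illustrative.
import numpy as np

def fuse(p_video, p_audio, audio_active, w_audio=0.4):
    """p_video, p_audio: (T, C) per-window posteriors over pain levels;
    audio_active: (T,) boolean mask of windows with usable audio."""
    w = np.where(audio_active, w_audio, 0.0)[:, None]
    fused = (1.0 - w) * p_video + w * p_audio
    return fused / fused.sum(axis=1, keepdims=True)  # renormalize per window
```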
The SenseEmotion Database: A Multimodal Database for the Development and Systematic Validation of an Automatic Pain- and Emotion-Recognition System
Abstract
In our modern industrial society, the group of older people (generation 65+) is constantly growing. Many members of this group are severely affected by health problems and suffer from disability and pain. The problem with chronic illness and pain is that it lowers the patient's quality of life; accurate pain assessment is therefore needed to facilitate effective pain management and treatment. In the future, automatic pain monitoring may enable health care professionals to assess and manage pain in an increasingly objective way. To this end, the goal of our SenseEmotion project is to develop automatic pain- and emotion-recognition systems for the successful assessment and effective personalized management of pain, particularly for the generation 65+. In this paper the recently created SenseEmotion Database for pain- vs. emotion-recognition is presented. Data from 45 healthy subjects were collected for this database; for each subject, approximately 30 min of multimodal sensory data was recorded. For a comprehensive understanding of pain and affect, three rather different modalities are included in this study: biopotentials, camera images of the facial region and, for the first time, audio signals. Heat stimulation is applied to elicit pain, and affective image stimuli accompanied by sound stimuli are used to elicit emotional states.
Maria Velana, Sascha Gruss, Georg Layher, Patrick Thiam, Yan Zhang, Daniel Schork, Viktor Kessler, Sascha Meudt, Heiko Neumann, Jonghwa Kim, Friedhelm Schwenker, Elisabeth André, Harald C. Traue, Steffen Walter
Photometric Stereo for 3D Face Reconstruction Using Non Linear Illumination Models
Abstract
Face recognition in the presence of illumination changes, pose variation, and different facial expressions is a challenging problem. In this paper, a method for 3D face reconstruction using photometric stereo, without knowledge of the illumination directions or the facial expression, is proposed in order to improve face recognition. A dimensionality reduction method is introduced to represent the face deformations due to illumination variations and self-shadows in a lower-dimensional space. The obtained mapping function is used to determine the illumination direction of each input image, and that direction is then used to apply photometric stereo. Experiments with faces were performed to evaluate the performance of the proposed scheme. The experiments show that the proposed approach yields very accurate 3D surfaces without knowledge of the light directions, with very small differences compared to the case of known directions. As a result, the proposed approach is more general, imposes fewer restrictions, and enables 3D face recognition methods to operate with less data.
Barbara Villarini, Athanasios Gkelias, Vasilios Argyriou
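For reference, classical Lambertian photometric stereo recovers normals by per-pixel least squares once the light directions are known; the paper's contribution is estimating those directions (and handling non-linear effects) rather than assuming them:

```python
# Classical photometric stereo: with k >= 3 images I and known unit light
# directions L, per-pixel I = L @ (albedo * normal), solved in one least
# squares step over all pixels.
import numpy as np

def photometric_stereo(images, lights):
    """images: (k, h, w) intensities; lights: (k, 3) unit light directions."""
    k, h, w = images.shape
    I = images.reshape(k, -1)                       # one column per pixel
    G, *_ = np.linalg.lstsq(lights, I, rcond=None)  # (3, h*w): albedo*normal
    albedo = np.linalg.norm(G, axis=0)
    normals = G / np.maximum(albedo, 1e-8)          # unit surface normals
    return normals.reshape(3, h, w), albedo.reshape(h, w)
```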
Recursively Measured Action Units
Abstract
Video is a recursively measured signal whose frames are highly correlated, with structured sparsity and low-rankness. A simple example is facial expression: multiple measurements of a face. Several salient facial Action Units (AUs) are often enough for correct expression recognition. Ideally, AUs are not stored while the face remains neutral, only once they become salient as an expression occurs, while the recognizer remains able to restore historic salient AUs. Such a temporal memory mechanism is appealing for a real-time system, as it reduces the rich redundancy in information coding. We formulate expression recognition as video Sparse Representation based Classification (SRC) with a Long Short-Term Memory (LSTM) mechanism, which is also applicable to human actions yet requires a careful design of the sparse representation due to possibly changing scenes. Preliminary experiments are conducted on the MPI Face Video Database (MPI-VDB). We compare the proposed sparse coding with temporal modeling using LSTM against a baseline of sparse coding with simultaneous recursive matching pursuit (SRMP).
Xiang Xiang, Trac D. Tran
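Basic SRC, on which the paper builds, codes a test sample over the training dictionary and assigns the class whose atoms reconstruct it best; the LSTM-based temporal extension is not shown:

```python
# Sparse Representation based Classification (SRC): sparse-code a test
# sample over the training dictionary, then classify by the per-class
# reconstruction residual.
import numpy as np
from sklearn.linear_model import orthogonal_mp

def src_classify(D, labels, x, n_nonzero=10):
    """D: (d, n) column-normalized training dictionary; labels: (n,);
    x: (d,) test sample."""
    coef = orthogonal_mp(D, x, n_nonzero_coefs=n_nonzero)
    residuals = {}
    for c in np.unique(labels):
        coef_c = np.where(labels == c, coef, 0.0)  # keep class-c coefficients
        residuals[c] = np.linalg.norm(x - D @ coef_c)
    return min(residuals, key=residuals.get)       # smallest residual wins
```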
Backmatter
Metadata
Title
Multimodal Pattern Recognition of Social Signals in Human-Computer-Interaction
Edited by
Friedhelm Schwenker
Stefan Scherer
Copyright Year
2017
Electronic ISBN
978-3-319-59259-6
Print ISBN
978-3-319-59258-9
DOI
https://doi.org/10.1007/978-3-319-59259-6