
2019 | Book

Multimodal Pattern Recognition of Social Signals in Human-Computer-Interaction

5th IAPR TC 9 Workshop, MPRSS 2018, Beijing, China, August 20, 2018, Revised Selected Papers


About this book

This book constitutes the refereed post-workshop proceedings of the 5th IAPR TC9 Workshop on Pattern Recognition of Social Signals in Human-Computer-Interaction, MPRSS 2018, held in Beijing, China, in August 2018.
The 10 revised papers presented in this book focus on pattern recognition, machine learning, and information fusion methods with applications in social signal processing, including multimodal emotion recognition and pain intensity estimation; particular attention is paid to the question of how to distinguish human emotions from pain or stress induced by pain.

Table of Contents

Frontmatter
Multi-focus Image Fusion with PCA Filters of PCANet
Abstract
The training of deep learning models is well known to be time consuming and complex. Therefore, in this paper, a very simple deep learning model called PCANet is used to extract image features from multi-focus images. First, we train the two-stage PCANet on ImageNet to obtain PCA filters, which are then used to extract image features. Using the feature maps of the first stage of PCANet, we generate activity level maps of the source images by applying the nuclear norm. Then, the decision map is obtained through a series of post-processing operations on the activity level maps. Finally, the fused image is obtained by applying a weighted fusion rule. The experimental results demonstrate that the proposed method achieves state-of-the-art fusion performance in terms of both objective assessment and visual quality.
Xu Song, Xiao-Jun Wu
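To make the activity-map and fusion steps described in the abstract above more concrete, here is a minimal NumPy sketch. It assumes the first-stage PCANet feature maps have already been computed (the PCA filters trained on ImageNet are not reproduced here), uses a hypothetical block size, and omits the post-processing the paper applies to the decision map.

```python
import numpy as np

def activity_map(feature_maps, win=8):
    """Nuclear-norm activity map from a stack of feature maps.

    feature_maps: array of shape (C, H, W), e.g. first-stage PCANet responses
    (assumed precomputed here). Border pixels not covered by a full block are
    left at zero in this simplified sketch.
    """
    C, H, W = feature_maps.shape
    act = np.zeros((H, W))
    for i in range(0, H - win + 1, win):
        for j in range(0, W - win + 1, win):
            # Stack the local window of every feature map into one matrix.
            block = feature_maps[:, i:i + win, j:j + win].reshape(C, -1)
            # Nuclear norm = sum of singular values of that matrix.
            act[i:i + win, j:j + win] = np.linalg.norm(block, ord='nuc')
    return act

def fuse(img_a, img_b, act_a, act_b):
    """Weighted fusion driven by a binary decision map (no post-processing here)."""
    decision = (act_a >= act_b).astype(float)
    return decision * img_a + (1.0 - decision) * img_b
```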
An Image Captioning Method for Infant Sleeping Environment Diagnosis
Abstract
This paper presents a new image captioning method, which generates a textual description of an image. We apply our method to infant sleeping environment analysis and diagnosis, describing an image in terms of the infant's sleeping position, sleeping surface, and bedding condition, which involves recognition and representation of body pose, activity, and the surrounding environment. In this challenging setting, visual attention, an essential part of human visual perception, is employed to efficiently process the visual input. Texture analysis is used to give a precise diagnosis of the sleeping surface. The encoder-decoder model was trained on the Microsoft COCO dataset combined with our own annotated dataset containing relevant information. The results show that the method is able to generate a description of the image, point out potential risk factors in the image, and then give corresponding advice based on the generated caption. This demonstrates its ability to assist humans in infant care-giving and its potential in other human assistive systems.
Xinyi Liu, Mariofanna Milanova
A First-Person Vision Dataset of Office Activities
Abstract
We present a multi-subject first-person vision dataset of office activities. The dataset contains the highest number of subjects and activities compared to existing office activity datasets. Office activities include person-to-person interactions, such as chatting and handshaking, person-to-object interactions, such as using a computer or a whiteboard, as well as generic activities such as walking. The videos in the dataset present a number of challenges that, in addition to intra-class differences and inter-class similarities, include frames with illumination changes, motion blur, and lack of texture. Moreover, we present and discuss state-of-the-art features extracted from the dataset and baseline activity recognition results with a number of existing methods. The dataset is provided along with its annotation and the extracted features.
Girmaw Abebe, Andreu Catala, Andrea Cavallaro
Perceptual Judgments to Detect Computer Generated Forged Faces in Social Media
Abstract
There has been an increasing interest in developing methods for image representation learning, focused in particular on training deep neural networks to synthesize images. Generative adversarial networks (GANs) are used to apply face aging, to generate new viewpoints, or to alter face attributes like skin color. For forensics specifically on faces, some methods have been proposed to distinguish computer generated faces from natural ones and to detect face retouching. We propose to investigate techniques based on perceptual judgments to detect image/video manipulation produced by deep learning architectures. The main objectives of this study are: (1) to develop a technique to distinguish between computer generated and photographic faces based on facial expression analysis; (2) to develop an entropy-based technique for forgery detection in computer generated (CG) human faces. The results show differences between emotions in the original and altered videos; these differences are large and statistically significant. The results also show that the entropy value for the altered videos is reduced compared with that of the original videos. Histograms of original frames have a heavy-tailed distribution, whereas histograms of altered frames are sharper due to the small values of the images' vertical and horizontal edges.
Suzan Anwar, Mariofanna Milanova, Mardin Anwer, Anderson Banihirwe
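The entropy-based objective (2) in the abstract above can be illustrated with a short sketch. The snippet below computes the Shannon entropy of the histogram of vertical/horizontal edge magnitudes for each frame of a video; it is only an assumption of how such a measure could be implemented with OpenCV and NumPy, not the authors' exact procedure, and the file path and bin count are placeholders.

```python
import cv2
import numpy as np

def frame_edge_entropy(gray, bins=256):
    """Shannon entropy of the histogram of horizontal/vertical edge magnitudes."""
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag = cv2.magnitude(gx, gy)
    hist, _ = np.histogram(mag, bins=bins, density=True)
    p = hist[hist > 0]
    p = p / p.sum()
    return float(-(p * np.log2(p)).sum())

def video_entropy(path):
    """Mean per-frame edge entropy; per the abstract above, lower values would be
    expected for altered videos with sharper, more concentrated edge histograms."""
    cap, values = cv2.VideoCapture(path), []
    ok, frame = cap.read()
    while ok:
        values.append(frame_edge_entropy(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)))
        ok, frame = cap.read()
    return float(np.mean(values))
```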
Combining Deep and Hand-Crafted Features for Audio-Based Pain Intensity Classification
Abstract
In this work, the classification of pain intensity based on recorded breathing sounds is addressed. A classification approach is proposed and assessed, based on hand-crafted features and spectrograms extracted from the audio recordings. The goal is to use a combination of feature learning (based on deep neural networks) and feature engineering (based on expert knowledge) in order to improve the performance of the classification system. The assessment is performed on the SenseEmotion Database and the experimental results point to the relevance of such a classification approach.
Patrick Thiam, Friedhelm Schwenker
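As a rough illustration of combining feature engineering with feature learning as described above, the sketch below extracts a log-mel spectrogram (the kind of input a deep network would consume) and a few hand-crafted descriptors with librosa, then concatenates a learned embedding with the engineered features for a conventional classifier. The specific descriptors, the deep embedding, and the SenseEmotion data handling are assumptions, not the paper's actual pipeline.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def hand_crafted_features(y, sr):
    """Simple engineered descriptors from a breathing-sound recording
    (illustrative only; the paper's actual feature set is not reproduced)."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    zcr = librosa.feature.zero_crossing_rate(y)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                           [zcr.mean()], [zcr.std()]])

def spectrogram(y, sr):
    """Log-mel spectrogram, suitable as input for a CNN-style feature learner."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    return librosa.power_to_db(S, ref=np.max)

def combine(deep_embedding, y, sr):
    """Fuse learned and engineered features for a downstream classifier."""
    return np.concatenate([deep_embedding, hand_crafted_features(y, sr)])

# e.g. clf = SVC().fit(np.stack(fused_train), labels), with fused_train built
# by calling combine() on each recording's embedding and raw samples.
```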
Deep Learning Algorithms for Emotion Recognition on Low Power Single Board Computers
Abstract
In the world of Human-Computer Interaction, a computer should have the ability to communicate with humans. One of the communication skills that a computer requires is recognizing the emotional state of the human. With state-of-the-art computing systems and Graphics Processing Units, a Deep Neural Network can be trained on any publicly available dataset so that the whole emotion estimation task is learned by a single network. In a real-time application, however, the inference of such a network may not need as much computational power as training does.
Several Single Board Computers (SBCs), such as the Raspberry Pi, are now available with sufficient computational power for inference, where small Deep Neural Network models can perform well enough with acceptable accuracy and processing delay. This paper explores SBC capabilities for DNN inference: we prepare a target platform on which real-time camera sensor data is processed to detect face regions and subsequently recognize emotions. Several DNN architectures are evaluated on the SBC with respect to processing delay, achievable frame rates, and classification accuracy. Finally, a Neural Compute Stick (NCS), such as Intel's Movidius, is used to examine the performance of the SBC for emotion classification.
Venkatesh Srinivasan, Sascha Meudt, Friedhelm Schwenker
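A minimal sketch of the kind of on-device inference loop described above, using OpenCV's DNN module on a Raspberry Pi-class SBC. The model file names are hypothetical, the Haar cascade stands in for whatever face detector the paper uses, and the Myriad target assumes a Neural Compute Stick is attached and that OpenCV was built with Inference Engine support; switch to the CPU target otherwise.

```python
import cv2

# Hypothetical emotion-classification model exported to an OpenCV-readable format.
net = cv2.dnn.readNet("emotion_net.xml", "emotion_net.bin")
# Offload inference to a Neural Compute Stick (Movidius Myriad); use
# cv2.dnn.DNN_TARGET_CPU instead if no NCS is attached.
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)               # real-time camera sensor data
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_detector.detectMultiScale(gray, 1.3, 5):
        face = cv2.resize(gray[y:y + h, x:x + w], (64, 64))
        blob = cv2.dnn.blobFromImage(face, scalefactor=1.0 / 255)
        net.setInput(blob)
        scores = net.forward()          # one score per emotion class
        print(scores.argmax())
```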
Improving Audio-Visual Speech Recognition Using Gabor Recurrent Neural Networks
Abstract
The performance of speech recognition systems can be significantly improved when visual information is used in conjunction with the audio signal, especially in noisy environments. Prompted by the great achievements of deep learning in solving Audio-Visual Speech Recognition (AVSR) problems, we propose a deep AVSR model based on a Long Short-Term Memory Bidirectional Recurrent Neural Network (LSTM-BRNN). The proposed deep AVSR model utilizes Gabor filters in both the audio and visual front-ends with an Early Integration (EI) scheme. This model is termed the BRNN\(_{av}\) model. The Gabor features simulate the underlying spatiotemporal processing chain that occurs in the Primary Auditory Cortex (PAC) in conjunction with the Primary Visual Cortex (PVC); we call these Gabor Audio Features (GAF) and Gabor Visual Features (GVF). The experimental results show that the deep Gabor (LSTM-BRNN)-based model achieves superior performance compared to (GMM-HMM)-based models which utilize the same front-ends. Furthermore, the use of GAF and GVF in the audio and visual front-ends attains a significant improvement in performance compared to traditional audio and visual features.
Ali S. Saudi, Mahmoud I. Khalil, Hazem M. Abbas
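The Gabor front-end idea can be sketched with OpenCV's built-in Gabor kernels: applied to a log-mel spectrogram the responses play the role of GAF-like features, applied to a mouth region of interest they play the role of GVF-like features. The filter-bank parameters below are illustrative assumptions; the paper's actual spatiotemporal Gabor configuration and the LSTM-BRNN back-end are not reproduced.

```python
import cv2
import numpy as np

def gabor_bank(n_orient=4, n_scale=3, ksize=15):
    """Small 2-D Gabor filter bank over a few orientations and scales
    (parameter values are illustrative, not those used in the paper)."""
    bank = []
    for s in range(n_scale):
        for o in range(n_orient):
            theta = np.pi * o / n_orient
            lambd = 4.0 * (s + 1)
            kern = cv2.getGaborKernel((ksize, ksize), sigma=0.56 * lambd,
                                      theta=theta, lambd=lambd, gamma=0.5, psi=0)
            bank.append(kern)
    return bank

def gabor_features(image, bank):
    """Mean absolute filter response per kernel: a compact feature vector per
    frame, computed from a spectrogram patch (audio) or a mouth ROI (visual)."""
    img = image.astype(np.float32)
    return np.array([np.abs(cv2.filter2D(img, cv2.CV_32F, k)).mean() for k in bank])
```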
Evolutionary Algorithms for the Design of Neural Network Classifiers for the Classification of Pain Intensity
Abstract
In this paper we present a study on multi-modal pain intensity recognition based on video and bio-physiological sensor data. The newly recorded SenseEmotion dataset, consisting of 40 individuals, each subjected to three gradually increasing levels of painful heat stimuli, has been used for the evaluation of the proposed algorithms. We propose and evaluate evolutionary algorithms for the design and adaptation of the structure of deep artificial neural network architectures. Feedforward and recurrent neural networks have been considered for optimisation using a Self-Configuring Genetic Algorithm (SelfCGA) and Self-Configuring Genetic Programming (SelfCGP).
Danila Mamontov, Iana Polonskaia, Alina Skorokhod, Eugene Semenkin, Viktor Kessler, Friedhelm Schwenker
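To show the flavour of evolving network structure, here is a deliberately simple genetic algorithm that searches over the hidden-layer sizes of a feedforward network using scikit-learn. It is a generic GA sketch, not SelfCGA or SelfCGP, and the population size, mutation rate, and fitness measure are all assumed values.

```python
import random
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def fitness(genome, X, y):
    """Cross-validated accuracy of a feedforward net whose hidden-layer sizes
    are encoded by the genome (a tuple of ints)."""
    clf = MLPClassifier(hidden_layer_sizes=genome, max_iter=300)
    return cross_val_score(clf, X, y, cv=3).mean()

def evolve(X, y, pop_size=10, generations=5):
    # Random initial population: 1-3 hidden layers of 8-128 units each.
    population = [tuple(random.randint(8, 128) for _ in range(random.randint(1, 3)))
                  for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=lambda g: fitness(g, X, y), reverse=True)
        parents = scored[: pop_size // 2]           # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            # Uniform crossover on the layers both parents share.
            child = tuple(random.choice(pair) for pair in zip(a, b))
            if random.random() < 0.3:               # mutate the last layer size
                child = child[:-1] + (random.randint(8, 128),)
            children.append(child)
        population = parents + children
    return max(population, key=lambda g: fitness(g, X, y))
```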
Visualizing Facial Expression Features of Pain and Emotion Data
Abstract
Pain and emotions reveal important information about the state of a person and are often expressed via the face. Most of the time, systems which analyse these states consider only one type of expression. For pain, the medical context is a common scenario for automatic monitoring systems and it is not unlikely that emotions occur there as well. Hence, these systems should not confuse both types of expressions. To facilitate advances in this field, we use video data from the BioVid Heat Pain Database, extract Action Unit (AU) intensity features and conduct first analyses by creating several feature visualizations. We show that the AU usage pattern is more distinct for the pain, amusement and disgust classes than for the sadness, fear and anger classes. For the former, we present additional visualizations which reveal a clearer picture of the typically used AUs per expression by highlighting dependencies between AUs (joint usages). Finally, we show that the feature discrimination quality varies heavily across the 64 tested subjects.
Jan Sellner, Patrick Thiam, Friedhelm Schwenker
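One of the simplest visualizations mentioned above, the per-class AU usage pattern, can be reproduced along these lines with matplotlib. The array layout and class names below are assumptions about how the extracted AU intensity features might be stored, not the BioVid data format.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_au_usage(au_intensities, labels, au_names, classes):
    """Grouped bars of mean AU intensity per class, a simple view of AU usage.

    au_intensities: (n_samples, n_aus) array of AU intensity features
    labels:         (n_samples,) array of class names such as 'pain', 'amusement'
    """
    x = np.arange(len(au_names))
    width = 0.8 / len(classes)
    for i, c in enumerate(classes):
        means = au_intensities[labels == c].mean(axis=0)
        plt.bar(x + i * width, means, width, label=c)
    plt.xticks(x + 0.4, au_names, rotation=90)
    plt.ylabel("mean AU intensity")
    plt.legend()
    plt.tight_layout()
    plt.show()
```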
Backmatter
Metadata
Title
Multimodal Pattern Recognition of Social Signals in Human-Computer-Interaction
Editors
Prof. Dr. Friedhelm Schwenker
Stefan Scherer
Copyright Year
2019
Electronic ISBN
978-3-030-20984-1
Print ISBN
978-3-030-20983-4
DOI
https://doi.org/10.1007/978-3-030-20984-1
