
About this Book

This book constitutes the thoroughly refereed post-workshop proceedings of the Third IAPR TC3 Workshop on Pattern Recognition of Social Signals in Human-Computer-Interaction, MPRSS 2014, held in Stockholm, Sweden, in August 2014, as a satellite event of the International Conference on Pattern Recognition, ICPR 2014. The 14 revised papers presented focus on pattern recognition, machine learning and information fusion methods with applications in social signal processing, including multimodal emotion recognition, user identification, and recognition of human activities.





Automatic Image Collection of Objects with Similar Function by Learning Human Grasping Forms

This paper proposes an automatic functional object segmentation method based on modeling the relationship between the grasping hand form and the object appearance. First, the relationship between a representative grasping pattern and the position and pose of an object relative to the hand is learned from a few typical functional objects. By learning local features from hands grasping various tools in various ways, the proposed method can estimate the position, scale, and direction of the hand as well as the region of the grasped object. Experiments demonstrate that the proposed method can detect them even in cluttered backgrounds.
Shinya Morioka, Tadashi Matsuo, Yasuhiro Hiramoto, Nobutaka Shimada, Yoshiaki Shirai

Client Specific Image Gradient Orientation for Unimodal and Multimodal Face Representation

Multimodal face recognition systems usually provide better recognition performance than systems based on a single modality. To exploit this advantage, in this paper an image fusion method which integrates region segmentation and a pulse coupled neural network (PCNN) is used to obtain fused images from visible (VIS) and infrared (IR) images. Then, client specific image gradient orientation (CS-IGO) is proposed, inspired by the successful application of the client specific technique and the image gradient orientations technique. As most traditional appearance-based subspace learning algorithms are not robust to illumination changes, we adopt the image gradient orientations method to remedy this problem to some extent. Moreover, to better describe the discrepancies between different classes, the client specific technique is introduced to derive one-dimensional Fisher faces, one per client. Thus CS-IGO-LDA and improved CS-IGO-LDA are proposed in this paper, which combine the merits of IGO and the client specific technique. Experimental results obtained on publicly available databases indicate the effectiveness of the proposed methods for unimodal and multimodal face recognition.
He-Feng Yin, Xiao-Jun Wu, Xiao-Qi Sun
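The illumination robustness that the abstract attributes to image gradient orientations can be illustrated with a minimal sketch; the toy image, the NumPy gradient operator, and the cos/sin encoding of the angles are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def igo_features(image):
    """Image Gradient Orientations (IGO): encode each pixel's gradient
    angle theta as the unit vector (cos theta, sin theta), which is more
    robust to illumination changes than raw intensities."""
    gy, gx = np.gradient(image.astype(float))
    theta = np.arctan2(gy, gx)
    # Stack cosine and sine components into one feature vector
    return np.concatenate([np.cos(theta).ravel(), np.sin(theta).ravel()])

# A uniform brightness shift leaves the gradients, and hence the
# orientation features, unchanged
img = np.random.default_rng(0).random((8, 8))
f1 = igo_features(img)
f2 = igo_features(img + 0.5)
print(np.allclose(f1, f2))  # True
```

A multiplicative contrast change scales both gradient components equally, so the orientations survive that as well.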

Multiple-manifolds Discriminant Analysis for Facial Expression Recognition from Local Patches Set

In this paper, a novel framework named multiple-manifolds discriminant analysis (MMDA) is proposed for feature extraction and classification in facial expression recognition. It assumes that samples of different expressions reside on different manifolds, and thereby learns multiple projection matrices from the training set. In particular, MMDA first extracts five local patches, covering the left and right eyes, the mouth, and the left and right cheeks, from each training sample to form a new training set, and then learns a projection matrix for each expression that maximizes the manifold margins among different expressions and minimizes the manifold distances within the same expression. A key feature of MMDA is that it extracts expression-specific rather than subject-specific discriminative information for classification, leading to robust performance in practical applications. Our experiments on the Cohn-Kanade and JAFFE databases demonstrate that MMDA can effectively enhance the discriminant power of the extracted expression features.
Ning Zheng, Lin Qi, Ling Guan

Monte Carlo Based Importance Estimation of Localized Feature Descriptors for the Recognition of Facial Expressions

The automated and exact identification of facial expressions in human computer interaction scenarios is a challenging but necessary task to recognize human emotions by a machine learning system. The human face consists of regions whose elements contribute to single expressions in a different manner. This work aims to shed light onto the importance of specific facial regions to provide information which can be used to discriminate between different facial expressions from a statistical pattern recognition perspective. A sampling based classification approach is used to reveal informative locations in the face. The results are expression-sensitive importance maps that indicate regions of high discriminative power which can be used for various applications.
Markus Kächele, Günther Palm, Friedhelm Schwenker
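The sampling-based importance estimation described above can be sketched on synthetic data. The two-class toy "faces", the 4×4 patches, and the between-class-mean separation score are all assumptions standing in for the paper's classifier-driven estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "faces": two expression classes that differ only in the
# mouth region (rows 12-15, columns 4-11); everything else is noise
def sample_face(label):
    img = rng.normal(size=(16, 16))
    if label == 1:
        img[12:16, 4:12] += 3.0  # expression-specific difference
    return img

labels = np.array([0, 1] * 50)
faces = [sample_face(c) for c in labels]

# Monte Carlo estimate: draw random 4x4 patch locations, score how well
# the patch mean separates the two classes, accumulate into a map
importance = np.zeros((16, 16))
counts = np.zeros((16, 16))
for _ in range(2000):
    r, c = rng.integers(0, 13, size=2)  # top-left corner of the patch
    vals = np.array([f[r:r + 4, c:c + 4].mean() for f in faces])
    score = abs(vals[labels == 1].mean() - vals[labels == 0].mean())
    importance[r:r + 4, c:c + 4] += score
    counts[r:r + 4, c:c + 4] += 1
importance /= np.maximum(counts, 1)

# The discriminative mouth region should dominate the importance map
mouth = importance[12:, 4:12].mean()
elsewhere = importance[:8, :].mean()
```

The resulting map plays the role of the paper's expression-sensitive importance maps: regions whose patches separate the classes well accumulate high scores.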

Noisy Speech Recognition Based on Combined Audio-Visual Classifiers

An isolated word speech recognition system based on audio-visual features is proposed in this paper. To enhance the recognition over different noisy conditions, this system combines three classifiers based on audio, visual and audio-visual information, respectively. The performance of the proposed recognition system is evaluated over two isolated word audio-visual databases, a public one and a database compiled by the authors of this paper. Experimental results show that the structure of the proposed system leads to significant improvements of the recognition rates through a wide range of signal-to-noise ratios.
Lucas D. Terissi, Gonzalo D. Sad, Juan C. Gómez, Marianela Parodi

Complementary Gaussian Mixture Models for Multimodal Speech Recognition

In speech recognition systems, each word/phoneme in the vocabulary is typically represented by a model trained with samples of that particular class. Recognition is then performed by computing which model best represents the input word/phoneme to be classified. In this paper, a novel classification strategy based on complementary class models is presented. A complementary model for a particular class \(j\) refers to a model trained with instances of all the considered classes except those associated with class \(j\). This work describes new multi-classifier schemes for isolated word speech recognition based on the combination of standard Hidden Markov Models (HMMs) and Complementary Gaussian Mixture Models (CGMMs). In particular, two different conditions are considered. If the data is represented by single feature vectors, a cascade classification scheme using HMMs and CGMMs is proposed. On the other hand, when the data is represented by multiple feature vectors, a classification scheme based on a voting strategy that combines scores from individual HMMs and CGMMs is proposed. The proposed classification schemes are evaluated over two audio-visual speech databases under acoustically noisy conditions. Experimental results show that the proposed methodologies achieve improvements in the recognition rates over a wide range of signal-to-noise ratios.
Gonzalo D. Sad, Lucas D. Terissi, Juan C. Gómez
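The complementary-model idea can be sketched with scikit-learn Gaussian mixtures on toy 2-D data. The class layout and component count are assumptions, and the paper combines CGMMs with HMMs rather than using them in isolation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Toy data: three "word" classes as 2-D Gaussian clusters (assumed data)
X = {c: rng.normal(loc=c * 3.0, scale=0.7, size=(100, 2)) for c in range(3)}

def complementary_gmms(data, n_components=2):
    """For each class j, fit a GMM on samples of all classes EXCEPT j."""
    models = {}
    for j in data:
        rest = np.vstack([x for c, x in data.items() if c != j])
        models[j] = GaussianMixture(n_components, random_state=0).fit(rest)
    return models

models = complementary_gmms(X)

def classify(sample, models):
    # The true class's complementary model should assign the LOWEST
    # likelihood, since it was never trained on samples of that class
    scores = {j: m.score_samples(sample[None, :])[0] for j, m in models.items()}
    return min(scores, key=scores.get)

pred = classify(np.array([0.0, 0.0]), models)  # near class 0's cluster
```

Note the inverted decision rule: with complementary models the minimum score wins, the mirror image of the usual maximum-likelihood rule mentioned at the start of the abstract.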

Fusion of Text and Audio Semantic Representations Through CCA

Humans are natural multimedia processing machines. Multimedia is a domain of multiple modalities, including audio, text and images. A central aspect of multimedia processing is the coherent integration of media from different modalities into a single entity. Multimodal information fusion architectures become a necessity when not all information channels are available at all times. In this paper, we introduce a multimodal fusion of audio signals and lyrics in a shared semantic space through canonical correlation analysis. We propose an audio retrieval system based on extended semantic analysis of audio signals and combine this model with a tf-idf representation of lyrics to obtain a multimodal retrieval system. We use canonical correlation analysis and supervised learning methods as a basis for relating audio and lyrics information. Our experimental evaluation indicates that the proposed model outperforms prior approaches based on simple canonical correlation methods. Finally, the efficiency of the proposed method allows it to handle large music and lyrics collections, enabling users to explore relevant lyrics information for music datasets.
Kamelia Aryafar, Ali Shokoufandeh



uulmMAD – A Human Action Recognition Dataset for Ground-Truth Evaluation and Investigation of View Invariances

In recent years, human action recognition has gained increasing attention in pattern recognition. However, many datasets in the literature focus on a limited number of target-oriented properties. In this work, we present a novel dataset, named uulmMAD, created to benchmark state-of-the-art action recognition architectures with respect to multiple properties, e.g. high-resolution cameras, perspective changes, realistic cluttered background and noise, overlap of action classes, different execution speeds, variability in subjects and their clothing, and the availability of a pose ground-truth. The uulmMAD was recorded using three synchronized high-resolution cameras and an inertial motion capturing system. Each subject performed fourteen actions at least three times in front of a green screen. Selected actions were recorded in four variants, i.e. normal, pausing, fast and decelerating. The data has been post-processed to separate the subject from the background. Furthermore, the camera and motion capturing data have been mapped onto each other, and 3D avatars have been generated to further extend the dataset. The avatars have also been used to emulate self-occlusion in pose recognition when using a time-of-flight camera. In this work, we analyze the uulmMAD using a state-of-the-art action recognition architecture to provide first baseline results. The results emphasize the unique characteristics of the dataset. The dataset will be made publicly available upon publication of the paper.
Michael Glodek, Georg Layher, Felix Heilemann, Florian Gawrilowicz, Günther Palm, Friedhelm Schwenker, Heiko Neumann

A Real Time Gesture Recognition System for Human Computer Interaction

Human gestures have been recognized in the literature as a natural and intuitive means of interacting with computers across many application domains. In this paper we propose a real-time gesture recognition approach which uses a depth sensor to extract the initial human skeleton. Robust and significant features are then compared, and the most uncorrelated and representative ones are selected and fed to a set of supervised classifiers trained to recognize different gestures. Problems concerning gesture initialization, segmentation, and normalization are also addressed. Several experiments demonstrate that the proposed approach works effectively in real-time applications.
Carmela Attolico, Grazia Cicirelli, Cataldo Guaragnella, Tiziana D’Orazio

A SIFT-Based Feature Level Fusion of Iris and Ear Biometrics

To overcome the drawbacks encountered in unimodal biometric systems for person authentication, multimodal biometric methods are needed. This paper presents an efficient feature-level fusion of iris and ear images using SIFT descriptors, which extract the iris and ear features separately. These features are then combined into a single feature vector called the fused template. The generated template is enrolled in the database; the matching between the SIFT features of the input iris and ear images and the enrolled template of the claiming user is then computed using the Euclidean distance. The proposed method has been applied to a synthetic multimodal biometric database produced from the CASIA and USTB 2 databases, which provide the iris and ear image sets respectively. To evaluate the performance of the proposed method, we compute the false rejection rate (FRR), the false acceptance rate (FAR) and the accuracy. The obtained results show that fusion at the feature level outperforms iris and ear authentication systems taken separately.
Lamis Ghoualmi, Salim Chikhi, Amer Draa
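Feature-level fusion by stacking descriptor sets and matching with Euclidean distance can be sketched as follows; random 128-D vectors stand in for real SIFT descriptors, and the distance threshold is an assumption:

```python
import numpy as np

def fuse_templates(iris_desc, ear_desc):
    """Feature-level fusion: stack the descriptors extracted separately
    from the iris and ear images into one fused template."""
    return np.vstack([iris_desc, ear_desc])

def match_score(probe, template, thresh=0.5):
    """Count probe descriptors whose nearest template descriptor lies
    within a Euclidean distance threshold."""
    d = np.linalg.norm(probe[:, None, :] - template[None, :, :], axis=2)
    return int(np.sum(d.min(axis=1) < thresh))

rng = np.random.default_rng(1)
iris = rng.random((5, 128))  # 128-D, like SIFT descriptors
ear = rng.random((4, 128))
enrolled = fuse_templates(iris, ear)

genuine = match_score(fuse_templates(iris, ear), enrolled)  # same user
impostor = match_score(rng.random((9, 128)), enrolled)      # random user
```

Thresholding the genuine/impostor match counts is what drives the FRR and FAR figures the abstract reports.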

Audio-Visual User Identification in HCI Scenarios

Modern computing systems are usually equipped with various input devices such as microphones or cameras, and hence the user of such a system can easily be identified. User identification is important in many human computer interaction (HCI) scenarios, such as speech recognition, activity recognition, transcription of meeting room data or affective computing. Here personalized models may significantly improve the performance of the overall recognition system. This paper deals with audio-visual user identification. The main processing steps are segmentation of the relevant parts from video and audio streams, extraction of meaningful features and construction of the overall classifier and fusion architectures. The proposed system has been evaluated on the MOBIO dataset, a benchmark database consisting of real-world recordings collected from mobile devices, e.g. cell-phones. Recognition rates of up to 92 % could be achieved for the proposed audio-visual classifier system.
Markus Kächele, Sascha Meudt, Andrej Schwarz, Friedhelm Schwenker

Towards an Adaptive Brain-Computer Interface – An Error Potential Approach

In this paper a new adaptive Brain-Computer Interface (BCI) architecture is proposed that autonomously adapts the BCI parameters in malfunctioning situations. Such situations are detected by discriminating EEG error potentials; when necessary, the BCI is switched back to the training stage in order to improve its performance. First, the modules of the adaptive BCI are presented, then the scenarios for identifying the user's reaction to intentionally introduced errors are discussed, and finally promising preliminary results are reported. The proposed concept has the potential to increase the reliability of BCI systems.
Nuno Figueiredo, Filipe Silva, Petia Georgieva, Mariofanna Milanova, Engin Mendi

Online Smart Face Morphing Engine with Prior Constraints and Local Geometry Preservation

We present an online system for automatic smart face morphing, which can be accessed at http://facewarping.com. This system morphs the user-uploaded image to “beautify” it by embedding features from a user-selected celebrity. This system consists of three major modules: facial feature point detection, geometry embedding, and image warping. To embed the features of a celebrity face and at the same time preserve the features of the original face, we formulate an optimization problem which we call prior multidimensional scaling (prior MDS). We propose an iterated Levenberg-Marquardt algorithm (ILMA) to efficiently solve prior MDS in the general case. This online system allows the user to configure the morphing parameters, and has been tested under different conditions.
Quan Wang, Yu Wang, Zuoguan Wang

Exploring Alternate Modalities for Tag Recommendation

Etsy is an online marketplace (www.etsy.com) for users selling unique handcrafted goods and vintage wares. Etsy’s sellers currently maintain a total of 25 million active listings in the marketplace. Usability research has identified the listing creation process as a pain point for sellers. Adding tags to a listing can be a particularly challenging task for new sellers who are unfamiliar with the tagging taxonomy. Automatically generated tags can reduce confusion and minimize the effort required to list new items for sale in an online marketplace. In this paper we explore listing images, in the absence of the text modality, for automatic tag suggestion. Coupled with the ability for users to manually add tags, the proposed model can ease the burden of tagging and increase the utility of retrieval systems built on top of tagging data.
Kamelia Aryafar, Jerry Soung

