
About This Book

It was a great pleasure to organize the First International Workshop on Human Behavior Understanding (HBU), which took place as a satellite workshop to the International Conference on Pattern Recognition (ICPR) on August 22, 2010, in Istanbul, Turkey. This workshop arose from the natural marriage of pattern recognition with the rapidly advancing area of human behavior analysis. Our aim was to gather researchers dealing with the problem of modeling human behavior under its multiple facets (expression of emotions, display of relational attitudes, performance of individual or joint actions, etc.), with particular attention to pattern recognition approaches that involve multiple modalities and those that model the actual dynamics of behavior. The contiguity with ICPR, one of the most important events in the pattern recognition and machine learning communities, is expected to foster cross-pollination with other areas, for example temporal pattern mining or time-series analysis, which share important methodological aspects with human behavior understanding. Furthermore, the presence of this workshop at ICPR was meant to attract researchers, in particular PhD students and postdoctoral researchers, to work on the questions of human behavior understanding, which is likely to play a major role in future technologies (ambient intelligence, human–robot interaction, artificial social intelligence, etc.), as witnessed by a number of research efforts aimed at collecting and annotating large sets of multi-sensor data, collected from observing people in natural and often technologically challenging conditions.



Challenges of Human Behavior Understanding

Recent advances in pattern recognition have allowed computer scientists and psychologists to jointly address automatic analysis of human behavior via computers. The Workshop on Human Behavior Understanding at the International Conference on Pattern Recognition explores a number of different aspects and open questions in this field, and demonstrates the multi-disciplinary nature of this research area. In this brief summary, we give an overview of the Workshop and discuss the main research challenges.
Albert Ali Salah, Theo Gevers, Nicu Sebe, Alessandro Vinciarelli

Analysis of Human Activities

Understanding Macroscopic Human Behavior

The Web has changed the way we live, work, and socialize. Web-thinking has been influencing how we understand, design, and solve important societal problems and build complex systems. For centuries, emergence has been considered an essential property underlying the way complex systems and patterns emerge out of relatively simple interactions among different components. The Web has compellingly demonstrated results of emergence in understanding human behavior not at an individual level but at different macro levels ranging from social networks to global levels. Recent rapid advances in sensor technology, Web 2.0, Mobile devices, and Web technologies have opened further opportunities to understand macroscopic human behavior. In this talk, we will discuss our approach to build a framework for studying macroscopic human behavior based on micro-events including Tweets and other participatory sensing approaches.
Ramesh Jain

Activity-Aware Map: Identifying Human Daily Activity Pattern Using Mobile Phone Data

Being able to understand the dynamics of human mobility is essential for urban planning and transportation management. Besides geographic space, in this paper, we characterize mobility in a profile-based space (activity-aware map) that describes the most probable activity associated with a specific area of space. This, in turn, allows us to capture individual daily activity patterns and analyze the correlations among different people’s work area profiles. Based on a large mobile phone dataset of nearly one million records from users in the central Metro-Boston area, we find a strong correlation in daily activity patterns within the group of people who share a common work area profile. In addition, within the group itself, the similarity in activity patterns decreases as their workplaces become farther apart.
Santi Phithakkitnukoon, Teerayut Horanont, Giusy Di Lorenzo, Ryosuke Shibasaki, Carlo Ratti
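The correlation the abstract describes between users' daily activity patterns can be illustrated with a simple similarity measure over activity profile vectors. The sketch below is not the authors' method; the profile slots and counts are invented for illustration.

```python
import math

def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two daily-activity profile vectors."""
    dot = sum(a * b for a, b in zip(profile_a, profile_b))
    norm_a = math.sqrt(sum(a * a for a in profile_a))
    norm_b = math.sqrt(sum(b * b for b in profile_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical 4-slot activity profiles for two users
# (e.g. counts of eating, shopping, leisure, commuting events)
user1 = [5, 1, 0, 2]
user2 = [4, 2, 0, 3]
similarity = cosine_similarity(user1, user2)
```

A value close to 1 would indicate that the two users distribute their daily activities similarly, which is the kind of within-group agreement the paper reports.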

From On-Going to Complete Activity Recognition Exploiting Related Activities

Activity recognition can be seen as a local task aimed at identifying an on-going activity performed at a certain time, or a global one identifying time segments in which a certain activity is being performed. We combine these tasks by a hierarchical approach which locally predicts on-going activities by a Support Vector Machine and globally refines them by a Conditional Random Field focused on time segments involving related activities. By varying temporal scales in order to account for widely different activity durations, we achieve substantial improvements in on-going activity recognition on a realistic dataset from the PlaceLab sensing environment. When focusing on periods within which related activities are known to be performed, the refinement stage manages to exploit these relationships in order to correct inaccurate local predictions.
Carlo Nicolini, Bruno Lepri, Stefano Teso, Andrea Passerini
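The global refinement stage in the paper is a Conditional Random Field; as a much cruder, self-contained stand-in, the following sketch corrects local per-timestep predictions that fall outside a segment's known set of related activities by replacing them with the segment majority. All names and labels here are illustrative, not from the paper.

```python
from collections import Counter

def refine_segment(local_preds, related):
    """Replace local predictions that fall outside a segment's known set
    of related activities with the segment's majority related activity
    (a crude stand-in for CRF-based refinement)."""
    votes = Counter(p for p in local_preds if p in related)
    if not votes:
        return list(local_preds)  # no usable evidence: keep predictions
    majority = votes.most_common(1)[0][0]
    return [p if p in related else majority for p in local_preds]
```

For instance, an isolated "sleeping" prediction inside a segment known to involve only cooking-related activities would be overwritten by the segment majority.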

Human Activity Recognition Using Inertial/Magnetic Sensor Units

This paper provides a comparative study on the different techniques of classifying human activities that are performed using body-worn miniature inertial and magnetic sensors. The classification techniques implemented and compared in this study are: Bayesian decision making (BDM), the least-squares method (LSM), the k-nearest neighbor algorithm (k-NN), dynamic time warping (DTW), support vector machines (SVM), and artificial neural networks (ANN). Daily and sports activities are classified using five sensor units worn by eight subjects on the chest, the arms, and the legs. Each sensor unit comprises a triaxial gyroscope, a triaxial accelerometer, and a triaxial magnetometer. Principal component analysis (PCA) and sequential forward feature selection (SFFS) methods are employed for feature reduction. For a small number of features, SFFS demonstrates better performance and should be preferable especially in real-time applications. The classifiers are validated using different cross-validation techniques. Among the different classifiers we have considered, BDM results in the highest correct classification rate with relatively small computational cost.
Kerem Altun, Billur Barshan
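One of the compared classifiers, k-NN, is simple enough to sketch end to end. The following minimal version classifies a feature vector extracted from the sensor signals by majority label among its k nearest training samples; the feature vectors and labels below are invented for illustration.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify a feature vector by majority vote among its k nearest
    training samples under Euclidean distance.
    `train` is a list of (feature_vector, label) pairs."""
    dists = sorted((math.dist(vec, query), label) for vec, label in train)
    top_labels = [label for _, label in dists[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Toy 2-D features (e.g. mean and variance of accelerometer magnitude)
train = [((0.0, 0.0), "rest"), ((0.0, 1.0), "rest"),
         ((0.5, 0.5), "rest"), ((5.0, 5.0), "run"), ((5.0, 6.0), "run")]
```

In the study the feature vectors would instead come from PCA- or SFFS-reduced inertial/magnetic sensor features.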

Non-verbal Action Dynamics

Face Tracking and Recognition Considering the Camera’s Field of View

We propose a method that tracks and recognizes faces simultaneously. In previous methods, features needed to be extracted twice for tracking and recognizing faces in image sequences because the features used for face recognition are different from those used for face tracking. To reduce the computational cost, we propose a probabilistic model for face tracking and recognition and a system that performs face tracking and recognition simultaneously using the same features. The probabilistic model handles any overlap in the camera’s field of view, something that is ignored in previous methods. The model thus deals with face tracking and recognition using multiple overlapping image sequences. Experimental results show that the proposed method can track and recognize multiple faces simultaneously.
Yuzuko Utsumi, Yoshio Iwai, Hiroshi Ishiguro
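The core idea of combining tracking and recognition in one probabilistic model can be loosely illustrated with a Bayesian identity update: a track carries an identity prior, and each frame's recognition scores act as a likelihood. This is only a sketch of the general principle, not the authors' model; the identities are hypothetical.

```python
def update_identity_posterior(prior, likelihood):
    """Fuse a track's identity prior with a per-frame recognition
    likelihood via Bayes' rule (posterior proportional to
    likelihood times prior)."""
    unnorm = {ident: prior[ident] * likelihood.get(ident, 0.0)
              for ident in prior}
    total = sum(unnorm.values())
    if total == 0:
        return dict(prior)  # uninformative frame: keep the prior
    return {ident: p / total for ident, p in unnorm.items()}

# A track starts undecided between two known identities
posterior = update_identity_posterior({"alice": 0.5, "bob": 0.5},
                                      {"alice": 0.9, "bob": 0.1})
```

Repeating the update over frames lets the same features serve both tracking (via the evolving posterior) and recognition.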

Spatiotemporal-Boosted DCT Features for Head and Face Gesture Analysis

Automatic analysis of head gestures and facial expressions is a challenging research area with significant applications in human-computer interfaces. In this study, facial landmark points are detected and tracked over successive video frames using a robust method based on subspace regularization, Kalman prediction and refinement. The trajectories (time series) of facial landmark positions during the course of the head gesture or facial expression are organized in a spatiotemporal matrix and discriminative features are extracted from the trajectory matrix. Alternatively, appearance-based features are extracted from DCT coefficients of several face patches. Finally, the Adaboost algorithm is applied to learn a set of discriminating spatiotemporal DCT features for face and head gesture (FHG) classification. We report the classification results obtained by using Support Vector Machines (SVM) on the outputs of the features learned by Adaboost. We achieve 94.04% subject-independent classification performance over seven FHG.
Hatice Çınar Akakın, Bülent Sankur
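The appearance features mentioned above are DCT coefficients of face patches. As a self-contained illustration of that step (not the authors' exact feature pipeline), the sketch below computes an unnormalized 2D DCT-II of a small patch and keeps the top-left, low-frequency coefficients as a feature vector.

```python
import math

def dct2(patch):
    """Unnormalized 2D DCT-II of a square image patch (list of rows)."""
    n = len(patch)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            out[u][v] = sum(
                patch[x][y]
                * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                * math.cos(math.pi * (2 * y + 1) * v / (2 * n))
                for x in range(n) for y in range(n)
            )
    return out

def low_freq_features(patch, k=2):
    """Keep the top-left k x k (low-frequency) DCT coefficients,
    which carry most of the patch's appearance energy."""
    coeffs = dct2(patch)
    return [coeffs[u][v] for u in range(k) for v in range(k)]
```

In the paper these coefficients would then be fed to Adaboost for feature selection; a fast DCT would be used in practice rather than this direct O(n^4) form.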

Concensus of Self-features for Nonverbal Behavior Analysis

One of the key challenges in social behavior analysis is to automatically discover the subset of features relevant to a specific social signal (e.g., backchannel feedback). The way these social signals are performed exhibits some variation among different people. In this paper, we present a feature selection approach which first looks at important behaviors for each individual, called self-features, before building a consensus. To enable this approach, we propose a new feature ranking scheme which exploits the sparsity of probabilistic models when trained on human behavior problems. We validated our self-feature consensus approach on the task of listener backchannel prediction and showed improvement over the traditional group-feature approach. Our technique gives researchers a new tool to analyze individual differences in social nonverbal communication.
Derya Ozkan, Louis-Philippe Morency
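The two-stage idea of per-person feature rankings followed by a consensus can be sketched as follows. This is a loose, stdlib-only illustration under the assumption that each person's model yields a weight per feature (e.g. from a sparse, L1-regularized model); the Borda-style aggregation is our simplification, not the paper's exact scheme.

```python
def rank_features(weights):
    """Rank feature indices by absolute weight, largest first, as one
    might with a sparse model's learned coefficients."""
    return sorted(range(len(weights)), key=lambda i: -abs(weights[i]))

def consensus_ranking(per_person_weights):
    """Aggregate each person's self-feature ranking into one consensus
    ranking by summing rank positions (Borda-style: lower is better)."""
    n = len(per_person_weights[0])
    scores = [0] * n
    for weights in per_person_weights:
        for position, feat in enumerate(rank_features(weights)):
            scores[feat] += position
    return sorted(range(n), key=lambda i: scores[i])
```

A feature that ranks highly for most individuals ends up at the front of the consensus ranking, mirroring the self-features-before-consensus strategy.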

Visual Action Recognition

Recognizing Human Action in the Wild

Automatic recognition of human actions is a growing research topic driven by demands from emerging industries including (i) indexing of professional and user-generated video archives, (ii) automatic video surveillance, and (iii) human-computer interaction. Most applications require action recognition to operate reliably in diverse and realistic video settings. This challenging but important problem, however, has mostly been ignored in the past due to several issues including (i) the difficulty of addressing the complexity of realistic video data as well as (ii) the lack of representative datasets with human actions “in the wild”. In this talk we address both problems and first present a supervised method for detecting human actions in movies. To avoid the prohibitive cost of manual supervision when training many action classes, we next investigate weakly-supervised methods and use movie scripts for automatic annotation of human actions in video. With this approach we automatically retrieve action samples for training and learn discriminative visual action models from a large set of movies. We further argue for the importance of scene context for action recognition and show improvements using mining and classification of action-specific scene classes. We also address the temporal uncertainty of script-based action supervision and present a discriminative clustering algorithm that compensates for this uncertainty and provides substantially improved results for temporal action localization in video. We finally present a comprehensive evaluation of state-of-the-art methods for action recognition on three recent datasets with human actions.
Ivan Laptev

Comparing Evaluation Protocols on the KTH Dataset

Human action recognition has become a hot research topic, and many algorithms have been proposed. Most researchers have evaluated their performance on the KTH dataset, but there is no unified standard for how to evaluate algorithms on this dataset. Different researchers have employed different test setups, so the comparisons are not accurate, fair, or complete. In order to determine how much difference different experimental setups make, we take our own spatio-temporal MoSIFT feature as an example and assess its performance on the KTH dataset using different test scenarios and different partitionings of the data. In all experiments, a support vector machine (SVM) with a chi-square kernel is adopted. First, we evaluate the performance changes resulting from differing vocabulary sizes of the codebook, and then decide on a suitable vocabulary size. Then, we train the models using different training dataset partitions, and test their performance on the corresponding held-out test sets. Experiments show that the best performance of MoSIFT can reach 96.33% on the KTH dataset. When different n-fold cross-validation methods are used, there can be up to a 10.67% difference in the result. And when different dataset segmentations are used (such as KTH1 and KTH2), the difference in results can be up to 5.8% absolute. In addition, the performance changes dramatically when different scenarios are used in the training and test datasets. When training on KTH1 S1+S2+S3+S4 and testing on the KTH1 S1 and S3 scenarios, the performance reaches 97.33% and 89.33%, respectively. This paper shows how different test configurations can skew results, even on a standard dataset. The recommendation is to use simple leave-one-out as the most easily replicable clear-cut partitioning.
Zan Gao, Ming-yu Chen, Alexander G. Hauptmann, Anni Cai
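The chi-square kernel used with the SVM in all of these experiments has a standard closed form. As a small, self-contained illustration (the gamma value and histograms below are invented, not the paper's settings):

```python
import math

def chi_square_kernel(h1, h2, gamma=1.0):
    """Exponential chi-square kernel between two bag-of-words histograms:
    k(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)),
    skipping bins where both histograms are zero."""
    dist = sum((a - b) ** 2 / (a + b)
               for a, b in zip(h1, h2) if a + b > 0)
    return math.exp(-gamma * dist)

# Toy codebook histograms over a 3-word visual vocabulary
k_same = chi_square_kernel([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # identical
k_diff = chi_square_kernel([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # disjoint
```

Identical histograms give a kernel value of 1; the value decays toward 0 as the histograms diverge, which is what makes it a useful similarity for codebook features.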

3D Mean-Shift Tracking of Human Body Parts and Recognition of Working Actions in an Industrial Environment

In this study we describe a method for 3D trajectory based recognition of and discrimination between different working actions in an industrial environment. A motion-attributed 3D point cloud represents the scene based on images of a small-baseline trinocular camera system. A two-stage mean-shift algorithm is used for detection and 3D tracking of all moving objects in the scene. A sequence of working actions is recognised with a particle filter based matching of a non-stationary Hidden Markov Model, relying on spatial context and a classification of the observed 3D trajectories. The system is able to extract an object performing a known action out of a multitude of tracked objects. The 3D tracking stage is evaluated with respect to its metric accuracy based on nine real-world test image sequences for which ground truth data were determined. An experimental evaluation of the action recognition stage is conducted using 20 real-world test sequences acquired from different viewpoints in an industrial working environment. We show that our system is able to perform 3D tracking of human body parts and a subsequent recognition of working actions under difficult, realistic conditions. It detects interruptions of the sequence of working actions by entering a safety mode and returns to the regular mode as soon as the working actions continue.
Markus Hahn, Fuad Quronfuleh, Christian Wöhler, Franz Kummert
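The mean-shift procedure at the heart of the tracking stage can be sketched in a few lines. The version below uses a flat kernel on a 3D point cloud and is only an illustration of the basic iteration; the paper's two-stage, motion-attributed variant is more elaborate.

```python
import math

def mean_shift_step(points, center, bandwidth):
    """One mean-shift iteration with a flat kernel: move the centre to
    the mean of all 3D points within `bandwidth` of it."""
    inside = [p for p in points if math.dist(p, center) <= bandwidth]
    if not inside:
        return center
    return tuple(sum(coord) / len(inside) for coord in zip(*inside))

def mean_shift(points, center, bandwidth, iters=20):
    """Iterate mean-shift steps; the centre converges to a local mode
    (e.g. a moving body part in the motion-attributed point cloud)."""
    for _ in range(iters):
        center = mean_shift_step(points, center, bandwidth)
    return center
```

Started near a cluster of motion-attributed points, the centre settles on the cluster's mode, giving a 3D position hypothesis for the tracked body part.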

Feature Representations for the Recognition of 3D Emblematic Gestures

In human-machine interaction, gestures play an important role as an input modality for natural and intuitive interfaces. The class of gestures often called “emblems” is of special interest since they convey a well-defined meaning in an intuitive way. We present an approach for the visual recognition of 3D dynamic emblematic gestures in a smart room scenario using an HMM-based recognition framework. In particular, we assess the suitability of several feature representations calculated from a gesture trajectory in a detailed experimental evaluation on realistic data.
Jan Richarz, Gernot A. Fink
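In an HMM-based recognition framework like the one described, each gesture class is scored by the likelihood its HMM assigns to the observed feature sequence. The forward algorithm that computes this likelihood is sketched below for a discrete-observation HMM; the toy model parameters are invented, not from the paper.

```python
def forward_likelihood(obs, states, start_p, trans_p, emit_p):
    """Likelihood of an observation sequence under a discrete HMM,
    computed with the forward algorithm. The gesture class whose HMM
    yields the highest likelihood would win at recognition time."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit_p[s][o] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())

# Degenerate single-state HMM emitting quantized trajectory symbols
states = ["s0"]
start_p = {"s0": 1.0}
trans_p = {"s0": {"s0": 1.0}}
emit_p = {"s0": {"up": 0.5, "down": 0.5}}
```

In practice each emblematic gesture would get its own multi-state HMM, and the observations would be quantized trajectory features rather than these toy symbols.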

Social Signals

Types of Help in the Teacher’s Multimodal Behavior

Psychological and social research of recent decades suggests that studying helping relationships may offer important insights for a better understanding of human behavior. In this work we present a study on the over-helping behaviors of teachers in their interaction with pupils, which may deepen our knowledge of how prosocial conduct can eventually produce unexpected effects on social interaction and cognitive development. To differentiate between helping and over-helping, we propose a taxonomy of communicative and non-communicative behaviors of teachers towards their pupils (Section 3), and an annotation scheme aimed at detecting both helping and over-helping in teacher-pupil dyads (Section 4). The results of the study show that the annotation scheme allows the different types of helping behavior to be classified, provides a reliable basis for the analysis of the teacher’s behaviors, and suggests ways to support teachers’ self-reflection, with a view to improving the teacher-pupil relationship and the pupils’ learning processes.
Francesca D’Errico, Giovanna Leone, Isabella Poggi

Honest Signals and Their Contribution to the Automatic Analysis of Personality Traits – A Comparative Study

In our paper we focus on the usage of different kinds of "honest" signals for the automatic prediction of two personality traits, Extraversion and Locus of Control. In particular, we investigate the predictive power of four classes of speech honest-signal features (Conversational Activity, Emphasis, Influence, and Mimicry), along with three fidgeting visual features, by systematically comparing the results obtained by classifiers using them.
Bruno Lepri, Kyriaki Kalimeri, Fabio Pianesi

Speech Emotion Classification and Public Speaking Skill Assessment

This paper presents a new classification algorithm for real-time inference of emotions from the non-verbal features of speech. It identifies simultaneously occurring emotional states by recognising correlations between emotions and features such as pitch, loudness and energy. Pairwise classifiers are constructed for nine classes from the Mind Reading emotion corpus, yielding an average cross-validation accuracy of 89% for the pairwise machines and 86% for the fused machine. The paper also shows a novel application of the classifier for assessing public speaking skills, achieving an average cross-validation accuracy of 81%. Optimisation of support vector machine coefficients is shown to improve the accuracy by up to 25%. The classifier outperforms previous research on the same emotion corpus and achieves real-time performance.
Tomas Pfister, Peter Robinson
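The pairwise classifiers mentioned in the abstract are typically fused by voting: each pairwise machine votes for one of its two classes, and the class with the most votes wins. A minimal sketch of that fusion step follows; the emotion labels are illustrative, not the nine classes of the Mind Reading corpus.

```python
from collections import Counter

def fuse_pairwise(decisions):
    """Fuse pairwise classifier outputs into a single label by majority
    vote. `decisions` maps each unordered class pair to the class that
    its pairwise classifier chose."""
    votes = Counter(decisions.values())
    return votes.most_common(1)[0][0]

# Three pairwise machines over three hypothetical emotion classes
decisions = {
    ("happy", "sad"): "happy",
    ("happy", "angry"): "happy",
    ("sad", "angry"): "sad",
}
predicted = fuse_pairwise(decisions)
```

With n classes there are n(n-1)/2 such machines, and this one-vote-per-pair scheme is the usual way their outputs are combined into a single emotion label.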

Dominance Signals in Debates

The paper analyzes the signals of dominance in different modalities displayed during TV talk shows and debates. Dominance is defined, according to a model in terms of goals and beliefs, as a person’s having more power than others. A scheme is presented for the annotation of signals of dominance in political debates: based on the analysis of videotaped data, a typology is proposed of strategies to convey dominance, and the corresponding signals are overviewed. Strategies range from the aggressive ones of imperiousness, judgement, invasion, norm violation and defiance, to the more subtle touchiness and victimhood, ending up with haughtiness, irony and ridicule, easiness, carelessness and assertiveness.
Isabella Poggi, Francesca D’Errico
