
2005 | Book

Computer Vision in Human-Computer Interaction

ICCV 2005 Workshop on HCI, Beijing, China, October 21, 2005. Proceedings

Edited by: Nicu Sebe, Michael Lew, Thomas S. Huang

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this book

Human-Computer Interaction (HCI) lies at the crossroads of many scientific areas including artificial intelligence, computer vision, face recognition, motion tracking, etc. In order for HCI systems to interact seamlessly with people, they need to understand their environment through vision and auditory input. Moreover, HCI systems should learn how to adaptively respond depending on the situation. The goal of this workshop was to bring together researchers from the field of computer vision whose work is related to human-computer interaction. The selected articles for this workshop address a wide range of theoretical and application issues in human-computer interaction ranging from human-robot interaction, gesture recognition, and body tracking, to facial features analysis and human-computer interaction systems. This year 74 papers from 18 countries were submitted and 22 were accepted for presentation at the workshop after being reviewed by at least 3 members of the Program Committee. We therefore had a very competitive acceptance rate of less than 30% and, as a consequence, a very high quality workshop. We would like to thank all members of the Program Committee for their help in ensuring the quality of the papers accepted for publication. We are grateful to Dr. Jian Wang for giving the keynote address. In addition, we wish to thank the organizers of the 10th IEEE International Conference on Computer Vision and our sponsors, the University of Amsterdam, the Leiden Institute of Advanced Computer Science, and the University of Illinois at Urbana-Champaign, for support in setting up our workshop.

Table of Contents

Frontmatter

Multimodal Human Computer Interaction: A Survey

Multimodal Human Computer Interaction: A Survey
Abstract
In this paper we review the major approaches to multimodal human computer interaction from a computer vision perspective. In particular, we focus on body, gesture, gaze, and affective interaction (facial expression recognition, and emotion in audio). We discuss user and task modeling, and multimodal fusion, highlighting challenges, open issues, and emerging applications for Multimodal Human Computer Interaction (MMHCI) research.
Alejandro Jaimes, Nicu Sebe

Tracking

Tracking Body Parts of Multiple People for Multi-person Multimodal Interface
Abstract
Although large displays could allow several users to work together and to move freely in a room, their associated interfaces are limited to contact devices that must generally be shared. This paper describes a novel interface called SHIVA (Several-Humans Interface with Vision and Audio) that allows several users to interact remotely with a very large display using both speech and gesture. The head and both hands of two users are tracked in real time by a stereo vision based system. From the body part positions, the direction pointed at by each user is computed, and selection gestures made with the second hand are recognized. The pointing gesture is fused with the n-best results from speech recognition, taking the application context into account. The system is tested on a chess game with two users playing on a very large display.
Sébastien Carbini, Jean-Emmanuel Viallet, Olivier Bernier, Bénédicte Bascle
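The pointing computation the abstract describes reduces to a ray-plane intersection: extend the head-to-hand ray and intersect it with the display plane. A minimal sketch, assuming tracked 3D head and hand positions and a display lying in a known plane; the function name and the plane convention are illustrative, not the SHIVA code.
```python
# Hypothetical sketch: intersect the head->hand pointing ray with the
# display plane to obtain the pointed-at position. Not the authors' code.
import numpy as np

def pointed_position(head, hand, plane_point, plane_normal):
    head, hand = np.asarray(head, float), np.asarray(hand, float)
    direction = hand - head                       # pointing ray direction
    denom = np.dot(plane_normal, direction)
    if abs(denom) < 1e-9:
        return None                               # ray parallel to display
    t = np.dot(plane_normal, np.asarray(plane_point, float) - head) / denom
    return head + t * direction if t > 0 else None

# Example: user 2 m in front of a display lying in the z = 0 plane.
print(pointed_position([0.1, 1.7, 2.0], [0.3, 1.5, 1.6], [0, 0, 0], [0, 0, 1]))
```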
Articulated Body Tracking Using Dynamic Belief Propagation
Abstract
An efficient articulated body tracking algorithm is proposed in this paper. Due to the high dimensionality of human-body motion, current articulated tracking algorithms based on sampling [1], belief propagation (BP) [2], or non-parametric belief propagation (NBP) [3] are very slow. To accelerate articulated tracking, we adapt belief propagation to the dynamics of articulated human motion. The search space is selected according to a prediction based on human motion dynamics and the current body-configuration estimate. The search space of the dynamic BP tracker is much smaller than that of the traditional BP tracker [2], and dynamic BP does not need the slow Gibbs sampler used in NBP [3,4,5]. Based on a graphical model similar to the pictorial structure [6] or loose-limbed model [3], the proposed efficient dynamic BP is carried out to find the MAP of the body configuration. Experiments on tracking body movement in a meeting scenario show the robustness and efficiency of the proposed algorithm.
Tony X. Han, Thomas S. Huang
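To make the mechanism concrete, here is a toy min-sum (max-product) pass over a chain of body parts. The dynamic-BP idea is that each part's K candidate states come from a small window around a motion-dynamics prediction, which is what keeps K small; the potentials below are placeholders, not the paper's model.
```python
import numpy as np

def chain_map(unary, pairwise):
    """unary: list of (K,) cost arrays per part; pairwise: list of (K, K)
    cost arrays between consecutive parts. Returns the MAP state indices."""
    cost, back = unary[0].copy(), []
    for i in range(1, len(unary)):
        total = cost[:, None] + pairwise[i - 1]   # combine with link cost
        back.append(total.argmin(axis=0))         # best predecessor state
        cost = total.min(axis=0) + unary[i]
    states = [int(cost.argmin())]
    for bp in reversed(back):                     # backtrack the MAP path
        states.append(int(bp[states[-1]]))
    return states[::-1]
```
On a chain (or a tree) this min-sum pass is exact; the speedup described in the abstract comes from keeping each K small via the dynamics-based prediction.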
Recover Human Pose from Monocular Image Under Weak Perspective Projection
Abstract
In this paper we construct a novel human body model using convolution surfaces with an articulated kinematic skeleton. The human body's pose and shape in a monocular image can be estimated from the convolution curve through nonlinear optimization. The contribution of the paper is threefold. First, a human model based on convolution surfaces with articulated skeletons is presented, whose shape can be deformed by changing polynomial and radius parameters. Second, we give a convolution surface and curve correspondence theorem under weak perspective projection, which provides a bridge between the 3D pose and the 2D contour. Third, we model the human body's silhouette with a convolution curve in order to estimate joint parameters from monocular images. Evaluation of the method is performed on a video sequence of a walking man.
Minglei Tong, Yuncai Liu, Thomas S. Huang
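For reference, weak perspective projection, the camera model assumed above, scales all points by a single factor derived from their average depth. A minimal sketch with illustrative names:
```python
# Weak perspective: one shared depth for all joints, so projection is a
# uniform scale plus an image-plane offset. Illustrative sketch only.
import numpy as np

def weak_perspective(joints3d, focal, principal_point=(0.0, 0.0)):
    pts = np.asarray(joints3d, float)             # (N, 3) joint positions
    scale = focal / pts[:, 2].mean()              # single scale from mean depth
    return scale * pts[:, :2] + np.asarray(principal_point, float)
```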
A Joint System for Person Tracking and Face Detection
Abstract
Visual detection and tracking of humans in complex scenes is a challenging problem with a wide range of applications, for example surveillance and human-computer interaction. In many such applications, time-synchronous views from multiple calibrated cameras are available, and both frame-level and space-level human location information is desired. In such scenarios, efficiently combining the strengths of face detection and person tracking is a viable approach that can provide both levels of information and improve robustness. In this paper, we propose a novel vision system that detects and tracks human faces automatically, using input from multiple calibrated cameras. The method uses an AdaBoost algorithm variant combined with mean shift tracking applied to single camera views for face detection and tracking, and fuses the results across camera views to check for consistency and obtain a three-dimensional head estimate. We apply the proposed system to a lecture scenario in a smart room, on a corpus collected as part of the CHIL European Union integrated project. We report results on both frame-level face detection and three-dimensional head tracking. For the latter, the proposed algorithm achieves results comparable to the IBM "PeopleVision" system.
Zhenqiu Zhang, Gerasimos Potamianos, Andrew Senior, Stephen Chu, Thomas S. Huang
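The multi-camera fusion step ends in a standard triangulation: given consistent 2D face detections and calibrated projection matrices, the 3D head position can be recovered linearly. A generic DLT sketch, not the paper's implementation:
```python
# Linear (DLT) triangulation of one 3D point from several calibrated views.
import numpy as np

def triangulate(points2d, proj_mats):
    """points2d: list of (x, y) detections; proj_mats: list of 3x4 matrices."""
    rows = []
    for (x, y), P in zip(points2d, proj_mats):
        rows.append(x * P[2] - P[0])              # two equations per view
        rows.append(y * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]                                    # null-space solution
    return X[:3] / X[3]                           # dehomogenize
```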

Interfacing

Perceptive User Interface, a Generic Approach
Abstract
This paper describes the development of a real-time perceptive user interface. Two cameras are used to detect a user’s head, eyes, hand, fingers and gestures. These cues are interpreted to control a user interface on a large screen. The result is a fully functional integrated system that processes roughly 7.5 frames per second on a Pentium IV system. The calibration of this setup is carried out through a few simple and intuitive routines, making the system adaptive and accessible to non-expert users. The minimal hardware requirements are two web-cams and a computer. The paper will describe how the user is observed (head, eye, hand and finger detection, gesture recognition), the 3D geometry involved, and the calibration steps necessary to set up the system.
Michael Van den Bergh, Ward Servaes, Geert Caenen, Stefaan De Roeck, Luc Van Gool
A Vision Based Game Control Method
Abstract
The appeal of computer games may be enhanced by vision-based user input. The high-speed and low-cost requirements of near-term, mass-market game applications make system design challenging. In this paper we propose a vision-based control method for a 3D racing car game, which analyzes the positions of the player's two fists in the camera's video stream to derive steering commands for the racing car.
The paper focuses in particular on a robust, real-time fist tracking method based on multi-cue fusion with a Bayesian network (BN). First, a new strategy, which employs the latest work in face recognition, is used to automatically create an accurate color model of the fist. Second, color and motion cues are used to generate candidate fist positions. Then, the posterior probability of each candidate is evaluated by the BN, which fuses color and appearance cues. Finally, the fist position is approximated by the hypothesis that maximizes the posterior. Based on the proposed control system, a racing car game, "Simulation Drive", has been developed by our group, offering the player an entirely new experience.
Peng Lu, Yufeng Chen, Xiangyong Zeng, Yangsheng Wang
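The final MAP step amounts to scoring each candidate position under the fused cues and keeping the best hypothesis. A schematic sketch, assuming conditionally independent cues; the paper's BN encodes the actual dependency structure, and the likelihood and prior functions below are placeholders for its learned models.
```python
import numpy as np

def map_fist(candidates, prior, color_lik, appear_lik):
    scores = [prior(c) * color_lik(c) * appear_lik(c) for c in candidates]
    return candidates[int(np.argmax(scores))]     # hypothesis maximizing posterior
```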
Mobile Camera-Based User Interaction
Abstract
We present an approach for facilitating user interaction on mobile devices, focusing on camera-enabled mobile phones. A user interacts with an application by moving the device. An on-board camera captures incoming video, and the scrolling direction and magnitude are estimated using a feature-based tracking algorithm. The direction is used as the scroll direction in the application, and the magnitude sets the zoom level, so the camera acts as both a pointing device and a zoom control. Our approach generates mouse events, so any mouse-driven application can make use of this technique.
Antonio Haro, Koichi Mori, Tolga Capin, Stephen Wilkinson
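A rough sketch of the feature-based tracking step with standard OpenCV calls: track corners between consecutive frames and take the median displacement as the motion vector, whose direction would drive scrolling and whose norm the zoom level. The parameter values are illustrative, and the mapping to mouse events is omitted.
```python
import cv2
import numpy as np

def scroll_vector(prev_gray, cur_gray):
    # Detect corners in the previous frame and track them into the current one.
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100,
                                  qualityLevel=0.01, minDistance=8)
    if pts is None:
        return np.zeros(2)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, pts, None)
    good = status.ravel() == 1
    if not good.any():
        return np.zeros(2)
    flow = (nxt[good] - pts[good]).reshape(-1, 2)
    return np.median(flow, axis=0)                # robust camera-motion estimate
```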

Event Detection

Fast Head Tilt Detection for Human-Computer Interaction
Abstract
Accurate head tilt detection has great potential to aid people with disabilities in the use of human-computer interfaces and to provide universal access to communication software. We show how it can be utilized to tab through links on a web page or control a video game with head motions. It may also be useful as a correction method for currently available video-based assistive technology that requires upright facial poses. Few of the existing computer vision methods that detect head rotations in and out of the image plane with reasonable accuracy can operate within the context of a real-time communication interface, because the computational expense they incur is too great. Our method uses a variety of metrics to obtain a robust head tilt estimate without the computational cost of previous methods. Our system runs in real time on a computer with a 2.53 GHz processor, 256 MB of RAM and an inexpensive webcam, using only 55% of the processor cycles.
Benjamin N. Waber, John J. Magee, Margrit Betke
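One cheap metric of the kind such a system can aggregate is the in-plane tilt of the line joining the two eye centers. A minimal sketch, assuming eye locations come from an upstream detector; this illustrates the idea, not the authors' specific metric set.
```python
import math

def tilt_degrees(left_eye, right_eye):
    # In-plane head tilt from the eye-to-eye line; 0 degrees = level eyes.
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))
```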
Attention Monitoring Based on Temporal Signal-Behavior Structures
Abstract
In this paper, we discuss a system that estimates user attention to displayed content through temporal analysis of the user's exhibited behavior. Detecting user attention and controlling content accordingly are key issues in our "networked interaction therapy system," which effectively attracts the attention of memory-impaired people. In the proposed system, user behavior, including facial movements and body motions ("beat actions"), is detected with vision-based methods. User attention to the displayed content is then estimated from whether the face is oriented toward the display and from body motions synchronous with auditory signals. The design of this attention monitoring mechanism is derived from observations of actual patients. The estimated attention level can be used to control content so as to draw more of the viewer's attention to the display. Experimental results suggest that the content switching mechanism effectively attracts user interest.
Akira Utsumi, Shinjiro Kawato, Shinji Abe
Action Recognition with Global Features
Abstract
In this study, a new method for recognizing and segmenting everyday-life actions is proposed. Only one camera is used, without calibration. Viewpoint invariance is obtained through several acquisitions of the same action. To enhance robustness, each sequence is characterized globally: moving areas are first detected in each image. These binary points form a volume in the three-dimensional (3D) space (x, y, t), which is characterized by its geometric 3D moments. Action recognition is then carried out by computing the Mahalanobis distance between the feature vector of the action to be recognized and those of the reference database. Results validating the suggested approach are presented on a base of 1662 sequences performed by several persons and categorized into eight actions. An extension of the method to the segmentation of sequences containing several actions is also proposed.
Arash Mokhber, Catherine Achard, Xingtai Qu, Maurice Milgram
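The global descriptor and matching rule condense to a few lines: central geometric moments of the binary (x, y, t) volume, compared to per-class statistics with the Mahalanobis distance. A sketch with an illustrative choice of moment orders, not the paper's exact feature set:
```python
import numpy as np

def volume_moments(volume,
                   orders=((2,0,0), (0,2,0), (0,0,2), (1,1,0), (1,0,1), (0,1,1))):
    x, y, t = np.nonzero(volume)                  # coordinates of motion voxels
    cx, cy, ct = x.mean(), y.mean(), t.mean()     # volume centroid
    return np.array([((x - cx)**p * (y - cy)**q * (t - ct)**r).mean()
                     for p, q, r in orders])      # central moments

def mahalanobis(vec, class_mean, class_cov):
    d = vec - class_mean
    return float(np.sqrt(d @ np.linalg.inv(class_cov) @ d))
```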
3D Human Action Recognition Using Spatio-temporal Motion Templates
Abstract
Our goal is automatic recognition of basic human actions, such as stand, sit, and wave hands, to aid natural communication between a human and a computer. Human actions are inferred from human body joint motions, but such data is high-dimensional, and large spatial and temporal variations may occur in executing the same action. We present a learning-based approach for the representation and recognition of 3D human action. Each action is represented by a template consisting of a set of weighted channels. Each channel corresponds to the evolution of one 3D joint coordinate, and its weight is learned according to the Neyman-Pearson criterion. We use the learned templates to recognize actions based on a χ² error measure. Results of recognizing 22 actions on a large set of motion capture sequences, as well as several annotated and automatically tracked sequences, show the effectiveness of the proposed algorithm.
Fengjun Lv, Ramakant Nevatia, Mun Wai Lee
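The recognition rule is a channel-weighted χ² comparison between an observed joint-trajectory matrix and a class template. A sketch in which the weights stand in for the Neyman-Pearson-learned ones; the absolute values in the denominator are an assumption made here because joint coordinates can be negative, not the paper's exact normalization.
```python
import numpy as np

def weighted_chi2(obs, template, weights, eps=1e-9):
    """obs, template: (channels, frames); weights: (channels,)."""
    per_channel = ((obs - template) ** 2 /
                   (np.abs(obs) + np.abs(template) + eps)).sum(axis=1)
    return float(np.dot(weights, per_channel))    # smaller = better match
```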

Augmented Reality

Interactive Point-and-Click Segmentation for Object Removal in Digital Images
Abstract
In this paper, we explore the problem of deleting objects from still pictures. We present an interactive system based on a novel, intuitive, user-friendly interface for removing undesirable objects from digital pictures. To erase an object in an image, the user indicates which object is to be removed by simply pinpointing it with the mouse cursor. As the mouse cursor rolls over the image, the border of the currently (implicitly) selected object is highlighted, providing visual feedback. In case the computer-segmented area does not match the user's perception of the object, the user can provide a few inside/outside cues by clicking on a small number of object or non-object pixels. Experimentally, a small number of such cues is generally enough to reach a correct match, even for complex textured images. Afterwards, the user removes the object by clicking the left mouse button, and a hole-filling technique is initiated to generate a seamless background portion. Our image manipulation system consists of two components: (i) fully automatic or partially user-steered image segmentation based on an improved fast statistical region-growing segmentation, and (ii) texture synthesis or image inpainting of irregularly shaped hole regions. Experiments on a variety of photographs demonstrate the ability of the system to handle complex scenes with highly textured objects.
Frank Nielsen, Richard Nock
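The point-and-click selection rests on seeded region growing. A bare-bones sketch: breadth-first growth from the clicked pixel, admitting 4-neighbors whose color stays close to the running region mean. The fixed threshold and 4-connectivity are simplifications of the paper's statistical merging predicate.
```python
import numpy as np
from collections import deque

def grow_region(image, seed, thresh=20.0):
    """image: (H, W, 3) array; seed: (row, col) of the clicked pixel."""
    h, w = image.shape[:2]
    mask = np.zeros((h, w), bool)
    mask[seed] = True
    mean, count, queue = image[seed].astype(float), 1, deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx]:
                if np.linalg.norm(image[ny, nx] - mean) < thresh:
                    mask[ny, nx] = True           # absorb similar neighbor
                    mean = (mean * count + image[ny, nx]) / (count + 1)
                    count += 1
                    queue.append((ny, nx))
    return mask                                   # object mask to cut and inpaint
```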
Information Layout and Interaction Techniques on an Augmented Round Table
Abstract
Round tabletop display systems are currently being promoted, but neither the optimal way to display large amounts of information on them nor how to interact with them has been fully considered. This paper describes information presentation and interaction techniques for a large number of files on a round tabletop display system. Three layouts are explored on our augmented table system: a sequential layout, a classification layout, and a spiral layout. Users can search for and find files by virtually rotating the circular display using a "hands-on" technique.
Shintaro Kajiwara, Hideki Koike, Kentaro Fukuchi, Kenji Oka, Yoichi Sato
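Of the three layouts, the spiral is the most algorithmic: thumbnails sit on an Archimedean spiral and the whole arrangement turns with the user's rotation gesture. A sketch with arbitrary spacing constants, offered as an illustration of the idea rather than the authors' layout code:
```python
import math

def spiral_positions(n, rotation=0.0, radius_step=8.0, angle_step=0.5):
    # i-th thumbnail at angle i*angle_step (+ user rotation); radius grows
    # linearly with the unrotated angle, so rotation turns the layout rigidly.
    positions = []
    for i in range(n):
        theta = i * angle_step + rotation
        r = radius_step * (i * angle_step)        # Archimedean spiral
        positions.append((r * math.cos(theta), r * math.sin(theta)))
    return positions
```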
On-Line Novel View Synthesis Capable of Handling Multiple Moving Objects
Abstract
This paper presents a new interactive teleconferencing system. It adds a ‘virtual’ camera to the scene which can move freely in between multiple real cameras. The viewpoint can automatically be selected using basic cinematographic rules, based on the position and the actions of the instructor. This produces a clearer and more engaging view for the remote audience, without the need for a human editor.
For the creation of the novel views generated by such a 'virtual' camera, segmentation and depth calculations are required. The system is semi-automatic, in that the user is asked to indicate a few corresponding points or edges to generate an initial rough background model. Besides the static background and the moving foreground, multiple independently moving objects are also catered for. The initial foreground contour is tracked over time using a new active contour. If a second object appears, the contour prediction allows the system to recognize this situation and take appropriate measures. The 3D models are continuously validated based on a Birchfield dissimilarity measure. The foreground model is updated every frame; the background is refined if necessary. The current implementation reaches approximately 4 fps on a single desktop.
Indra Geys, Luc Van Gool

Hand and Gesture

Resolving Hand over Face Occlusion
Abstract
This paper presents a method to segment the hand over complex backgrounds, such as the face. The similar colors and texture of the hand and face make the problem particularly challenging. Our method is based on the concept of an image force field. In this representation, each image location holds a vector value that is a nonlinear combination of the remaining pixels in the image. We introduce and develop a novel physics-based feature that measures regional structure in the image, thus avoiding local pixel-based analysis, which breaks down under our conditions. The regional image structure changes in the occluded region during occlusion; elsewhere, it remains relatively constant. We model the regional image structure at all image locations over time using a Mixture of Gaussians (MoG) to detect the occluded region in the image. We have tested the method on a number of sequences, demonstrating the versatility of the proposed approach.
Paul Smith, Niels da Vitoria Lobo, Mubarak Shah
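The temporal model in miniature: per-location statistics of the regional feature, with outliers flagged as occlusion. The paper maintains a full Mixture of Gaussians over its force-field feature; a single running Gaussian per location, shown below, illustrates only the mechanics.
```python
import numpy as np

class RunningGaussian:
    def __init__(self, shape, alpha=0.05):
        self.mean = np.zeros(shape)               # per-location feature mean
        self.var = np.ones(shape)                 # per-location feature variance
        self.alpha = alpha                        # learning rate

    def update(self, features, k=2.5):
        # Flag locations whose feature deviates k sigma from the model,
        # then fold the new observation into the running statistics.
        changed = (features - self.mean) ** 2 > k ** 2 * self.var
        self.mean += self.alpha * (features - self.mean)
        self.var += self.alpha * ((features - self.mean) ** 2 - self.var)
        return changed                            # True where occlusion is likely
```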
Real-Time Adaptive Hand Motion Recognition Using a Sparse Bayesian Classifier
Abstract
An approach to increasing the adaptability of a recognition system, one that can recognize 10 elementary gestures and be extended to sign language recognition, is proposed. In this work, recognition is done by first extracting a motion gradient orientation image from raw video input and then classifying a feature vector generated from this image into one of the 10 gestures with a sparse Bayesian classifier. The classifier is designed to support online incremental learning, so it can be re-trained to increase its adaptability to input captured under new conditions. Experiments show that the accuracy of the classifier can be boosted from less than 40% to over 80% by re-training it with 5 newly captured samples from each gesture class. Apart from its better adaptability, the system works reliably in real time and gives a probabilistic output that is useful in complex motion analysis.
Shu-Fai Wong, Roberto Cipolla
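The front-end feature can be sketched without the classifier: keep a motion history image (MHI) updated from frame differences and take its gradient orientation. The decay and threshold values are illustrative assumptions; the paper feeds a vector derived from such an image to the sparse Bayesian classifier.
```python
import cv2
import numpy as np

def motion_gradient_orientation(mhi, prev_gray, cur_gray,
                                tau=15.0, diff_thresh=30):
    moving = cv2.absdiff(prev_gray, cur_gray) > diff_thresh
    mhi = np.where(moving, tau, np.maximum(mhi - 1.0, 0.0))  # stamp / decay
    gx = cv2.Sobel(mhi, cv2.CV_64F, 1, 0)
    gy = cv2.Sobel(mhi, cv2.CV_64F, 0, 1)
    return mhi, np.arctan2(gy, gx)                # per-pixel orientation (rad)
```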
Topographic Feature Mapping for Head Pose Estimation with Application to Facial Gesture Interfaces
Abstract
We propose a new general approach to the problem of head pose estimation, based on semi-supervised low-dimensional topographic feature mapping. We show how several recently proposed nonlinear manifold learning methods can be applied in this general framework and, additionally, present a new algorithm, IsoScale, which combines the best aspects of some of the other methods. The efficacy of the proposed approach is illustrated both on a view- and illumination-varied face database and in a real-world human-computer interface application: a head-pose-based facial-gesture interface for automatic wheelchair navigation.
Bisser Raytchev, Ikushi Yoda, Katsuhiko Sakaue
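One concrete instance of this family of methods, using scikit-learn's Isomap (IsoScale itself is the paper's own algorithm and is not reproduced here): embed training face vectors, then read pose off labeled neighbors in the low-dimensional space.
```python
import numpy as np
from sklearn.manifold import Isomap
from sklearn.neighbors import KNeighborsRegressor

def fit_pose_model(face_vectors, pose_angles, dim=2, k=8):
    embedder = Isomap(n_neighbors=k, n_components=dim)
    coords = embedder.fit_transform(face_vectors)  # manifold coordinates
    regressor = KNeighborsRegressor(n_neighbors=k).fit(coords, pose_angles)
    return embedder, regressor
# Pose of a new face x: regressor.predict(embedder.transform([x]))
```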
Accurate and Efficient Gesture Spotting via Pruning and Subgesture Reasoning
Abstract
Gesture spotting is the challenging task of locating the start and end frames of the video stream that correspond to a gesture of interest, while at the same time rejecting non-gesture motion patterns. This paper proposes a new gesture spotting and recognition algorithm that is based on the continuous dynamic programming (CDP) algorithm and runs in real time. To make gesture spotting efficient, a pruning method is proposed that allows the system to evaluate a relatively small number of hypotheses compared to CDP. Pruning is implemented by a set of model-dependent classifiers that are learned from training examples. To make gesture spotting more accurate, a subgesture reasoning process is proposed that models the fact that some gesture models can falsely match parts of other, longer gestures. In our experiments, the proposed method with pruning and subgesture modeling is an order of magnitude faster and 18% more accurate than the original CDP algorithm.
Jonathan Alon, Vassilis Athitsos, Stan Sclaroff
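A minimal continuous-DP matcher in the spirit of CDP: at each stream frame, keep the best cost of aligning every model prefix so that it ends there, and report a candidate gesture whenever the full-model cost drops below a threshold. The pruning classifiers and subgesture reasoning, the paper's actual contributions, are omitted from this sketch.
```python
import numpy as np

def cdp_spot(stream, model, thresh):
    """stream, model: sequences of feature vectors. Returns candidate end frames."""
    M = len(model)
    prev = np.full(M, np.inf)
    hits = []
    for t, obs in enumerate(stream):
        local = np.array([np.linalg.norm(obs - m) for m in model])
        cur = np.empty(M)
        cur[0] = local[0]                         # a gesture may start at any frame
        for j in range(1, M):
            cur[j] = local[j] + min(prev[j - 1], prev[j], cur[j - 1])
        if cur[-1] < thresh:
            hits.append(t)                        # full model matched, ending at t
        prev = cur
    return hits
```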

Applications

A Study of Detecting Social Interaction with Sensors in a Nursing Home Environment
Abstract
Social interaction plays an important role in our daily lives. It is one of the most important indicators of physical or mental disease in aging patients. In this paper, we present a Wizard of Oz study on the feasibility of detecting social interaction with sensors in skilled nursing facilities. Our study explores statistical models that can be constructed to monitor and analyze social interactions among aging patients and nurses. We are also interested in identifying the sensors that are most useful for interaction detection and in determining how robustly detection can be performed with noisy sensors. We simulate a wide range of plausible sensors using human labeling of audio and visual data. Based on these simulated sensors, we build statistical models for both individual sensors and combinations of multiple sensors using various machine learning methods. Comparison experiments demonstrate the effectiveness and robustness of the sensors and statistical models for detecting interactions.
Datong Chen, Jie Yang, Howard Wactlar
HMM Based Falling Person Detection Using Both Audio and Video
Abstract
Automatic detection of a falling person in video is an important problem with applications in security and safety areas, including supportive home environments and CCTV surveillance systems. In this paper, human motion in video is modeled using Hidden Markov Models (HMMs). In addition, the audio track of the video is used to distinguish a person simply sitting down on the floor from a person stumbling and falling. Most video recording systems can record audio as well, so the impact sound of a falling person is available as an additional cue. A decision based on the audio channel is also reached using HMMs and is fused with the results of the HMMs modeling the video data to reach a final decision.
B. Uğur Töreyin, Yiğithan Dedeoğlu, A. Enis Çetin
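The late fusion reduces to combining per-modality log-likelihood ratios from "fall" and "non-fall" HMMs. A sketch assuming already-trained models exposing a hmmlearn-style score() method returning a log-likelihood; the models dict and the weighting are illustrative assumptions, not the paper's fusion rule.
```python
def fused_fall_decision(video_feats, audio_feats, models, w_video=0.5):
    # Log-likelihood ratio per modality, then a weighted vote across modalities.
    lr_video = (models["video_fall"].score(video_feats)
                - models["video_other"].score(video_feats))
    lr_audio = (models["audio_fall"].score(audio_feats)
                - models["audio_other"].score(audio_feats))
    return w_video * lr_video + (1.0 - w_video) * lr_audio > 0.0
```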
Appearance Manifold of Facial Expression
Abstract
This paper investigates the appearance manifold of facial expression: embedding image sequences of facial expression from the high dimensional appearance feature space to a low dimensional manifold. We explore Locality Preserving Projections (LPP) to learn expression manifolds from two kinds of feature space: raw image data and Local Binary Patterns (LBP). For manifolds of different subjects, we propose a novel alignment algorithm to define a global coordinate space, and align them on one generalized manifold. Extensive experiments on 96 subjects from the Cohn-Kanade database illustrate the effectiveness of the alignment algorithm. The proposed generalized appearance manifold provides a unified framework for automatic facial expression analysis.
Caifeng Shan, Shaogang Gong, Peter W. McOwan
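LPP itself fits in a short function: build a k-NN heat-kernel affinity over the image vectors and solve the generalized eigenproblem XᵀLX a = λ XᵀDX a, keeping the eigenvectors of the smallest eigenvalues. The neighborhood size, kernel width, and regularizer below are illustrative choices, not the paper's settings.
```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lpp(X, n_components=2, k=5, t=1.0, reg=1e-6):
    d2 = cdist(X, X, "sqeuclidean")
    W = np.exp(-d2 / t)                           # heat-kernel affinities
    far = np.argsort(d2, axis=1)[:, k + 1:]      # beyond the k nearest (and self)
    for i, cols in enumerate(far):
        W[i, cols] = 0.0
    W = np.maximum(W, W.T)                        # symmetrize the graph
    D = np.diag(W.sum(axis=1))
    L = D - W                                     # graph Laplacian
    A = X.T @ L @ X + reg * np.eye(X.shape[1])
    B = X.T @ D @ X + reg * np.eye(X.shape[1])
    _, vecs = eigh(A, B)                          # generalized eigensolver (ascending)
    return X @ vecs[:, :n_components]             # low-dimensional embedding
```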
Backmatter
Metadata
Title
Computer Vision in Human-Computer Interaction
Edited by
Nicu Sebe
Michael Lew
Thomas S. Huang
Copyright Year
2005
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-32129-3
Print ISBN
978-3-540-29620-1
DOI
https://doi.org/10.1007/11573425
