About this book

Techniques of vision-based motion analysis aim to detect, track, identify, and generally understand the behavior of objects in image sequences. With the growth of video data in a wide range of applications from visual surveillance to human-machine interfaces, the ability to automatically analyze and understand object motions from video footage is of increasing importance. Among the latest developments in this field is the application of statistical machine learning algorithms for object tracking, activity modeling, and recognition.

Developed from expert contributions to the first and second International Workshops on Machine Learning for Vision-Based Motion Analysis, this important text/reference presents the latest algorithms and systems for robust and effective vision-based motion understanding from a machine learning perspective. Highlighting the benefits of collaboration between the object motion understanding and machine learning communities, the book discusses the most active research fronts, including current challenges and potential future directions.

Topics and features: provides a comprehensive review of the latest developments in vision-based motion analysis, presenting numerous case studies on state-of-the-art learning algorithms; examines algorithms for clustering and segmentation, and manifold learning for dynamical models; describes the theory behind mixed-state statistical models, with a focus on mixed-state Markov models that take into account spatial and temporal interaction; discusses object tracking in surveillance image streams, discriminative multiple target tracking, and guidewire tracking in fluoroscopy; explores issues of modeling for saliency detection, human gait modeling, modeling of extremely crowded scenes, and behavior modeling from video surveillance data; investigates methods for automatic recognition of gestures in Sign Language, and human action recognition from small training sets.

Researchers, professional engineers, and graduate students in computer vision, pattern recognition, and machine learning will all find this text an accessible survey of machine learning techniques for vision-based motion analysis. The book will also be of interest to all who work with specific vision applications, such as surveillance, sport event analysis, healthcare, video conferencing, and motion video indexing and retrieval.



Manifold Learning and Clustering/Segmentation


Practical Algorithms of Spectral Clustering: Toward Large-Scale Vision-Based Motion Analysis

This chapter presents practical spectral clustering algorithms for large-scale data. Spectral clustering is a kernel-based method for grouping data that lie on separate nonlinear manifolds. Reducing its computational expense without critical loss of accuracy makes it more practical, especially in vision-based applications. The presented algorithms exploit random projection and subsampling to reduce the dimensionality and the cost of evaluating pairwise similarities between data points. The computation time is quasilinear in the data cardinality, and it can be independent of the data dimensionality in some appearance-based applications. The efficiency of the algorithms is demonstrated in appearance-based image/video segmentation.
Tomoya Sakai, Atsushi Imiya
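
The random-projection idea mentioned in the abstract above can be illustrated with a minimal sketch. It is not the chapter's algorithm: the Gaussian projection matrix, the function names, and the chosen dimensions (500 reduced to 200) are assumptions made for illustration. It only shows that pairwise squared distances, and hence Gaussian affinities, are roughly preserved after projecting to a lower dimension.

```python
import math
import random


def random_projection_matrix(dim_in, dim_out, seed=0):
    """Gaussian random matrix, scaled so squared lengths are preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]


def project(matrix, vector):
    """Multiply the projection matrix by one data vector."""
    return [sum(r * v for r, v in zip(row, vector)) for row in matrix]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gaussian_affinity(a, b, sigma):
    """Pairwise similarity of the kind used to build a spectral clustering graph."""
    return math.exp(-squared_distance(a, b) / (2.0 * sigma ** 2))


# Hypothetical data: three random points in a 500-dimensional space.
rng = random.Random(1)
dim_in, dim_out = 500, 200
points = [[rng.uniform(0.0, 1.0) for _ in range(dim_in)] for _ in range(3)]

matrix = random_projection_matrix(dim_in, dim_out)
projected = [project(matrix, p) for p in points]

# Ratio of projected to original squared distance for every pair of points.
ratios = [
    squared_distance(projected[i], projected[j]) / squared_distance(points[i], points[j])
    for i in range(3) for j in range(i + 1, 3)
]
print(ratios)
```

The ratios should stay close to 1, which is why similarities can be evaluated in the cheaper, lower-dimensional space.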

Riemannian Manifold Clustering and Dimensionality Reduction for Vision-Based Analysis

Segmentation is a fundamental aspect of vision-based motion analysis and has therefore been extensively studied. Its goal is to group the data into clusters based upon image properties such as intensity, color, texture, or motion. Most existing segmentation algorithms proceed by associating a feature vector with each pixel in the image or video and then segmenting the data by clustering these feature vectors. This process can be phrased as a manifold learning and clustering problem, where the objective is to learn a low-dimensional representation of the underlying data structure and to segment the data points into different groups. Over the past few years, various techniques have been developed for learning a low-dimensional representation of a nonlinear manifold embedded in a high-dimensional space. Unfortunately, most of these techniques are limited to the analysis of a single connected nonlinear manifold. In addition, all these manifold learning algorithms assume that the feature vectors are embedded in a Euclidean space and make use (at least locally) of the Euclidean metric or a variation of it to perform dimensionality reduction. While this may be appropriate in some cases, there are several computer vision problems where it is more natural to consider features that live in a Riemannian space. To address these problems, algorithms for performing simultaneous nonlinear dimensionality reduction and clustering of data sampled from multiple submanifolds of a Riemannian manifold have recently been proposed. In this book chapter, we give a summary of these newly developed algorithms as described in Goh and Vidal (Conference on Computer Vision and Pattern Recognition, 2007 and 2008; European Conference on Machine Learning, 2008; and European Conference on Computer Vision, 2008) and demonstrate their applications to vision-based analysis.
Alvina Goh

Manifold Learning for Multi-dimensional Auto-regressive Dynamical Models

We present a general differential-geometric framework for learning distance functions for dynamical models. Given a training set of models, the optimal metric is selected among a family of pullback metrics induced by the Fisher information tensor through a parameterized automorphism. The problem of classifying motions, encoded as dynamical models of a certain class, can then be posed on the learnt manifold. In particular, we consider the class of multidimensional autoregressive models of order 2. Experimental results on identity recognition show that such optimal pullback Fisher metrics greatly improve classification performance.
Fabio Cuzzolin
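
To make the notion of an autoregressive model of order 2 concrete, the following minimal sketch fits a scalar AR(2) model, x[t] = a1*x[t-1] + a2*x[t-2], by ordinary least squares. It is an illustration only: the chapter treats multidimensional models and metric learning, neither of which is shown here, and the function names and the example coefficients (0.5 and -0.3) are assumptions.

```python
def fit_ar2(series):
    """Least-squares estimate of (a1, a2) in x[t] = a1*x[t-1] + a2*x[t-2] + noise."""
    if len(series) < 4:
        raise ValueError("need at least four samples")
    s11 = s12 = s22 = r1 = r2 = 0.0
    for t in range(2, len(series)):
        p1, p2, y = series[t - 1], series[t - 2], series[t]
        s11 += p1 * p1
        s12 += p1 * p2
        s22 += p2 * p2
        r1 += p1 * y
        r2 += p2 * y
    # Solve the 2x2 normal equations with Cramer's rule.
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("degenerate series: normal equations are singular")
    a1 = (r1 * s22 - r2 * s12) / det
    a2 = (r2 * s11 - r1 * s12) / det
    return a1, a2


def simulate_ar2(a1, a2, x0, x1, n):
    """Generate n samples of a noise-free AR(2) process (for illustration)."""
    xs = [x0, x1]
    for _ in range(n - 2):
        xs.append(a1 * xs[-1] + a2 * xs[-2])
    return xs


# Hypothetical example: recover the coefficients of a known model.
series = simulate_ar2(0.5, -0.3, 1.0, 0.5, 30)
a1_hat, a2_hat = fit_ar2(series)
print(a1_hat, a2_hat)
```

Each motion sequence would then be represented by its fitted coefficients, and classification operates on distances between such parameter vectors.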



Mixed-State Markov Models in Image Motion Analysis

When analyzing motion observations extracted from image sequences, one notes that the histogram of the velocity magnitude at each pixel shows a large probability mass at zero velocity, while the rest of the motion values may be appropriately modeled with a continuous distribution. This suggests the introduction of mixed-state random variables that have probability mass concentrated in discrete states, while they have a probability density over a continuous range of values. In the first part of the chapter, we give a comprehensive description of the theory behind mixed-state statistical models, in particular the development of mixed-state Markov models, which allow spatial and temporal interaction to be taken into account. The presentation generalizes the case of simultaneous modeling of continuous values and any type of discrete symbolic states. In the second part, we present the application of mixed-state models to motion texture analysis. Motion textures correspond to the instantaneous apparent motion maps extracted from dynamic textures. They depict mixed-state motion values with a discrete state at zero and a Gaussian distribution for the rest. Mixed-state Markov random fields and mixed-state Markov chains are defined and applied to motion texture recognition and tracking.
Tomás Crivelli, Patrick Bouthemy, Bruno Cernuschi Frías, Jian-feng Yao
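
The mixed-state idea described above, a probability mass at zero combined with a Gaussian density elsewhere, can be sketched for independent samples as follows. This is a simplified illustration under stated assumptions: it ignores the spatial and temporal interactions that the Markov models capture, and the function names and sample values are hypothetical. The likelihood is taken with respect to a mixed reference measure (a point mass at zero plus the Lebesgue measure).

```python
import math


def fit_mixed_state(values):
    """Maximum-likelihood fit: rho is the mass at zero; mu and sigma describe the rest."""
    nonzero = [v for v in values if v != 0.0]
    if not nonzero:
        raise ValueError("need at least one nonzero value")
    rho = (len(values) - len(nonzero)) / len(values)
    mu = sum(nonzero) / len(nonzero)
    var = sum((v - mu) ** 2 for v in nonzero) / len(nonzero)
    return rho, mu, math.sqrt(var)


def mixed_state_log_likelihood(value, rho, mu, sigma):
    """Log-likelihood of one observation under the mixed-state model."""
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    if value == 0.0:
        return math.log(rho) if rho > 0.0 else float("-inf")
    if rho >= 1.0:
        return float("-inf")
    log_gauss = (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
                 - (value - mu) ** 2 / (2.0 * sigma ** 2))
    return math.log(1.0 - rho) + log_gauss


# Hypothetical motion values: three static pixels and two moving ones.
samples = [0.0, 0.0, 0.0, 1.0, 3.0]
rho, mu, sigma = fit_mixed_state(samples)
print(rho, mu, sigma)
print(mixed_state_log_likelihood(0.0, rho, mu, sigma))
```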

Learning to Detect Event Sequences in Surveillance Streams at Very Low Frame Rate

Some camera surveillance systems are designed to be autonomous in terms of both energy and storage. Autonomy allows operation in environments where wiring cameras for power and data transmission is not feasible. In these contexts, operating cameras unattended over long periods of time requires choosing a low frame rate that matches the speed of the process to be supervised while minimizing energy and storage usage. The result of surveillance is a large stream of images acquired sparsely over time, with limited visual continuity from one frame to the next. Reviewing these images to detect events of interest requires techniques that do not assume traceability of objects by visual similarity. When the process surveyed shows recurrent patterns of events, as is often the case in industrial settings, other possibilities open up. Since images are time-stamped, techniques that use temporal data can help detect events. This contribution presents an image review tool that combines a scene change detector (SCD) with a temporal filter. The temporal filter learns to recognize relevant SCD events by their time distribution on the image stream. Learning is supported by image annotations provided by end-users during past reviews. The concept is tested on a benchmark of real surveillance images stemming from a nuclear safeguards context. Experimental results show that the combined SCD-temporal filter significantly reduces the workload necessary to detect safeguards-relevant events in large image streams.
Paolo Lombardi, Cristina Versino

Discriminative Multiple Target Tracking

In this chapter, we introduce a metric learning framework to learn a single discriminative appearance model for robust visual tracking of multiple targets. The single appearance model effectively captures the discriminative visual information among the different visual targets as well as the background. The appearance modeling and the tracking of the multiple targets are all cast in a discriminative metric learning framework. We show that an implicit exclusive principle is naturally reinforced in the proposed framework, which renders the tracker robust to cross occlusions among the multiple targets. We demonstrate the efficacy of the proposed multiple target tracker on benchmark visual tracking sequences as well as on real-world video sequences.
Xiaoyu Wang, Gang Hua, Tony X. Han

A Framework of Wire Tracking in Image Guided Interventions

This chapter presents a framework for using computer vision and machine learning methods to track a guidewire, a medical device inserted into vessels during image guided interventions. During interventions, the guidewire exhibits nonrigid deformation due to patients’ breathing and cardiac motions. Such 3D motions become complicated when projected onto 2D fluoroscopic images. Furthermore, fluoroscopic images contain severe image artifacts and other wire-like structures. These factors make robust guidewire tracking a challenging problem. To address these challenges, this chapter presents a probabilistic framework for robust tracking. We introduce a semantic guidewire model that contains three parts: a catheter tip, a guidewire tip, and a guidewire body. Measurements of the different parts are integrated into a Bayesian framework as measurements of a whole guidewire for robust guidewire tracking. For each part, two types of measurements, one from learning-based detectors and the other from appearance models, are combined. A hierarchical and multi-resolution tracking scheme based on kernel-based measurement smoothing is then developed to track guidewires effectively and efficiently in a coarse-to-fine manner. The framework has been validated on a test set containing 47 sequences acquired in clinical environments, and achieves a mean tracking error of less than 2 pixels.
Peng Wang, Andreas Meyer, Terrence Chen, Shaohua K. Zhou, Dorin Comaniciu

Motion Analysis and Behavior Modeling


An Integrated Approach to Visual Attention Modeling for Saliency Detection in Videos

In this chapter, we present a framework to learn and predict regions of interest in videos, based on human eye movements. In our approach, the eye gaze information of several users is recorded as they watch videos that are similar and belong to a particular application domain. This information is used to train a classifier to learn low-level video features from regions that attracted the visual attention of users. Such a classifier is combined with vision-based approaches to provide an integrated framework to detect salient regions in videos. To date, saliency prediction has been viewed from two different perspectives, namely visual attention modeling and spatiotemporal interest point detection. These approaches have largely been vision-based. They detect regions having a predefined set of characteristics, such as complex motion or high contrast, for all kinds of videos. However, what is ‘interesting’ varies from one application to another. By learning features of regions that capture the attention of viewers while watching a video, we aim to distinguish those that are actually salient in the given context from the rest. The integrated approach ensures that both regions with anticipated content (top-down attention) and unanticipated content (bottom-up attention) are predicted by the proposed framework as salient. In our experiments with news videos of popular channels, the results show a significant improvement in the identification of relevant salient regions in such videos, when compared with existing approaches.
Sunaad Nataraju, Vineeth Balasubramanian, Sethuraman Panchanathan

Video-Based Human Motion Estimation by Part-Whole Gait Manifold Learning

This chapter presents a general gait representation framework for video-based human motion estimation that involves gait modeling at both the whole and part levels. Our goal is to estimate the kinematics of an unknown gait from image sequences taken by a single camera. This approach involves two generative models, called the kinematic gait generative model (KGGM) and the visual gait generative model (VGGM), which represent the kinematics and appearances of a gait, respectively, by a few latent variables. In particular, the concept of a gait manifold is proposed to capture the gait variability among different individuals, by which KGGM and VGGM can be integrated for gait estimation, so that a new gait with unknown kinematics can be inferred from gait appearances via KGGM and VGGM. A key issue in generating a gait manifold is the definition of the distance function that reflects the dissimilarity between two individual gaits. Specifically, we investigate and compare three distance functions, each of which leads to a specific gait manifold. Moreover, we extend our gait modeling framework from the whole level to the part level by decomposing a gait into two parts, an upper-body gait and a lower-body gait, each of which is associated with a specific gait manifold for part-level gait modeling. Also, a two-stage inference algorithm is employed for whole-part gait estimation. The proposed algorithms were trained on the CMU Mocap data and tested on the HumanEva data, and the experimental results are promising compared with state-of-the-art algorithms under similar experimental settings.
Guoliang Fan, Xin Zhang

Spatio-Temporal Motion Pattern Models of Extremely Crowded Scenes

Extremely crowded scenes present unique challenges to motion-based video analysis due to the large number of pedestrians within the scene and the frequent occlusions they produce. The movements of pedestrians, however, collectively form a spatially and temporally structured pattern in the motion of the crowd. In this work, we present a novel statistical framework for modeling this structured pattern, or steady-state, of the motion in extremely crowded scenes. Our key insight is to model the motion of the crowd by the spatial and temporal variations of local spatio-temporal motion patterns exhibited by pedestrians within the scene. We divide the video into local spatio-temporal sub-volumes and represent the movement through each sub-volume with a local spatio-temporal motion pattern. We then derive a novel, distribution-based hidden Markov model to encode the temporal variations of local spatio-temporal motion patterns. We demonstrate that by capturing the steady-state of the motion within the scene, we can naturally detect unusual activities as statistical deviations in videos with complex activities that are hard for even human observers to analyze.
Louis Kratz, Ko Nishino
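
The idea of detecting unusual activity as a statistical deviation from a learned temporal model can be sketched with a plain discrete hidden Markov model. This is a deliberate simplification and not the chapter's distribution-based model: here each sub-volume's motion is assumed to be quantized into a small set of symbols, and all parameter values below are hypothetical. A sequence is scored by its log-likelihood under the model, computed with the scaled forward algorithm, and low scores indicate deviations from the usual pattern.

```python
import math


def forward_log_likelihood(observations, start, trans, emit):
    """Log-likelihood of a symbol sequence under a discrete HMM (scaled forward pass)."""
    if not observations:
        return 0.0
    n_states = len(start)
    alpha = [start[s] * emit[s][observations[0]] for s in range(n_states)]
    log_likelihood = 0.0
    for t, symbol in enumerate(observations):
        if t > 0:
            # Propagate through the transition matrix, then weight by the emission.
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
                for s in range(n_states)
            ]
        norm = sum(alpha)
        if norm == 0.0:
            return float("-inf")
        log_likelihood += math.log(norm)
        alpha = [a / norm for a in alpha]  # rescale to avoid numerical underflow
    return log_likelihood


# Hypothetical model: two hidden flow regimes and three quantized motion symbols.
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.80, 0.15, 0.05], [0.15, 0.80, 0.05]]

usual = [0, 0, 0, 0, 1, 1, 1, 1]      # motion consistent with the learned regimes
unusual = [2, 2, 2, 2, 2, 2, 2, 2]    # motion rarely emitted by either regime

usual_ll = forward_log_likelihood(usual, start, trans, emit)
unusual_ll = forward_log_likelihood(unusual, start, trans, emit)
print(usual_ll, unusual_ll)
```

In practice, a threshold on the log-likelihood would flag the second sequence as unusual; how such a threshold is chosen is outside the scope of this sketch.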

Learning Behavioral Patterns of Time Series for Video-Surveillance

This chapter deals with the problem of learning the behavior patterns of people from (possibly large) sets of visual dynamic data, with specific reference to video-surveillance applications. The study focuses mainly on devising meaningful data abstractions able to capture the intrinsic nature of the available data, and on applying similarity measures appropriate to the specific representations. The methods are selected from among the most promising techniques available in the literature and include classical curve fitting, string-based approaches, and hidden Markov models. The analysis considers both supervised and unsupervised settings and is based on a set of loosely labeled data acquired by a real video-surveillance system. The experiments highlight the different characteristics of the methods considered, and the final discussion guides the reader towards the most appropriate choice for a given scenario.
Nicoletta Noceti, Matteo Santoro, Francesca Odone

Gesture and Action Recognition


Recognition of Spatiotemporal Gestures in Sign Language Using Gesture Threshold HMMs

In this paper, we propose a framework for the automatic recognition of spatiotemporal gestures in Sign Language. We implement an extension to the standard hidden Markov model (HMM) to develop a gesture threshold HMM (GT-HMM) framework, which is specifically designed to identify inter-gesture transitions. We evaluate the performance of this system against systems based on conditional random fields (CRF), hidden CRF (HCRF), and latent-dynamic CRF (LDCRF) when recognizing motion gestures and identifying inter-gesture transitions.
Daniel Kelly, John McDonald, Charles Markham

Learning Transferable Distance Functions for Human Action Recognition

Learning-based approaches for human action recognition often rely on large training sets. Most of these approaches do not perform well when only a few training samples are available. In this chapter, we consider the problem of human action recognition from a single clip per action. Each clip contains at most 25 frames. Using a patch-based motion descriptor and matching scheme, we achieve promising results on three different action datasets with a single clip as the template. Our results are comparable to previously published results obtained with much larger training sets. We also present a method for learning a transferable distance function for these patches. The transferable distance function learning extracts generic knowledge of patch weighting from previous training sets, and can be applied to videos of new actions without further learning. Our experimental results show that the transferable distance function learning not only improves the recognition accuracy of single-clip action recognition, but also significantly enhances the efficiency of the matching scheme.
Weilong Yang, Yang Wang, Greg Mori

