Silhouette lookup for monocular 3D pose tracking☆
Introduction
Researchers have worked for decades towards the goal of a computer system that can track the articulated pose of a moving human being from monocular video input [1], [2], [3]. An effective pose tracking system would immediately enable applications in security, ergonomics, human–computer interaction, and many other fields. Yet a recent study concluded that none of the automated tracking methods tested could successfully track a moderately difficult example [4]. Recovery from tracking errors therefore deserves more than the scant research attention it has received to date [5]. Furthermore, currently popular approaches based upon particle tracking are slowed by the need to propagate multiple samples at each frame. Research into non-incremental, recoverable tracking mechanisms therefore fills a pressing need.
This paper develops a lookup-based approach to pose tracking, focusing in particular on silhouette lookup. This approach, hereafter referred to as SiLo tracking, offers significant advantages over currently popular methods using parameter optimization and particle tracking algorithms. The SiLo tracker described in Section 2 requires no human input for initialization. Even if it makes grave errors during difficult sections of a video, it can automatically recover to track the correct pose on subsequent frames. Furthermore, although the implementation described here is not optimized for speed, the approach invites significantly faster implementations than those based upon optimization and particle tracking.
Several developments contribute to enable these advances. The many-to-one silhouette-to-pose relationship has in the past proved a barrier to the development of silhouette-based trackers. The new technique exploits temporal continuity to choose the best hypothesis among multiple candidate poses at each frame, via a Markov chain formulation. Because the tracker need not find a single perfect match, simple yet effective metrics suffice for the rapid retrieval of candidate silhouettes. Finally, smoothing and optimization based upon polynomial splines ensure that the tracked output forms a plausible human motion that matches the observations.
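The full formulation is not shown in this preview; a minimal sketch of the temporal-continuity idea, framed as dynamic-programming (Viterbi-style) decoding over per-frame candidate poses, may clarify it. The function names and cost structure below are hypothetical, not the paper's:

```python
import numpy as np

def silo_decode(candidates, match_cost, transition_cost):
    """Pick one pose per frame from retrieved candidates by dynamic
    programming: each choice pays a silhouette-match cost, and each
    pair of consecutive choices pays a temporal-continuity cost."""
    n = len(candidates)
    # best[t][j]: minimal total cost of a path ending at candidate j of frame t
    best = [np.array([match_cost(p) for p in candidates[0]], dtype=float)]
    back = []
    for t in range(1, n):
        costs = np.array([match_cost(p) for p in candidates[t]], dtype=float)
        # step[cur, prev]: cost of moving from a frame t-1 candidate to a frame t candidate
        step = np.array([[transition_cost(prev, cur)
                          for prev in candidates[t - 1]]
                         for cur in candidates[t]], dtype=float)
        total = step + best[-1][None, :]
        back.append(total.argmin(axis=1))      # best predecessor per candidate
        best.append(costs + total.min(axis=1))
    # trace the cheapest path backwards through the predecessor pointers
    j = int(best[-1].argmin())
    path = [j]
    for ptr in reversed(back):
        j = int(ptr[j])
        path.append(j)
    path.reverse()
    return [candidates[t][path[t]] for t in range(n)]
```

With scalar stand-ins for poses, `silo_decode([[0, 5], [1, 9], [2, 8]], lambda p: 0.0, lambda a, b: abs(a - b))` selects the smooth path `[0, 1, 2]` rather than the jumpy alternatives, which is exactly the disambiguation role temporal continuity plays above.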
The sections that follow describe each of these contributions in more detail. Section 2 describes the SiLo tracking algorithm and places it in the context of previous work. Section 3 describes experimental results using the algorithm. Section 4 concludes with an analysis of the approach's strengths and weaknesses, and a discussion of future work.
Section snippets
SiLo tracking
The algorithm described below takes as its input raw video from a single fixed viewpoint, assumed for simplicity to contain a single human being entirely within the camera frame and unoccluded by other objects. Multiple subjects, partial visibility, and camera motion make the already challenging problem more difficult. Although this paper will at times indicate how such additional complications might be addressed, they fall beyond its focus, and the experiments will all use input that conforms to these assumptions.
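The fixed-viewpoint assumption is what makes silhouette extraction tractable; the paper's references include background-subtraction methods for this step. As a hedged illustration only (the thresholded-differencing scheme and names below are assumptions, not the paper's method), foreground silhouettes against a fixed background might be computed as:

```python
import numpy as np

def extract_silhouette(frame, background, threshold=30):
    """Binary silhouette by thresholded background differencing.
    frame, background: (H, W) or (H, W, 3) uint8 arrays from a fixed
    camera; returns a boolean (H, W) foreground mask."""
    # widen to signed ints so the subtraction cannot wrap around
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    if diff.ndim == 3:                # colour input: take the max over channels
        diff = diff.max(axis=2)
    return diff > threshold
```

A robust system would add shadow suppression and an adaptive background model, as in the Karmann–von Brandt and Horprasert et al. references cited below; the fixed-background version suffices only under the single-subject, static-camera assumptions stated above.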
Experimental results
Quantitative evaluation of 3D pose reconstruction is notoriously difficult, and standard test sets have yet to emerge. It is difficult to obtain ground truth calibrated with real video. This section therefore begins with quantitative results on synthetic input for which ground truth is known. Further experiments apply the methods described above to real video clips that lack ground truth but span a wider range of difficulty.
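Comparison against synthetic ground truth typically reduces to a per-joint position error. The exact error measure used in the paper is not shown in this preview; a minimal sketch of one common choice, mean Euclidean joint error, is:

```python
import numpy as np

def mean_joint_error(predicted, truth):
    """Mean Euclidean distance between corresponding 3D joints,
    averaged over all joints and frames.  Both inputs have shape
    (frames, joints, 3)."""
    predicted = np.asarray(predicted, dtype=float)
    truth = np.asarray(truth, dtype=float)
    per_joint = np.linalg.norm(predicted - truth, axis=2)  # (frames, joints)
    return float(per_joint.mean())
```

Because monocular reconstruction leaves global depth and scale ambiguous, published evaluations often align the skeletons (e.g., by root translation) before applying such a metric; the sketch above omits any alignment.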
Conclusion
The SiLo tracker demonstrates successful self-initialization and error-recovery for 3D pose tracking from monocular video. It infers realistic depth information missing from the 2D input. Like many other current algorithms for monocular 3D pose tracking, it makes some errors, but unlike most techniques it can recover automatically and regain the correct track on subsequent frames without human intervention.
Despite the positive results presented in this paper, silhouette lookup remains an
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant no. IIS-0328741. The training data used in this project was obtained from mocap.cs.cmu.edu. That database was created with funding from NSF EIA-0186217.
References (33)
- Model-based vision: a program to see a walking person, Image and Vision Computing (1983)
- et al., A survey of computer vision-based human motion capture, Computer Vision and Image Understanding (2001)
- et al., Combining the evidence of multiple query representations for information retrieval, Information Processing and Management (1995)
- et al., A flexible image database system for content-based retrieval, Computer Vision and Image Understanding (1999)
- et al., Model-based image analysis of human motion using constraint propagation, IEEE Transactions on Pattern Analysis and Machine Intelligence (1980)
- D. DiFranco, T.-J. Cham, J.M. Rehg, Reconstruction of 3-d figure motion from 2-d correspondences, in: IEEE Computer...
- S. Ioffe, D. Forsyth, Human tracking with mixtures of trees, in: International Conference on Computer Vision, 2001, pp....
- et al., A robust human-silhouette extraction technique for interactive virtual environments
- K.-P. Karmann, A. von Brandt, Moving object recognition using an adaptive background memory, in: Time-Varying Image...
- T. Horprasert, D. Harwood, L. Davis, A robust background subtraction and shadow detection, in: Proceedings of the Asian...
Cited by (34)
- Video quality evaluation toward complicated sport activities for clustering analysis, Future Generation Computer Systems (2021). Citation excerpt: "These fine-grained object details are discriminative for distinguishing different types of hand gestures. To extract all these object details, we employ the BING objectness measure [6] to produce a concise set of object patches. The key advantage of BING is its highly competitive speed."
- Deeply fusing multi-model quality-aware features for sophisticated human activity understanding, Signal Processing: Image Communication (2020). Citation excerpt: "Kolev et al. [9] formulated the color moment to construct 3D-appearance that similar to the evaluated human action profile. Howe et al. [10] calculated the human posture from the known human posture profile sets. The proposed algorithm by Howe et al. was simple and effective."
- Retrieval-based cartoon gesture recognition and applications via semi-supervised heterogeneous classifiers learning, Pattern Recognition (2013). Citation excerpt: "In [34], high local extremum points which correspond to pose images along the motion curve are considered as most representative in key-pose selection. HD is used for pose inferring in [35]. In our application, we use the HD as a distance metric and evaluate extremum values for keyframe determination, and an example result is visualized in Fig. 6(c), where each keyframe is marked by white rectangle."
- 3D human pose recovery from image by efficient visual feature selection, Computer Vision and Image Understanding (2011). Citation excerpt: "The mapping is often approximated using regression models [14]. Alternatively, one can directly rely on a dense training database, leading to the so-called 'examplar-based' methods [15,3,5,16]. Generally speaking, generative methods are more accurate, but the computation is often expensive."
- Silhouette representation and matching for 3D pose discrimination - A comparative study, Image and Vision Computing (2010)
- An image-constrained particle filter for 3D human motion tracking, IEEE Access (2019)
☆ Based on 'Silhouette lookup for automatic pose tracking', by Nicholas R. Howe, which appeared in the IEEE Workshop on Articulated and Non-rigid Motion. ©2004 IEEE.