Silhouette lookup for monocular 3D pose tracking☆
Introduction
Researchers have worked for decades towards the goal of a computer system that can track the articulated pose of a moving human being from monocular video input [1], [2], [3]. An effective pose tracking system would immediately enable applications in security, ergonomics, human–computer interaction, and many other fields. Yet a recent study concluded that none of the automated tracking methods tested could successfully track a moderately difficult example [4]. Recovery from tracking errors therefore deserves more than the scant research attention it has received to date [5]. Furthermore, currently popular approaches based upon particle tracking are slowed by the need to propagate multiple samples at each frame. Research into non-incremental, recoverable tracking mechanisms therefore fills a pressing need.
This paper develops a lookup-based approach to pose tracking, focusing in particular on silhouette lookup. This approach, hereafter referred to as SiLo tracking, offers significant advantages over currently popular methods using parameter optimization and particle tracking algorithms. The SiLo tracker described in Section 2 requires no human input for initialization. Even if it makes grave errors during difficult sections of a video, it can automatically recover to track the correct pose on subsequent frames. Furthermore, although the implementation described here is not optimized for speed, the approach invites significantly faster implementations than those based upon optimization and particle tracking.
Several developments contribute to enable these advances. The many-to-one silhouette-to-pose relationship has in the past proved a barrier to the development of silhouette-based trackers. The new technique exploits temporal continuity to choose the best hypothesis among multiple candidate poses at each frame, via a Markov chain formulation. Because the tracker need not find a single perfect match, simple yet effective metrics suffice for the rapid retrieval of candidate silhouettes. Finally, smoothing and optimization based upon polynomial splines ensure that the tracked output forms a plausible human motion that matches the observations.
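The full formulation is not shown in this preview; a minimal sketch of the temporal-continuity idea, framed as dynamic-programming (Viterbi-style) decoding over per-frame candidate poses, may clarify it. The function names and cost structure below are hypothetical, not the paper's:

```python
import numpy as np

def silo_decode(candidates, match_cost, transition_cost):
    """Pick one pose per frame from retrieved candidates by dynamic
    programming: each choice pays a silhouette-match cost, and each
    pair of consecutive choices pays a temporal-continuity cost."""
    n = len(candidates)
    # best[t][j]: minimal total cost of a path ending at candidate j of frame t
    best = [np.array([match_cost(p) for p in candidates[0]], dtype=float)]
    back = []
    for t in range(1, n):
        costs = np.array([match_cost(p) for p in candidates[t]], dtype=float)
        # step[cur, prev]: cost of moving from a frame t-1 candidate to a frame t candidate
        step = np.array([[transition_cost(prev, cur)
                          for prev in candidates[t - 1]]
                         for cur in candidates[t]], dtype=float)
        total = step + best[-1][None, :]
        back.append(total.argmin(axis=1))      # best predecessor per candidate
        best.append(costs + total.min(axis=1))
    # trace the cheapest path backwards through the predecessor pointers
    j = int(best[-1].argmin())
    path = [j]
    for ptr in reversed(back):
        j = int(ptr[j])
        path.append(j)
    path.reverse()
    return [candidates[t][path[t]] for t in range(n)]
```

With scalar stand-ins for poses, `silo_decode([[0, 5], [1, 9], [2, 8]], lambda p: 0.0, lambda a, b: abs(a - b))` selects the smooth path `[0, 1, 2]` rather than the jumpy alternatives, which is exactly the disambiguation role temporal continuity plays above.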
The sections that follow describe each of these contributions in more detail. Section 2 describes the SiLo tracking algorithm and places it in the context of previous work. Section 3 describes experimental results using the algorithm. Section 4 concludes with an analysis of the approach's strengths and weaknesses, and a discussion of future work.
Section snippets
SiLo tracking
The algorithm described below takes as its input raw video from a single fixed viewpoint, assumed for simplicity to contain a single human being entirely within the camera frame and unoccluded by other objects. Multiple subjects, partial visibility, and camera motion make the already challenging problem more difficult. Although this paper will at times indicate how such additional complications might be addressed, they fall beyond its focus, and the experiments will all use input that conforms to these assumptions.
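The fixed-viewpoint assumption is what makes silhouette extraction tractable; the paper's references include background-subtraction methods for this step. As a hedged illustration only (the thresholded-differencing scheme and names below are assumptions, not the paper's method), foreground silhouettes against a fixed background might be computed as:

```python
import numpy as np

def extract_silhouette(frame, background, threshold=30):
    """Binary silhouette by thresholded background differencing.
    frame, background: (H, W) or (H, W, 3) uint8 arrays from a fixed
    camera; returns a boolean (H, W) foreground mask."""
    # widen to signed ints so the subtraction cannot wrap around
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    if diff.ndim == 3:                # colour input: take the max over channels
        diff = diff.max(axis=2)
    return diff > threshold
```

A robust system would add shadow suppression and an adaptive background model, as in the Karmann–von Brandt and Horprasert et al. references cited below; the fixed-background version suffices only under the single-subject, static-camera assumptions stated above.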
Experimental results
Quantitative evaluation of 3D pose reconstruction is notoriously difficult, and standard test sets have yet to emerge. It is difficult to obtain ground truth calibrated with real video. This section therefore begins with quantitative results on synthetic input for which ground truth is known. Further experiments apply the methods described above to real video clips that lack ground truth but span a wider range of difficulty.
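Comparison against synthetic ground truth typically reduces to a per-joint position error. The exact error measure used in the paper is not shown in this preview; a minimal sketch of one common choice, mean Euclidean joint error, is:

```python
import numpy as np

def mean_joint_error(predicted, truth):
    """Mean Euclidean distance between corresponding 3D joints,
    averaged over all joints and frames.  Both inputs have shape
    (frames, joints, 3)."""
    predicted = np.asarray(predicted, dtype=float)
    truth = np.asarray(truth, dtype=float)
    per_joint = np.linalg.norm(predicted - truth, axis=2)  # (frames, joints)
    return float(per_joint.mean())
```

Because monocular reconstruction leaves global depth and scale ambiguous, published evaluations often align the skeletons (e.g., by root translation) before applying such a metric; the sketch above omits any alignment.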
Conclusion
The SiLo tracker demonstrates successful self-initialization and error-recovery for 3D pose tracking from monocular video. It infers realistic depth information missing from the 2D input. Like many other current algorithms for monocular 3D pose tracking, it makes some errors, but unlike most techniques it can recover automatically and regain the correct track on subsequent frames without human intervention.
Despite the positive results presented in this paper, silhouette lookup remains an
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant no. IIS-0328741. The training data used in this project was obtained from mocap.cs.cmu.edu. That database was created with funding from NSF EIA-0186217.
References (33)
- Model-based vision: a program to see a walking person, Image and Vision Computing (1983)
- et al., A survey of computer vision-based human motion capture, Computer Vision and Image Understanding (2001)
- et al., Combining the evidence of multiple query representations for information retrieval, Information Processing and Management (1995)
- et al., A flexible image database system for content-based retrieval, Computer Vision and Image Understanding (1999)
- et al., Model-based image analysis of human motion using constraint propagation, IEEE Transactions on Pattern Analysis and Machine Intelligence (1980)
- D. DiFranco, T.-J. Cham, J.M. Rehg, Reconstruction of 3-d figure motion from 2-d correspondences, in: IEEE Computer...
- S. Ioffe, D. Forsyth, Human tracking with mixtures of trees, in: International Conference on Computer Vision, 2001, pp....
- et al., A robust human-silhouette extraction technique for interactive virtual environments
- K.-P. Karmann, A. von Brandt, Moving object recognition using an adaptive background memory, in: Time-Varying Image...
- T. Horprasert, D. Harwood, L. Davis, A robust background subtraction and shadow detection, in: Proceedings of the Asian...
Cited by (34)
- Video quality evaluation toward complicated sport activities for clustering analysis, Future Generation Computer Systems (2021). Citation excerpt: "These fine-grained object details are discriminative for distinguishing different types of hand gestures. To extract all these object details, we employ the BING objectness measure [6] to produce a concise set of object patches. The key advantage of BING is its highly competitive speed."
- Deeply fusing multi-model quality-aware features for sophisticated human activity understanding, Signal Processing: Image Communication (2020). Citation excerpt: "Kolev et al. [9] formulated the color moment to construct 3D-appearance that similar to the evaluated human action profile. Howe et al. [10] calculated the human posture from the known human posture profile sets. The proposed algorithm by Howe et al. was simple and effective."
- Retrieval-based cartoon gesture recognition and applications via semi-supervised heterogeneous classifiers learning, Pattern Recognition (2013). Citation excerpt: "In [34], high local extremum points which correspond to pose images along the motion curve are considered as most representative in key-pose selection. HD is used for pose inferring in [35]. In our application, we use the HD as a distance metric and evaluate extremum values for keyframe determination, and an example result is visualized in Fig. 6(c), where each keyframe is marked by white rectangle."
- 3D human pose recovery from image by efficient visual feature selection, Computer Vision and Image Understanding (2011). Citation excerpt: "The mapping is often approximated using regression models [14]. Alternatively, one can directly rely on a dense training database, leading to the so-called 'examplar-based' methods [15,3,5,16]. Generally speaking, generative methods are more accurate, but the computation is often expensive."
- Silhouette representation and matching for 3D pose discrimination - A comparative study, Image and Vision Computing (2010)
- An image-constrained particle filter for 3D human motion tracking, IEEE Access (2019)
☆ Based on 'Silhouette lookup for automatic pose tracking', by Nicholas R. Howe, which appeared in the IEEE Workshop on Articulated and Non-rigid Motion. ©2004 IEEE.