Local velocity-adapted motion events for spatio-temporal recognition
Introduction
Video interpretation is a key component in many potential applications within video surveillance, video indexing, robot navigation and human–computer interaction. This wide area of application motivates the development of generic methods for video analysis that do not rely on specific assumptions about the particular types of motion, environments and imaging conditions.
In recent years, many successful methods have been proposed that learn and classify motion directly from image measurements [6], [53], [11], [12], [47], [62], [48], [8], [3], [46]. These direct methods are attractive because motion models can be trained from the video data alone. In particular, such methods have been shown to enable recognition of human activities without constructing and matching elaborate models of human bodies [11], [62], [3].
Direct methods for video analysis often rely on dense motion measurements. To enable subsequent recognition with such methods, it is essential that the measurements in the test and the training data correspond to some extent. A simple approach to ensure such correspondence is to accumulate all measurements in the video into global descriptors. Global representations, however, depend on the background motion and do not scale well to complex scenes. To avoid the background problem, many methods deploy motion-based segmentation and compute motion descriptors in segmented regions. Complex scenes with non-rigid backgrounds and motion parallax, however, often make motion-based segmentation unreliable and disrupt subsequent recognition.
In this work, we focus on a local approach to motion recognition. One of the main goals of our method is to avoid the need for segmentation and to enable motion recognition in complex scenes. As a motivation, we observe that local space-time neighborhoods often contain discriminative information. A few examples of such neighborhoods for image sequences with human actions are illustrated in Fig. 1. Here, the similarity of motion in corresponding neighborhoods can be observed despite differences in the appearance and the gross motions of people performing the same type of action. At the same time, the dissimilarity of image data is evident for non-corresponding neighborhoods. From this example it follows that some spatio-temporal neighborhoods may provide sufficient information for identifying corresponding space-time points across image sequences. Such correspondences could be useful for solving different tasks in video analysis. In particular, local correspondence in space-time could be used to formulate methods for motion recognition that do not rely on segmentation and, hence, could be applied to complex scenes.
To investigate this approach and to find corresponding points in space-time, we exploit the spatial and the temporal consistency or coherence of motion measurements between pairs of space-time neighborhoods. Considering all the pairs of neighborhoods for a given pair of sequences is computationally hard. Moreover, neighborhoods with simple motions and simple spatial structures may be ambiguous and may not allow for reliable matching when using local image information only. To address this problem, we select informative neighborhoods with low accidental similarity by maximizing the local spatio-temporal variation of image values over space and time. The detection of such neighborhoods, denoted here as local motion events, has been recently proposed by Laptev and Lindeberg [26] and is summarized in Section 3.
Local motion events (or simply events) are defined in this paper by the position and the shape of associated space–time neighborhoods. Both the shape and the position of events in video may depend on the recording conditions, such as the relative distance and the relative velocity of the camera with respect to the object. Hence, to exploit the inherent motion properties of events, it is important to detect events independently of external transformations that affect the image sequences. Invariance of local motion events with respect to scaling transformations has been previously addressed in [26]. Here, we extend this work and investigate event detection under Galilean transformations arising from the relative motion of the camera. A method for detecting motion events independently of scale and Galilean transformations is presented in Section 3.
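A constant relative camera velocity (v_x, v_y) acts on a sequence f(x, y, t) as the Galilean transformation (x, y, t) → (x + v_x·t, y + v_y·t, t). As an illustrative sketch only (the paper's adaptation operates on the shapes of the detection filters, not by explicitly warping the video), the following NumPy function resamples a sequence under such a transformation; warping with the estimated relative velocity stabilizes a moving pattern:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def galilean_warp(seq, vx, vy):
    """Resample an image sequence f(t, y, x) under a Galilean transformation.

    Each output voxel (t, y, x) is sampled at (t, y + vy*t, x + vx*t),
    which cancels a constant relative camera velocity of (vx, vy)
    pixels/frame. `seq` has shape (T, H, W).
    """
    T, H, W = seq.shape
    t, y, x = np.meshgrid(np.arange(T), np.arange(H), np.arange(W),
                          indexing='ij')
    coords = np.stack([t, y + vy * t, x + vx * t])
    return map_coordinates(seq, coords, order=1, mode='nearest')
```

For example, a dot translating at one pixel per frame becomes stationary after warping with vx = 1.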
To match corresponding events in image sequences, we evaluate the coherence of motion measurements at pairs of space–time neighborhoods. For this purpose, in Section 4 we formulate a set of alternative motion descriptors capturing motion information in the neighborhoods of detected events. Using these descriptors together with associated similarity measures, we demonstrate the matching of corresponding events across image sequences in Section 5. Based on the estimated correspondences, we then define a nearest neighbor (NN) classifier and a support vector machine (SVM) classifier as two alternative methods for recognizing instances of motion classes. Fig. 2 summarizes the four steps of the method in this paper.
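To illustrate the last two of these steps, the sketch below scores the coherence between two sequences as the mean nearest-descriptor distance over their detected events, and an NN classifier then assigns the label of the closest training sequence. The function names and the simple greedy matching rule are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def sequence_distance(descs_a, descs_b):
    """Dissimilarity of two sequences, each given as an (n_events, d)
    array of local event descriptors: the mean distance from each event
    in A to its most coherent (nearest) event in B. A hypothetical
    simplification of event matching for illustration."""
    d = np.linalg.norm(descs_a[:, None, :] - descs_b[None, :, :], axis=2)
    return d.min(axis=1).mean()

def nn_classify(test_descs, train_set):
    """Label a test sequence by the class of the nearest training
    sequence; `train_set` is a list of (descriptors, label) pairs."""
    dists = [sequence_distance(test_descs, d) for d, _ in train_set]
    return train_set[int(np.argmin(dists))][1]
```

An SVM classifier would replace the arg-min by a kernel machine trained on such inter-sequence (dis)similarities.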
In Section 6 we evaluate the different steps of the method. In particular, the influence of local velocity adaptation, as well as the choice of motion descriptors and recognition methods, is analyzed on the problem of recognizing human actions in simple scenes. Results of human action recognition in complex scenes are then presented in Section 6.4. We conclude the paper with a discussion in Section 7.
This work is partly based on results previously presented in [27], [28], [51].
Section snippets
Related work
This work is related to several domains including motion-based recognition, local feature detection, adaptive filtering and human motion analysis. In the area of motion-based recognition, a large number of different schemes have been developed based on various combinations of visual tasks and image descriptors; see e.g. the monograph by Shah and Jain [52] and the survey paper by Gavrila [14] for overviews of early works. Concerning more recent approaches, Yacoob and Black [60] performed
Galilean- and scale-adapted event detection
Space–time interest points have recently been proposed to capture and represent local events in video [26]. Such points have stable locations in space–time and correspond to moving two-dimensional image structures at moments of non-constant motion (see Fig. 3a). A direct approach to detect such points consists of maximizing a measure of the local variations in the image sequence f(x, y, t) over both space (x, y) and time t. To formulate such an approach, consider a scale-space representation L(·, Σ)
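The snippet above states the detection criterion: maxima over space (x, y) and time t of a local variation measure computed from a scale-space representation L of the sequence. A common instantiation of this idea, following the space–time interest point operator of [26], is H = det(μ) − k·trace³(μ), where μ is the locally averaged second-moment matrix of spatio-temporal gradients. The sketch below uses illustrative parameter values, not those of the paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def spacetime_harris(f, sigma=2.0, tau=2.0, s=2.0, k=0.005):
    """Response of the spatio-temporal corner operator
    H = det(mu) - k * trace(mu)^3 on a sequence f(t, y, x).

    `sigma`/`tau` are the spatial/temporal differentiation scales and
    `s` the relative integration scale (illustrative values)."""
    # Scale-space representation L obtained by Gaussian smoothing
    L = gaussian_filter(f, sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)
    # Second-moment matrix mu: outer products of gradients, averaged
    # over a Gaussian integration window of scale s*(sigma, tau)
    mu = np.empty(f.shape + (3, 3))
    grads = (Lx, Ly, Lt)
    for i in range(3):
        for j in range(3):
            mu[..., i, j] = gaussian_filter(
                grads[i] * grads[j],
                sigma=(s * tau, s * sigma, s * sigma))
    return np.linalg.det(mu) - k * np.trace(mu, axis1=-2, axis2=-1) ** 3
```

Local maxima of this response over (x, y, t) give candidate event positions; a brief, localized event (e.g. a blob appearing for one frame) produces a strong response, whereas constant motion does not.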
Local descriptors for space–time neighborhoods
This section presents a set of alternative spatio-temporal image descriptors for the purpose of matching corresponding events in video sequences. To enable the matching, the event descriptors should be both discriminative and invariant with respect to irrelevant variations in image sequences. The method of the previous section will be used here to adapt local motion events to scale and velocity transformations. Other variations, however, such as the individual variations of motion within a class
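As a concrete illustration of this kind of local descriptor (a simple stand-in, not one of the paper's exact alternatives), the sketch below summarizes a space–time patch by a contrast-normalized histogram of spatial gradient orientations:

```python
import numpy as np

def gradient_histogram_descriptor(patch, n_bins=8):
    """A simple local descriptor for a space-time patch f(t, y, x):
    a histogram of spatial gradient orientations weighted by gradient
    magnitude, L1-normalized so the descriptor is invariant to overall
    contrast. An illustrative stand-in for the paper's descriptors."""
    gy, gx = np.gradient(patch, axis=(1, 2))
    ang = np.arctan2(gy, gx)   # orientation in [-pi, pi]
    mag = np.hypot(gx, gy)     # gradient magnitude as weight
    hist, _ = np.histogram(ang, bins=n_bins, range=(-np.pi, np.pi),
                           weights=mag)
    return hist / (hist.sum() + 1e-12)
```

The normalization makes the descriptor unchanged under a global rescaling of image intensities, one example of the invariances discussed above.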
Matching and recognition
To find corresponding events based on the information in motion descriptors, it is necessary to evaluate the similarity of the descriptors. In this work, we use three alternative dissimilarity measures for comparing descriptors defined by the vectors d1 and d2:
- The normalized scalar product
- The Euclidean distance
- The χ²-measure
The normalized scalar product and the Euclidean distance can be applied
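The three measures can be written compactly for descriptor vectors d1 and d2. In the sketch below, the normalized scalar product is converted to a dissimilarity as one minus the cosine similarity, which is a common convention and an assumption here, not necessarily the paper's exact definition:

```python
import numpy as np

def normalized_scalar_product(d1, d2):
    """Dissimilarity derived from the normalized scalar product:
    1 - cosine similarity; zero for parallel descriptors."""
    return 1.0 - d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))

def euclidean(d1, d2):
    """Euclidean distance between descriptor vectors."""
    return np.linalg.norm(d1 - d2)

def chi2(d1, d2):
    """Chi-square measure for non-negative (histogram-type)
    descriptors; empty bins shared by both descriptors are skipped."""
    num = (d1 - d2) ** 2
    den = d1 + d2
    mask = den > 0
    return 0.5 * np.sum(num[mask] / den[mask])
```

The χ²-measure presumes non-negative entries, which is why it is typically paired with histogram-type descriptors.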
Evaluation
In this section, we evaluate the methods described in Section 3 (Galilean- and scale-adapted event detection), Section 4 (local descriptors for space–time neighborhoods) and Section 5 (matching and recognition). We perform the evaluation using video sequences with six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed by different subjects in scenes with homogeneous and complex backgrounds. Scenes with homogeneous backgrounds (see Fig. 11) are used initially to
Summary and discussion
This paper explored the notion of local motion events for motion recognition. The original motivation for the method was to overcome difficulties associated with motion recognition in complex scenes. Towards this goal, the experiments in Section 6.4 confirmed the expected advantage of event-based motion representations by demonstrating promising results for the task of recognizing human actions in complex scenes.
To obtain invariance with respect to relative camera motion we proposed to adapt
Acknowledgments
The support from the Swedish Research Council and from the Royal Swedish Academy of Sciences as well as the Knut and Alice Wallenberg Foundation is gratefully acknowledged.
References
- The visual analysis of human movement: a survey, Comput. Vis. Image Und., 1999.
- Velocity-adaptation of spatio-temporal receptive fields for direct recognition of activities: an experimental study, Image Vis. Comput., 2004.
- Shape-adapted smoothing in estimation of 3-D shape cues from affine deformations of local 2-D brightness structure, Image Vis. Comput., 1997.
- Parameterized modeling and recognition of activities, Comput. Vis. Image Und., 1999.
- Spectral partitioning with indefinite kernels using the Nyström extension.
- EigenTracking: robust matching and tracking of articulated objects using view-based representation, Int. J. Comput. Vis., 1998.
- Recognizing human motion using parameterized models of optical flow.
- M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri, Actions as space–time shapes, in: Proc. 10th Int. Conf. on...
- The recognition of human movement using temporal templates, IEEE Trans. Pattern Anal. Mach. Intell., 2001.
- O. Boiman, M. Irani, Detecting irregularities in images and in video, in: Proc. 10th Int. Conf. on Computer Vision,...
- SVMs for histogram-based image classification, IEEE Trans. Neural Networks.
- A probabilistic sensor for the perception and recognition of activities.
- An Introduction to Support Vector Machines and Other Kernel-based Learning Methods.
- Recognizing action at a distance.
- Motion recognition using nonparametric image motion models estimated from temporal and multiscale co-occurrence statistics, IEEE Trans. Pattern Anal. Mach. Intell.
- Object class recognition by unsupervised scale-invariant learning.
- Representation and recognition of complex human motion.
- Signal processing and pattern recognition.
- Saliency, scale and image description, Int. J. Comput. Vis.
- Generic neighborhood operators, IEEE Trans. Pattern Anal. Mach. Intell.
- Representation of local geometry in the visual system, Biol. Cybern.
- Space–time interest points.