Local velocity-adapted motion events for spatio-temporal recognition

https://doi.org/10.1016/j.cviu.2006.11.023

Abstract

In this paper, we address the problem of motion recognition using event-based local motion representations. We assume that similar patterns of motion contain similar events with consistent motion across image sequences. Using this assumption, we formulate the problem of motion recognition as a matching of corresponding events in image sequences. To enable the matching, we present and evaluate a set of motion descriptors that exploit the spatial and the temporal coherence of motion measurements between corresponding events in image sequences. As the motion measurements may depend on the relative motion of the camera, we also present a mechanism for local velocity adaptation of events and evaluate its influence when recognizing image sequences subjected to different camera motions.

When recognizing motion patterns, we compare the performance of a nearest neighbor (NN) classifier with the performance of a support vector machine (SVM). We also compare event-based motion representations to motion representations in terms of global histograms. A systematic experimental evaluation on a large video database with human actions demonstrates that (i) local spatio-temporal image descriptors can be defined to carry important information of space-time events for subsequent recognition, and that (ii) local velocity adaptation is an important mechanism in situations when the relative motion between the camera and the interesting events in the scene is unknown. The particular advantage of event-based representations and velocity adaptation is further emphasized when recognizing human actions in unconstrained scenes with complex and non-stationary backgrounds.

Introduction

Video interpretation is a key component in many potential applications within video surveillance, video indexing, robot navigation and human–computer interaction. This wide area of application motivates the development of generic methods for video analysis that do not rely on specific assumptions about the particular types of motion, environments and imaging conditions.

In recent years, many successful methods have been proposed that learn and classify motion directly from image measurements [6], [53], [11], [12], [47], [62], [48], [8], [3], [46]. These direct methods are attractive because motion models can be trained from video data alone. In particular, such methods have been shown to enable recognition of human activities without constructing and matching elaborate models of human bodies [11], [62], [3].

Direct methods for video analysis often rely on dense motion measurements. To enable subsequent recognition with such methods, it is essential that the measurements in the test and the training data correspond to some extent. A simple way to ensure such correspondence is to accumulate all measurements in the video into global descriptors. Global representations, however, depend on the background motion and do not scale well to complex scenes. To avoid the background problem, many methods apply motion-based segmentation and compute motion descriptors in segmented regions. Complex scenes with non-rigid backgrounds and motion parallax, however, often make motion-based segmentation unreliable and disrupt subsequent recognition.

In this work, we focus on a local approach to motion recognition. One of the main goals of our method is to avoid the need of segmentation and to enable motion recognition in complex scenes. As a motivation, we observe that local space-time neighborhoods often contain discriminative information. A few examples of such neighborhoods for image sequences with human actions are illustrated in Fig. 1. Here, the similarity of motion in corresponding neighborhoods can be observed despite the difference in the appearance and the gross motions of people performing the same type of action. At the same time, the dissimilarity of image data is evident for non-corresponding neighborhoods. From this example it follows that some of the spatio-temporal neighborhoods may provide sufficient information for identifying corresponding space-time points across image sequences. Such correspondences could be useful for solving different tasks in video analysis. In particular, local correspondence in space-time could be used to formulate methods for motion recognition that do not rely on segmentation and, hence, could be applied to complex scenes.

To investigate this approach and to find corresponding points in space-time, we exploit the spatial and the temporal consistency or coherence of motion measurements between pairs of space-time neighborhoods. Considering all the pairs of neighborhoods for a given pair of sequences is computationally hard. Moreover, neighborhoods with simple motions and simple spatial structures may be ambiguous and may not allow for reliable matching when using local image information only. To address this problem, we select informative neighborhoods with low accidental similarity by maximizing the local spatio-temporal variation of image values over space and time. The detection of such neighborhoods, denoted here as local motion events, has been recently proposed by Laptev and Lindeberg [26] and is summarized in Section 3.

Local motion events (or simply events) are defined in this paper by the position and the shape of associated space–time neighborhoods. Both the shape and the position of events in video may depend on the recording conditions, such as the relative distance and the relative velocity of the camera with respect to the object. Hence, to exploit the inherent motion properties of events, it is important to detect events independently of external transformations that affect the image sequences. Invariance of local motion events with respect to scaling transformations has been previously addressed in [26]. Here, we extend this work and investigate event detection under Galilean transformations arising from the relative motion of the camera. A method for detecting motion events independently of scale and Galilean transformations is presented in Section 3.
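To make the transformation concrete: a Galilean transformation with image velocity (vx, vy) shears space–time coordinates as x′ = x + vx·t, y′ = y + vy·t, t′ = t. The sketch below (the function name is ours, and the paper's velocity adaptation operates on filter kernels rather than by explicit warping, so this is only illustrative) shows the coordinate form:

```python
import numpy as np

def galilean_warp_coords(x, y, t, vx, vy):
    """Map a space-time point through a Galilean transformation with
    image velocity (vx, vy): x' = x + vx*t, y' = y + vy*t, t' = t."""
    return x + vx * t, y + vy * t, t

# Example: a point at (10, 20) at time t = 5, seen from a camera frame
# moving with velocity (2, 0), appears shifted along x by vx*t = 10.
x1, y1, t1 = galilean_warp_coords(10.0, 20.0, 5.0, vx=2.0, vy=0.0)
```

Under this map, an image structure moving with velocity (−vx, −vy) becomes stationary, which is the intuition behind adapting event detection to the relative camera motion.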

To match corresponding events in image sequences, we evaluate the coherence of motion measurements at pairs of space–time neighborhoods. For this purpose in Section 4 we formulate a set of alternative motion descriptors capturing motion information in the neighborhoods of detected events. Using these descriptors together with associated similarity measures we demonstrate the matching of corresponding events across image sequences in Section 5. Based on the estimated correspondences, we then define a nearest neighbor (NN) classifier and a support vector machine (SVM) classifier as two alternative methods for recognizing instances of motion classes. Fig. 2 summarizes the four steps of the method in this paper.
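The matching-and-classification idea above can be sketched as follows. All names here are ours, and the greedy one-to-one matching with Euclidean dissimilarity is a simplification for illustration, not the paper's exact procedure:

```python
import numpy as np

def greedy_match_cost(D1, D2):
    """Greedily match two sets of event descriptors (rows of D1 and D2)
    one-to-one and return the mean dissimilarity of the matched pairs.
    Euclidean distance is used as the descriptor dissimilarity here."""
    dists = np.linalg.norm(D1[:, None, :] - D2[None, :, :], axis=2)
    cost, used = 0.0, set()
    for i in np.argsort(dists.min(axis=1)):  # most confident events first
        j = min((j for j in range(D2.shape[0]) if j not in used),
                key=lambda j: dists[i, j], default=None)
        if j is None:
            break  # no unmatched events left in the second sequence
        used.add(j)
        cost += dists[i, j]
    return cost / max(len(used), 1)

def nn_classify(test_desc, train_set):
    """Nearest-neighbor classification: label a test sequence by the
    training sequence with the lowest event-matching cost.
    train_set is a list of (descriptor_matrix, label) pairs."""
    return min(train_set,
               key=lambda item: greedy_match_cost(test_desc, item[0]))[1]
```

An SVM classifier would replace the final nearest-neighbor step with a kernel built from the same sequence-to-sequence dissimilarities.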

In Section 6 we evaluate different steps of the method. In particular the influence of local velocity adaptation as well as the choice of motion descriptors and recognition methods is analyzed on the problem of recognizing human actions in simple scenes. Results of human action recognition in complex scenes are then presented in Section 6.4. We conclude the paper with the discussion in Section 7.

This work is partly based on results previously presented in [27], [28], [51].

Section snippets

Related work

This work is related to several domains including motion-based recognition, local feature detection, adaptive filtering and human motion analysis. In the area of motion-based recognition, a large number of different schemes have been developed based on various combinations of visual tasks and image descriptors; see e.g. the monograph by Shah and Jain [52] and the survey paper by Gavrila [14] for overviews of early works. Concerning more recent approaches, Yacoob and Black [60] performed

Galilean- and scale-adapted event detection

Space–time interest points have recently been proposed to capture and represent local events in video [26]. Such points have stable locations in space–time and correspond to moving two-dimensional image structures at moments of non-constant motion (see Fig. 3a). A direct approach to detect such points consists of maximizing a measure of the local variations in the image sequence f(x, y, t) over both space (x, y) and time t. To formulate such an approach, consider a scale-space representation L(·, Σ)
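A rough sketch of such a detector is given below, assuming a Harris-like space–time operator H = det(μ) − k·trace³(μ) computed from the 3×3 second-moment matrix μ of the scale-space representation; the parameter values and function name are illustrative, not the paper's exact settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def spacetime_interest_operator(f, sigma=1.5, tau=1.5, s=2.0, k=0.005):
    """Sketch of a space-time interest operator on a sequence f(t, y, x):
    H = det(mu) - k * trace(mu)^3, where mu is the second-moment matrix
    of the scale-space representation L, averaged over a Gaussian window
    at integration scales s*sigma (space) and s*tau (time)."""
    L = gaussian_filter(f.astype(float), sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)
    w = (s * tau, s * sigma, s * sigma)  # integration-scale smoothing
    Axx = gaussian_filter(Lx * Lx, w); Ayy = gaussian_filter(Ly * Ly, w)
    Att = gaussian_filter(Lt * Lt, w); Axy = gaussian_filter(Lx * Ly, w)
    Axt = gaussian_filter(Lx * Lt, w); Ayt = gaussian_filter(Ly * Lt, w)
    # Determinant and trace of the symmetric 3x3 matrix mu
    det = (Axx * (Ayy * Att - Ayt**2)
           - Axy * (Axy * Att - Ayt * Axt)
           + Axt * (Axy * Ayt - Ayy * Axt))
    trace = Axx + Ayy + Att
    return det - k * trace**3

# Event candidates are local maxima of H over (x, y, t).
```

A constant sequence yields H = 0 everywhere; strong responses require variation of image values along all three dimensions, i.e. spatial structure undergoing non-constant motion.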

Local descriptors for space–time neighborhoods

This section presents a set of alternative spatio-temporal image descriptors for the purpose of matching corresponding events in video sequences. To enable the matching, the event descriptors should be both discriminative and invariant with respect to irrelevant variations in image sequences. The method of the previous section will be used here to adapt local motion events to scale and velocity transformations. Other variations, however, such as the individual variations of motion within a class

Matching and recognition

To find corresponding events based on the information in motion descriptors, it is necessary to evaluate the similarity of the descriptors. In this work, we use three alternative dissimilarity measures for comparing descriptors defined by the vectors d1 and d2:

  • The normalized scalar product
    S(d1, d2) = 1 − (Σᵢ d1(i) d2(i)) / √(Σᵢ d1(i)² · Σᵢ d2(i)²)

  • The Euclidean distance
    E(d1, d2) = Σᵢ (d1(i) − d2(i))²

  • The χ²-measure
    χ²(d1, d2) = Σᵢ (d1(i) − d2(i))² / (d1(i) + d2(i))

The normalized scalar product and the Euclidean distance can be applied
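The three dissimilarity measures of this section can be sketched directly in NumPy. For the χ²-measure, descriptors are assumed non-negative (e.g. histograms), and terms with d1(i) + d2(i) = 0 are skipped to avoid division by zero:

```python
import numpy as np

def normalized_scalar_product(d1, d2):
    """S(d1, d2) = 1 - sum_i d1(i)d2(i) / sqrt(sum_i d1(i)^2 * sum_i d2(i)^2)."""
    return 1.0 - np.dot(d1, d2) / np.sqrt(np.sum(d1**2) * np.sum(d2**2))

def squared_euclidean(d1, d2):
    """E(d1, d2) = sum_i (d1(i) - d2(i))^2."""
    return np.sum((d1 - d2) ** 2)

def chi2(d1, d2):
    """chi2(d1, d2) = sum_i (d1(i) - d2(i))^2 / (d1(i) + d2(i)),
    for non-negative histogram-type descriptors."""
    s = d1 + d2
    m = s > 0
    return np.sum((d1 - d2)[m] ** 2 / s[m])
```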

Evaluation

In this section, we evaluate the methods described in Sections 3 (Galilean- and scale-adapted event detection), 4 (local descriptors for space–time neighborhoods) and 5 (matching and recognition), respectively. We perform the evaluation using video sequences with six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed by different subjects in scenes with homogeneous and complex backgrounds. Scenes with homogeneous backgrounds (see Fig. 11) are used initially to

Summary and discussion

This paper explored the notion of local motion events for motion recognition. The original motivation for the method was to overcome difficulties associated with motion recognition in complex scenes. Towards this goal, the experiments in Section 6.4 confirmed the expected advantage of event-based motion representations by demonstrating promising results for the task of recognizing human actions in complex scenes.

To obtain invariance with respect to relative camera motion we proposed to adapt

Acknowledgments

The support from the Swedish Research Council and from the Royal Swedish Academy of Sciences as well as the Knut and Alice Wallenberg Foundation is gratefully acknowledged.

References (62)

  • O. Chapelle et al., SVMs for histogram-based image classification, IEEE Trans. Neural Networks (1999)
  • O. Chomat et al., A probabilistic sensor for the perception and recognition of activities
  • N. Cristianini et al., An Introduction to Support Vector Machines and Other Kernel-based Learning Methods (2000)
  • P. Dollár, V. Rabaud, G. Cottrell, S. Belongie, Behavior recognition via sparse spatio-temporal features, in: VS-PETS, ...
  • A.A. Efros et al., Recognizing action at a distance
  • R. Fablet et al., Motion recognition using nonparametric image motion models estimated from temporal and multiscale co-occurrence statistics, IEEE Trans. Pattern Anal. Mach. Intell. (2003)
  • R. Fergus et al., Object class recognition by unsupervised scale-invariant learning
  • J.M. Gryn, R.P. Wildes, J.K. Tsotsos, Detecting motion patterns via direction maps with application to surveillance, ...
  • C. Harris, M.J. Stephens, A combined corner and edge detector, in: Alvey Vision Conference, 1988, pp. ...
  • J. Hoey et al., Representation and recognition of complex human motion
  • B. Jähne et al., Signal processing and pattern recognition
  • T. Kadir et al., Saliency, scale and image description, Int. J. Comput. Vis. (2001)
  • Y. Ke, R. Sukthankar, PCA-SIFT: a more distinctive representation for local image descriptors, Technical Report ...
  • Y. Ke, R. Sukthankar, M. Hebert, Efficient visual event detection using volumetric features, in: Proc. 10th Int. Conf. ...
  • J.J. Koenderink et al., Generic neighborhood operators, IEEE Trans. Pattern Anal. Mach. Intell. (1992)
  • J.J. Koenderink et al., Representation of local geometry in the visual system, Biol. Cybern. (1987)
  • I. Laptev, Local Spatio-Temporal Image Features for Motion Interpretation, Ph.D. thesis, Department of Numerical ...
  • I. Laptev, S. Belongie, P. Pérez, J. Wills, Periodic motion detection and segmentation via approximate sequence ...
  • I. Laptev et al., Space–time interest points
  • I. Laptev, T. Lindeberg, Local descriptors for spatio-temporal recognition, in: First Int. Workshop on Spatial ...
  • I. Laptev, T. Lindeberg, Velocity adaptation of space–time interest points, in: Proc. Int. Conf. on Pattern ...