Action recognition with appearance–motion features and fast search trees

https://doi.org/10.1016/j.cviu.2010.11.002

Abstract

In this paper we propose an approach to action recognition based on a vocabulary of local appearance–motion features and fast approximate search in a large number of trees. Large numbers of features with associated motion vectors are extracted from video data and are represented by many trees. Multiple interest point detectors are used to provide features for every frame. The motion vectors for the features are estimated using optical flow and descriptor-based matching. The features are combined with image segmentation to estimate dominant homographies, and are then separated into static and moving ones despite the camera motion. Features from a query sequence are matched to the trees and vote for action categories and their locations. The large number of trees makes the process efficient and robust. The system is capable of simultaneous categorisation and localisation of actions using only a few frames per sequence. The approach obtains excellent performance on standard action recognition sequences. We perform large-scale experiments on 17 challenging real action categories from various sport disciplines, and demonstrate the robustness of our method to appearance variations, camera motion, scale change, asymmetric actions, background clutter and occlusion.

Research highlights

► A system for object-action recognition and localisation based on local features and the Hough transform.
► Randomised kd-trees for fast matching of local descriptors.
► Estimation of multiple background planes for camera motion compensation.

Introduction

Significant progress has been made in the classification of static scenes, and action recognition is receiving increasing attention in the computer vision community. Many existing methods [5], [9], [37], [40], [45], [49] obtain high classification scores for simple action sequences with exaggerated motion and static, uniform backgrounds in controlled environments. It is, however, hard to establish a visual correspondence between a lab-controlled action and a real action of the same category, as their appearance, motion and clutter differ significantly. Real action scenes represent a challenge rarely addressed in the literature. Our main goal in this paper is not only to propose a generic solution which can handle actions in real environments, but also to demonstrate how performance can differ between the controlled environment and the real one. An illustration of frequently used action recognition benchmark data and a real example of the same category is shown in Fig. 1. The need for using real-world data is also argued in the image recognition community, e.g., the 2010 edition of the Pascal Visual Object Classes Challenge [6] includes an action recognition teaser. The TREC Video Retrieval Evaluation is another example of a large community effort to provide real application data and an independent evaluation benchmark. TRECVid also introduced a human action detection task in airport surveillance videos. Initial results for this data with spatio-temporal descriptors and bag-of-words are reported in [53].

In this paper we address the problem of recognising object-actions with a data-driven method which does not require long sequences or high-level reasoning. We build on our previous work from [33], [35], [48], which we combine and whose components we modify to optimise efficiency and recognition accuracy. The main contribution is a generic solution to action classification including localisation of the objects performing the actions. We draw from recent work on recognition and retrieval of static images [23], [32], [42]. Our approach follows the popular paradigm of local features, vocabulary-based representation and voting. Such systems have been very successful in retrieval and recognition of static images. However, recognition of actions is a distinct problem, and a number of modifications must be made to adapt this paradigm to the new application scenario. Compared to existing approaches, which usually focus on one of the issues associated with action recognition and make strong assumptions about the camera motion or background, our system can deal with appearance variations, camera motion, scale change, asymmetric actions, background clutter and occlusion. So far, very little has been done to address all of these issues simultaneously. The key idea explored here is the use of a large number of features represented in many search trees, in contrast to many existing action classification methods based on a single, small codebook and an SVM [5], [38], [45]. The same message comes from static object recognition [6], where efficient search methods using many different features from a large number of examples provide the best results. The advantage of using multiple trees has been demonstrated in image retrieval [42]. In this paper the trees are built from various types of features representing appearance–action models, and are learnt efficiently from videos as well as from static images. Moreover, we use a simple nearest neighbour classifier, unlike other methods based on SVMs [5], [38], [45], although for comparison we also provide results with SVMs.

Another contribution of this paper is a feature extraction approach with dominant motion compensation. The features represent local appearance combined with local motion information extracted from a video regardless of camera motion and background clutter. These features allow accurate classification and localisation of multiple actions performed simultaneously.

Among other contributions, we implement an object-action representation which makes it possible to hypothesise an action category, its location and its pose from a single feature. We show how to make use of static training data and static features to support an action hypothesis. In contrast to other systems, our method can simultaneously classify the entire sequence as well as recognise and localise multiple actions within it. Finally, we consider the new action categories and the recognition results reported in this paper as one of our major contributions.

Early approaches to human-centred motion analysis were reviewed in [1] under three categories: body parts motion analysis, tracking, and activity recognition. It is argued there that ‘the key to successful execution of high-level tasks is to establish feature correspondence between consecutive frames, which still remains a bottleneck in the whole processing’. A decade has passed since this review but the observation is still valid, and our approach focuses on improving the quality of feature correspondence for the action recognition task.

Recently, a boosted space–time window classifier from [18] was applied to real movie sequences in [21], [22]. However, boosting systems are known to require weak classifiers and a large number of training examples to generalise, otherwise the performance is low. Also, space–time features, space–time pyramids and multi-channel non-linear SVMs do not allow for efficient processing of large databases.

Another frequently followed class of approaches is based on spatio-temporal features computed globally [3], [7], [52] or locally [5], [9], [38], [45]. Both suffer from various drawbacks. Global methods cannot recognise multiple actions simultaneously or localise them spatially. In these methods, recognition can be done by computing similarity between globally represented actions using cross-correlation [7] or histograms of spatio-temporal gradients [52]. Spatio-temporal interest points [20] result in a very memory-efficient representation but are too sparse to build action models robust to camera motion, background clutter, occlusion, motion blur, etc. Moreover, local features are often used to represent the entire sequence as a distribution, which in the end results in a global representation. It was demonstrated in [49] that as few as 5–25 spatio-temporal interest points give high recognition performance on standard test data. We argue that this number is insufficient for real actions. The need for more features was observed in [5], where the Harris interest point detector was combined with a Gabor filter to extract more spatio-temporal points. This argument was also emphasised in [9], [38], which proposed a hybrid of spatio-temporal and static features to improve recognition performance. This shifts the attention from motion towards the appearance of objects performing actions. In this context it seems more appropriate to address the object-action categorisation problem rather than actions via motion only.

A different class of approaches relies upon the strong assumption that body parts can be reliably tracked [43], even though existing tracking tools often fail on real video data. These methods use a relatively large temporal extent and recognise more complex actions, often viewed from multiple cameras, and are thus less relevant to this work.

Another frequent assumption in the literature is a static camera and uniform background. This is valid for many surveillance applications but not for the general action recognition problem. In [21] the camera was assumed to be fixed. The action recognition approach from [51], which claims to be the first to deal with camera motion, explored multi-view geometry. This solution, however, requires a multiple-camera setup or very similar actions captured from different viewpoints, which limits the range of possible applications. Other work relevant to camera motion estimation and dominant plane segmentation performs combined motion and image segmentation [47] or plane estimation [50], but these methods are concerned with either precise segmentation of moving regions or accurate reconstruction of 3D scene structure. Iterative estimation of dominant planes based on optical flow was also explored for robot navigation in [41]. A recent approach that addresses similar problems is [26]: a divisive information-theoretic algorithm is employed to group semantically related features, and AdaBoost is chosen to integrate all features for recognition. Realistic data was considered, but the training complexity of this approach is high.

A number of recently proposed methods attempt to deal with complex scenes, background clutter, occlusion, and large variations in appearance and motion. Volumetric features were extended in [19] to handle occlusion of individual features. Colour-based segmentation of spatio-temporal volumes was applied to the video, and the volumetric segments were then efficiently matched to models. Sequences without significant camera motion were considered and manual interaction was required.

The complementary nature of appearance and motion features was also emphasised in [15]. Their main concern, however, was the inaccurate alignment of action locations in space–time volumes, which makes it difficult to match test examples against the training data. A static camera was assumed, and background subtraction was used for extracting appearance features. Their main contribution was a simulated-annealing multiple-instance-learning SVM that iteratively evaluates action candidates and allows relabelling of training examples with low scores in order to build a more accurate classifier. The classification, however, requires initialisation of the action location, which was achieved with a head detector.

Another related approach was proposed in [17]. It addresses problems similar to those in our paper, but in a different way. Spatio-temporal interest points were considered as votes in the video volume, with weights estimated as log-likelihood ratios calculated from positive and negative matches. Fast matching was done using Locality Sensitive Hashing [13]. The main contribution, however, was an efficient extension of the sliding window with branch-and-bound to search the 3D volume for sub-volumes containing actions. Our approach avoids the problem of searching for sub-volumes, as action locations are indicated by local maxima in the voting space rather than by undefined volumes. Moreover, the problem of camera motion is not addressed there.

A biologically inspired system based on a neurobiological model of motion processing was investigated in [16] as an extension of their image classification method. The system consists of a hierarchy of spatio-temporal feature detectors of increasing complexity, followed by an SVM classifier. Background subtraction was applied to reduce the area over which the proposed high-complexity features were computed.

An efficient prototype-based approach for action recognition robust to moving cameras and dynamic backgrounds was proposed in [25]. Tree-based prototype matching and look-up table indexing were adopted to search a large collection of prototypes. The proposed representation is a binary silhouette combined with motion fields computed for the bounding box of the human performing the action. A human detector or background subtraction is necessary for this system to provide bounding boxes. Moreover, the motion compensation is done by a simple median of flow vectors, which assumes a single plane normal to the camera axis.

In contrast to the methods discussed above, our approach does not require additional object detectors or manual interaction, does not assume a simple background model or a static camera, and, last but not least, is applicable to any object-action category, including humans.

The main components of the system are illustrated in Fig. 2. The representation of object-action categories is based on multiple vocabulary trees. Training of object-action representations starts with feature extraction, which includes scale-invariant feature detection, motion estimation and region description, discussed in Section 2. Motion vectors of features are then compensated with the dominant motion, which is estimated in Section 3. These features are used to build the appearance–motion model of object-action categories discussed in Section 4. Fast matching is done by Approximate Nearest Neighbour search with randomised kd-trees, as explained in Section 5. Section 5.2 discusses recognition, where features and their motion vectors are first extracted from the query sequence and matched to the trees. The features that match tree nodes accumulate scores for different categories and vote for their locations and scales within a frame. The learning and recognition process is very efficient due to the use of many trees and the highly parallelised architecture discussed in Section 5.3. Finally, experimental results are presented in Section 6.

Section snippets

Local features

Local features with associated motion vectors are the crucial elements of our object-action representation. The features are localised by four different detectors and described by a gradient location-orientation descriptor. The dimensionality of the descriptors is reduced and the features are tracked to obtain motion vectors. The details of these operations are given in the following sections; a sketch of this step is given below.
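As a rough illustration only, the following is a minimal Python sketch assuming OpenCV, in which SIFT stands in for the paper's four detectors and gradient location-orientation descriptor, and pyramidal Lucas–Kanade flow stands in for the paper's optical flow and descriptor-based matching; the dimensionality reduction step is omitted.

```python
# Minimal sketch: per-frame features with associated motion vectors.
# Assumes OpenCV; SIFT replaces the paper's detectors/descriptor for brevity.
import cv2
import numpy as np

def features_with_motion(prev_gray, curr_gray):
    """Detect and describe keypoints in one frame, then estimate a motion
    vector per feature by tracking each keypoint into the next frame."""
    sift = cv2.SIFT_create()
    kps, descs = sift.detectAndCompute(prev_gray, None)
    pts = np.float32([kp.pt for kp in kps]).reshape(-1, 1, 2)

    # Pyramidal Lucas-Kanade flow yields one displacement per keypoint.
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    motion = (nxt - pts).reshape(-1, 2)   # per-feature motion vectors
    ok = status.ravel() == 1              # keep successfully tracked features
    return [kp for kp, good in zip(kps, ok) if good], descs[ok], motion[ok]
```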

Motion compensation

Given a number of features with associated motion vectors extracted from consecutive frames, the problem is to separate the local motions characterising the actions from the dominant camera motion. The frequently used single-plane assumption does not hold in many applications, as there are often both a ground plane and a background plane in outdoor scenes, or even more planes in indoor scenes. This requires segmenting the image into multiple dominant motion planes, which can then be used to correct the motion vectors of the features.
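A minimal sketch of this idea, assuming OpenCV: dominant homographies are fitted iteratively with RANSAC, refitting on the outliers of the previous plane, and features whose residual motion exceeds a threshold are labelled as moving. The plane count and threshold are illustrative values, not the paper's settings, and the paper's combination with image segmentation is omitted.

```python
# Minimal sketch: separate feature motion from dominant (camera) motion
# by iteratively fitting homographies with RANSAC. Illustrative only.
import cv2
import numpy as np

def split_static_moving(pts, motion, n_planes=2, resid_thresh=2.0):
    """pts: Nx2 feature positions; motion: Nx2 flow vectors.
    Returns a boolean 'moving' mask and per-feature residual motion."""
    src = pts.astype(np.float32)
    dst = (pts + motion).astype(np.float32)
    residual = np.full(len(pts), np.inf)
    remaining = np.arange(len(pts))

    for _ in range(n_planes):
        if len(remaining) < 4:            # a homography needs >= 4 points
            break
        H, inliers = cv2.findHomography(src[remaining], dst[remaining],
                                        cv2.RANSAC, resid_thresh)
        if H is None:
            break
        # Residual = observed displacement minus plane-induced displacement.
        proj = cv2.perspectiveTransform(src[remaining].reshape(-1, 1, 2), H)
        r = np.linalg.norm(dst[remaining] - proj.reshape(-1, 2), axis=1)
        residual[remaining] = np.minimum(residual[remaining], r)
        remaining = remaining[inliers.ravel() == 0]   # refit on outliers

    return residual > resid_thresh, residual
```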

Action representation

This section discusses the representation of object-action categories used in our system. We first present the model that captures relations between features and then discuss the vocabulary construction.
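To make the relational model concrete, the sketch below shows Hough-style voting in the spirit of the implicit shape model of [23]: each matched feature casts a weighted vote for an object-action centre using an offset learnt at training time. The vote tuple layout and grid resolution are assumptions for illustration, not the paper's exact scheme.

```python
# Illustrative sketch: features vote for object-action centre locations.
import numpy as np

def vote_locations(matches, frame_shape, cell=8):
    """matches: iterable of (x, y, scale, dx, dy, weight), where (dx, dy)
    is the training-time offset from feature to object centre (stored
    relative to the training feature's scale) and weight is the match score."""
    h, w = frame_shape
    acc = np.zeros((h // cell + 1, w // cell + 1))
    for x, y, scale, dx, dy, weight in matches:
        cx, cy = x + dx * scale, y + dy * scale   # predicted object centre
        if 0 <= cx < w and 0 <= cy < h:
            acc[int(cy) // cell, int(cx) // cell] += weight
    # Local maxima of the accumulator give location hypotheses; here we
    # simply return the strongest peak.
    py, px = np.unravel_index(np.argmax(acc), acc.shape)
    return acc, (px * cell, py * cell)
```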

Recognition with search trees

In this section we present our approach to fast matching of features to the vocabulary. We exploit the idea of randomised kd-trees successfully used in image retrieval [42]. In contrast to [35], where the objective was to cluster and compress the amount of information in the feature set, we now focus on Approximate Nearest Neighbour (ANN) search [2], [33] with kd-trees, which is much more efficient than flat codebooks [5], [23], [37], [38], [45] or metric trees [33], [35], [39].
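For illustration, a minimal sketch of ANN matching over multiple randomised kd-trees, assuming OpenCV's FLANN bindings (which implement this index type); the tree and check counts are illustrative, not the paper's settings.

```python
# Minimal sketch: approximate NN matching with randomised kd-trees (FLANN).
import cv2
import numpy as np

FLANN_INDEX_KDTREE = 1   # FLANN's randomised kd-tree algorithm

def build_matcher(vocab_descriptors, trees=8, checks=64):
    """Build a FLANN matcher over vocabulary descriptors (Nx128 float32).
    'trees' sets the number of randomised kd-trees; 'checks' bounds the
    number of leaves visited, trading accuracy for speed."""
    matcher = cv2.FlannBasedMatcher(
        dict(algorithm=FLANN_INDEX_KDTREE, trees=trees),
        dict(checks=checks))
    matcher.add([vocab_descriptors.astype(np.float32)])
    matcher.train()
    return matcher

# Usage: one approximate nearest vocabulary entry per query feature.
# matcher = build_matcher(vocab)
# matches = matcher.knnMatch(query_descs.astype(np.float32), k=1)
```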

Experimental results

This section discusses the experimental setup, results for motion compensation and performance of our action recognition system on various data.

Conclusions

In this paper we have presented and evaluated an approach to action recognition via local features, tracking and camera motion compensation. A number of recent action recognition methods have been reviewed and their limitations have been discussed.

The system is capable of simultaneous recognition and localisation of various object-actions within the same sequence. It works on data from an uncontrolled environment with camera motion, background clutter and occlusion. The key idea here is the use of a large number of local features represented in many search trees.

Acknowledgment

This research was supported by UK EPSRC Grant EP/F003420/1 and BBC Research and Development.

References (52)

  • J.K. Aggarwal et al., Human motion analysis: a review, Computer Vision and Image Understanding, 1999.
  • S. Arya et al., An optimal algorithm for approximate nearest neighbor searching in fixed dimensions, Journal of the Association for Computing Machinery, 1998.
  • M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri, Actions as space–time shapes, in: Proceedings of the...
  • P. Dollar, V. Rabaud, G. Cottrell, S. Belongie, Behavior recognition via sparse spatio-temporal features, in:...
  • M. Everingham et al., The PASCAL Visual Object Classes Challenge,...
  • A. Efros, A. Berg, G. Mori, J. Malik, Recognizing action at a distance, in: Proceedings of the International Conference...
  • P. Felzenszwalb et al., Efficient graph based image segmentation, International Journal of Computer Vision, 2005.
  • C. Fanti, L. Zelnik-Manor, P. Perona, Hybrid models for human motion recognition, in: Proceedings of the International...
  • V. Ferrari, F. Jurie, C. Schmid, Accurate object detection with deformable shape models learnt from images, in: The...
  • K. Fukunaga, Introduction to Statistical Pattern Recognition, 1990.
  • L. Gorelick et al., Actions as space–time shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
  • P. Indyk, R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, in: Proceedings of...
  • M. Irani et al., A unified approach to moving object detection in 2D and 3D scenes, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998.
  • Y. Hu, L. Cao, F. Lv, S. Yan, Y. Gong, T.S. Huang, Action detection in complex scenes with spatial and temporal...
  • H. Jhuang, T. Serre, L. Wolf, T. Poggio, A biologically inspired system for action recognition, in: Proceedings of the...
  • J. Yuan, Z. Liu, Y. Wu, Discriminative subvolume search for efficient action detection, in: Proceedings of the IEEE...
  • Y. Ke, R. Sukthankar, M. Hebert, Efficient visual event detection using volumetric features, in: Proceedings of the...
  • Y. Ke, R. Sukthankar, M. Hebert, Event detection in crowded videos, in: Proceedings of the International Conference of...
  • I. Laptev, On space–time interest points, International Journal of Computer Vision, 2005.
  • I. Laptev, P. Perez, Retrieving actions in movies, in: Proceedings of the International Conference of Computer Vision,...
  • I. Laptev, M. Marszałek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, in: Proceedings of the...
  • B. Leibe et al., Robust object detection with interleaved categorization and segmentation, International Journal of Computer Vision, 2007.
  • B. Leibe, K. Mikolajczyk, B. Schiele, Efficient clustering and matching for object class recognition, in: Proceedings...
  • Z. Lin, Z. Jiang, L.S. Davis, Recognizing actions by shape–motion prototype trees, in: Proceedings of the International...
  • J. Liu, J. Luo, M. Shah, Recognizing realistic actions from videos “in the Wild”, in: Proceedings of the IEEE...
  • D. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision, 2004.