Image and Vision Computing

Volume 27, Issue 10, 2 September 2009, Pages 1515-1526

Histogram of oriented rectangles: A new pose descriptor for human action recognition

https://doi.org/10.1016/j.imavis.2009.02.002

Abstract

Most approaches to human action recognition form complex models that require extensive parameter estimation and computation time. In this study, we show that human actions can be represented simply by pose, without a complex representation of the dynamics. Based on this idea, we propose a novel pose descriptor, which we call the Histogram-of-Oriented-Rectangles (HOR), for representing and recognizing human actions in videos. We represent each human pose in an action sequence by oriented rectangular patches extracted over the human silhouette. We then form spatial oriented histograms to represent the distribution of these rectangular patches. We make use of several matching strategies to carry the information captured by the HOR descriptor from the spatial domain to the temporal domain. These are (i) nearest neighbor classification, which recognizes actions by matching the descriptors of each frame; (ii) global histogramming, which extends the Motion Energy Image idea of Bobick and Davis to rectangular patches; (iii) a classifier-based approach using Support Vector Machines; and (iv) an adaptation of Dynamic Time Warping to the temporal representation of the HOR descriptor. For cases where the pose descriptor alone is not sufficiently discriminative, such as differentiating "jogging" from "running", we also incorporate a simple velocity descriptor as a prior to the pose-based classification step. We test our system with different configurations and experiment on two commonly used action datasets: the Weizmann dataset and the KTH dataset. Our method outperforms other methods on the Weizmann dataset with a perfect accuracy of 100%, and is comparable to other methods on the KTH dataset with a success rate close to 90%. These results show that robust recognition of human actions can be achieved with a simple and compact representation, without resorting to complex ones.

Introduction

Human action recognition is one of the appealing, yet challenging problems of computer vision. Reliable and effective solutions to this problem can serve many areas, ranging from human–computer interaction to security surveillance. However, current solutions are still limited, and understanding what people are doing in video remains an unsolved problem.

Human action recognition has been a widely studied topic (for extensive reviews see [[1], [2]]), but the solutions proposed to date remain premature and specific to the dataset at hand.

There are three key elements that define an action:

  • pose of the body (and parts),

  • speed of body motion (and parts),

  • relative ordering of the poses.

We can formulate action recognition as a mixture of these three elements. The relative importance of each element depends on the nature of the actions to be recognized. For example, to differentiate an instance of a "bend" action from a "walk" action, the pose of the human figure gives sufficient information. However, to discriminate between "jog" and "run" actions, the pose alone may not be enough, since these actions are very similar in the pose domain; in such cases, speed information needs to be incorporated. Similarly, for recognizing "stand up" and "sit down" actions, the relative ordering of the poses is important, since these two actions contain the same poses in reverse temporal order.
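To make the speed component concrete, the following is a minimal sketch of a simple velocity measure of the kind that could complement a pose descriptor, assuming per-frame silhouette masks are available; the centroid-based definition and the frame-rate handling are our own illustrative assumptions, not the exact velocity descriptor used in the paper.

```python
import numpy as np

def velocity_descriptor(silhouettes, fps=25.0):
    """Mean horizontal speed of the silhouette centroid (illustrative).

    silhouettes : list of 2-D binary masks, one per frame.
    Returns the average centroid displacement per second, which is
    enough to separate fast actions (run) from slower ones (jog/walk).
    """
    centroids = []
    for mask in silhouettes:
        ys, xs = np.nonzero(mask)
        centroids.append(xs.mean())          # horizontal centre of mass
    diffs = np.abs(np.diff(centroids))       # per-frame displacement
    return diffs.mean() * fps                # pixels per second
```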

Various attempts in action recognition literature try to model some or all of these aspects. For instance, methods based on spatio-temporal templates mostly pay attention to the pose of the human body, whereas methods based on dynamical models focus on modeling the ordering of these poses in greater detail.

We argue that the human pose encapsulates many useful clues for recognizing the ongoing activity. Actions can mostly be represented by configurations of the body parts, before building complex models for understanding the dynamics.

Using this idea, we base our method on describing the pose of the human body to discriminate actions, and by introducing a new pose descriptor, we evaluate how far we can go with only a good description of the body pose. We also evaluate how our system benefits from adding the remaining action components whenever necessary. Unlike most methods that use complex modeling of body configurations, we follow the analogy of Forsyth and Fleck [3], representing the body as a set of rectangles and exploring the layout of these rectangles.

Our pose descriptor is based on a basic intuition: the human body can be represented by a collection of oriented rectangles in the spatial domain and the orientations of these rectangles form a signature for each action. Rather than detecting and learning the exact configuration of body parts, we are only interested in the distribution of the rectangular regions which may be the candidates for the body parts.

This idea is similar to the bag-of-words approach, in which images are represented by a collection of regions, ignoring their spatial relationships. The bag-of-words approach – adapted from the text retrieval literature – has been shown to be successful for object and scene recognition [[4], [5]] and for annotation and retrieval of large image and video collections [[6], [7]]. In such approaches, images are represented by the distribution of words from a fixed visual vocabulary (i.e. image patches), usually obtained by vector quantization of visual features. In our approach, we use rectangles as our visual words and achieve vector quantization by histogramming over their orientation angles. However, our approach differs from bag-of-words in two fundamental ways. First, we use the distribution of simple rectangular regions as opposed to complex visual words. Second, we place a grid over these rectangles to capture their spatial layout.
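To make this quantization step concrete, here is a minimal sketch of how a single candidate rectangle could be mapped to a histogram bin from its orientation and its position within the person's bounding box; the function name `rect_to_bin` and the bin counts (12 orientations over 180°, a 3 × 3 grid) are illustrative assumptions, not parameters fixed by the text above.

```python
def rect_to_bin(angle_deg, cx, cy, box_w, box_h,
                n_orientations=12, grid=3):
    """Quantize one rectangle into a (grid-row, grid-col, orientation) bin.

    angle_deg    : rectangle orientation in [0, 180).
    cx, cy       : rectangle centre inside the person's bounding box.
    box_w, box_h : bounding-box width and height in pixels.
    """
    o = int(angle_deg // (180.0 / n_orientations)) % n_orientations
    gx = min(int(cx / box_w * grid), grid - 1)   # grid column
    gy = min(int(cy / box_h * grid), grid - 1)   # grid row
    return gy, gx, o
```

A frame's descriptor would then simply be the (grid × grid × n_orientations) array of bin counts accumulated over all candidate rectangles, normalized to sum to one.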

In this study, our main contribution is to show how a good pose descriptor can boost the performance of action recognition. We introduce a novel pose descriptor which is based on candidate rectangular regions over the human body. We show that using our pose descriptor, we can recognize human actions even in complicated settings.

In the rest of the paper, we first give a brief overview of the literature on human action recognition. Then, we present the details of our pose descriptor, which represents the human figure as a distribution of oriented rectangular patches. After that, we describe the matching methods that can be applied to our pose descriptor for efficient identification of human actions, namely nearest neighbor classification, global histogramming, SVM classification, and Dynamic Time Warping. We test our system with different configurations and compare the results to state-of-the-art action recognition methods. We also provide run-time evaluations of our system. After reporting comprehensive experiments and their results, we conclude with future research directions.

Section snippets

Related work

There are three major approaches to human action understanding in videos. The first is to use temporal logics to represent crucial ordering relations between states that constrain activities. Examples of such approaches include Pinhanez and Bobick [[8], [9]], who described a method based on interval algebra. In addition, Siskind [10] described methods to infer activities related to objects using a form of logical inference.

The second general approach to recognizing human motion is to use models

Histogram of oriented rectangles as a new pose descriptor

Following the body plan analogy of Forsyth and Fleck [3], we represent the human body as a collection of rectangular patches, and we base our motion understanding approach on the observation that the orientations and positions of these rectangles change over time according to the action being carried out. With this intuition, our algorithm first extracts rectangular patches over the human figure in each frame, and then forms a spatial histogram of these rectangles by grouping over orientations.
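As a concrete illustration of the extraction step, here is a minimal sketch assuming candidate rectangles are found by convolving the silhouette with rotated rectangular templates and thresholding the response; the template size, the 15° orientation step, and the fit threshold are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np
from scipy import ndimage

def candidate_rectangles(silhouette, rect_size=(25, 9),
                         n_orientations=12, fit_threshold=0.75):
    """Find (y, x, angle) triples where an oriented rectangle fits the mask.

    silhouette : 2-D binary mask of the human figure (1 = foreground).
    A normalized rectangular template is rotated in 180/n_orientations
    degree steps; positions where the convolution response is high are
    taken as candidate body-part rectangles of that orientation.
    """
    template = np.ones(rect_size, dtype=float)

    candidates = []
    for k in range(n_orientations):
        angle = k * 180.0 / n_orientations
        # Bilinear interpolation (order=1) avoids interpolation overshoot.
        rot = ndimage.rotate(template, angle, reshape=True, order=1)
        rot /= rot.sum()                       # response lies in [0, 1]
        response = ndimage.convolve(silhouette.astype(float), rot,
                                    mode='constant')
        # Keep positions where >= fit_threshold of the rectangle's area
        # falls on the foreground silhouette.
        for y, x in zip(*np.nonzero(response > fit_threshold)):
            candidates.append((y, x, angle))
    return candidates
```

Each (y, x, angle) candidate could then be accumulated into the spatial orientation histogram, for example with a quantizer like the hypothetical `rect_to_bin` sketched earlier.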

Recognizing actions with histograms of oriented rectangles

After calculating the pose descriptors for each frame, we perform action classification in a supervised manner. We apply four matching methods in order to evaluate the performance of our pose descriptor on action classification problems.
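As an example of the simplest of these schemes, here is a minimal sketch of frame-wise nearest neighbor matching, assuming per-frame HOR descriptors, a squared Euclidean distance, and majority voting over frames; the distance choice and voting rule are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def classify_nn(test_frames, train_frames, train_labels):
    """Frame-wise nearest-neighbour action classification (illustrative).

    test_frames  : (T, D) HOR descriptors of the query sequence.
    train_frames : (N, D) descriptors pooled from all training frames.
    train_labels : length-N array of action labels, one per training frame.
    Each test frame votes for the label of its nearest training frame;
    the sequence is assigned the majority label.
    """
    votes = []
    for f in test_frames:
        dists = np.sum((train_frames - f) ** 2, axis=1)  # squared Euclidean
        votes.append(train_labels[np.argmin(dists)])
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]
```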

Datasets

We test the effectiveness of our method on two datasets, the Weizmann dataset and the KTH dataset, which are the current benchmark datasets in the action recognition literature.

Weizmann dataset: This is the dataset that Blank et al. introduced in [21]. We used the same set of actions as in [21], which is a set of nine actions: walk, run, jump, gallop sideways, bend, one-hand wave, two-hands wave, jump in place and jumping jack. Example frames from this

Conclusions and future work

In this paper, we have approached the problem of human action recognition and proposed a new pose descriptor based on the orientation of body parts. Our pose descriptor is simple and effective; we extract the rectangular regions from a human silhouette and form a spatial oriented histogram of these rectangles. We show that, by effective classification of such histograms, reliable human action recognition is possible. We demonstrate the effectiveness of our method over the state-of-the-art

Acknowledgements

This work has been supported by TUBITAK grants 104E065, 104E077 and 105E065.

References (41)

  • J.M. Siskind, Reconstructing force-dynamic models from video sequences, Artificial Intelligence (2003).
  • N. Oliver et al., Layered representations for learning and inferring office activity from multiple sensory channels, Computer Vision and Image Understanding (2004).
  • S. Hongeng et al., Video-based event recognition: activity representation and probabilistic recognition methods, Computer Vision and Image Understanding (2004).
  • W. Hu et al., A survey on visual surveillance of object motion and behaviors, IEEE Transactions on Systems, Man, and Cybernetics – Part C: Applications and Reviews (2004).
  • D. Forsyth et al., Computational studies of human motion I: tracking and animation, Foundations and Trends in Computer Graphics and Vision (2006).
  • D. Forsyth, M. Fleck, Body plans, in: IEEE Conf. on Computer Vision and Pattern Recognition, 1997, pp. …
  • L. Fei-Fei, P. Perona, A Bayesian hierarchical model for learning natural scene categories, in: IEEE Conf. on Computer …
  • J. Sivic, B. Russell, A. Efros, A. Zisserman, W. Freeman, Discovering object categories in image collections, in: Int. …
  • F. Monay et al., Modeling semantic aspects for cross-media image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence (2007).
  • Y.-G. Jiang, C.-W. Ngo, J. Yang, Towards optimal bag-of-features for object categorization and semantic video retrieval, …
  • C. Pinhanez, A. Bobick, PNF propagation and the detection of actions described by temporal intervals, in: DARPA IU …
  • C. Pinhanez, A. Bobick, Human action detection using PNF propagation of temporal constraints, in: IEEE Conf. on …
  • M. Brand, N. Oliver, A. Pentland, Coupled hidden Markov models for complex action recognition, in: IEEE Conf. on …
  • A. Wilson et al., Parametric hidden Markov models for gesture recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (1999).
  • C. Sminchisescu, A. Kanaujia, Z. Li, D. Metaxas, Conditional models for contextual human motion recognition, in: ICCV, …
  • P. Hong, M. Turk, T. Huang, Gesture modeling and recognition using finite state machines, in: Int. Conf. Automatic Face …
  • N. Ikizler, D. Forsyth, Searching video for complex activities with finite state models, in: IEEE Conf. on Computer …
  • R. Polana, R. Nelson, Detecting activities, in: IEEE Conf. on Computer Vision and Pattern Recognition, 1993, pp. …
  • A. Bobick et al., The recognition of human movement using temporal templates, IEEE Transactions on Pattern Analysis and Machine Intelligence (2001).
  • A.A. Efros, A.C. Berg, G. Mori, J. Malik, Recognizing action at a distance, in: ICCV’03, 2003, pp. …