Spatial–temporal convolutional neural networks for anomaly detection and localization in crowded scenes

https://doi.org/10.1016/j.image.2016.06.007

Highlights

  • A mechanism for localizing dynamic regions in crowded scenes is proposed.

  • A spatial–temporal Convolutional Neural Network is designed to automatically extract spatial–temporal features of the crowd.

  • The performance of anomaly detection improves when the analysis is concentrated on the dynamic regions only.

  • Anomalous events that take place in small regions are effectively detected and localized by the spatial–temporal Convolutional Neural Network.

Abstract

Abnormal behavior detection in crowded scenes is extremely challenging in the field of computer vision due to severe inter-object occlusions, varying crowd densities and the complex mechanics of a human crowd. We propose a method for detecting and localizing anomalous activities in video sequences of crowded scenes. The key novelty of our method is the coupling of anomaly detection with a spatial–temporal Convolutional Neural Network (CNN), which to the best of our knowledge has not been done before. This architecture captures features from both the spatial and temporal dimensions by performing spatial–temporal convolutions, so that both the appearance and the motion information encoded in consecutive frames are extracted. The spatial–temporal convolutions are performed only within spatial–temporal volumes of moving pixels, which ensures robustness to local noise and increases detection accuracy. We experimentally evaluate our model on benchmark datasets containing various situations with human crowds, and the results demonstrate that the proposed approach surpasses state-of-the-art methods.

Introduction

The widespread use of surveillance systems in railway stations, airports, roads and malls has resulted in large volumes of video data. There is an increasing need not only to recognize objects and their behaviors, but in particular to detect abnormal behavior within the large body of ordinary data. Automatically detecting abnormal activities or events in long video sequences is crucial for intelligent surveillance [39], behavior analysis [37], and security applications [11]. Abnormal behavior detection in crowded scenes is a particularly challenging problem due to the large number of pedestrians in close proximity, the variability of individual appearance, the frequent partial occlusions they produce, and the irregular motion patterns of the crowd. Moreover, crowded environments harbor potentially dangerous events, such as crowd panic, stampedes and accidents involving many individuals, which make automated scene analysis all the more necessary.

One of the main challenges is to detect anomalies in both the time and space domains [9]. This requires determining in which frames anomalies occur (which we refer to as the frame level) and localizing the regions that generate the anomalies within those frames (the pixel level) [30]. Another fundamental limitation is that there is no commonly accepted definition of anomaly; the definition varies significantly with the given scenario. Conventionally, anomalies are identified as events that have a low probability of occurring, or that differ significantly from earlier observations [16], [27], [44]. Existing approaches for detecting anomalies fall into two categories: (1) object-centric approaches [41], [33], and (2) holistic methods [29], [32]. A typical object-centric approach treats the crowd as a set of individuals; to understand crowd activities, it must first segment the crowd of interest into objects. However, object-centric approaches face considerable difficulty in detecting objects, tracking trajectories, and recognizing behaviors in crowded scenes. A common drawback of these methods is that they cannot handle dense crowds: once the density of objects increases, their performance degrades.
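As an illustration of these two evaluation granularities, the sketch below computes frame-level and pixel-level detection rates from binary anomaly masks. The helper name, and the 40% overlap criterion often used for pixel-level evaluation on UCSD-style benchmarks, are assumptions for illustration, not taken from this paper:

```python
import numpy as np

def evaluate(pred_masks, gt_masks, overlap=0.4):
    """Frame-level and pixel-level detection rates for anomaly detection.

    pred_masks, gt_masks: boolean arrays of shape (num_frames, H, W).
    A frame counts as detected at the frame level if any pixel is flagged;
    at the pixel level the prediction must also cover at least `overlap`
    of the ground-truth anomalous region.
    """
    frame_hits, pixel_hits, positives = 0, 0, 0
    for pred, gt in zip(pred_masks, gt_masks):
        if not gt.any():
            continue  # only frames containing an anomaly count as positives
        positives += 1
        if pred.any():
            frame_hits += 1  # frame level: anomaly flagged somewhere
            if (pred & gt).sum() >= overlap * gt.sum():
                pixel_hits += 1  # pixel level: localization also correct
    return frame_hits / positives, pixel_hits / positives
```

A detector can thus score well at the frame level while failing at the pixel level, which is why both rates are reported separately.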

For the holistic methods, the aim is not to detect and track each individual. Instead, the crowd is considered as a whole entity. Typical features, such as spatial–temporal gradients or optical flow data, are used in these methods to localize regions with dramatic motion. Through modeling the normal/abnormal crowd motion patterns, anomaly detection is carried out by pre-trained classifiers.
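The motion cue that such holistic methods start from can be sketched with simple frame differencing, a crude stand-in for optical flow or spatial–temporal gradients; the function name and threshold value here are illustrative assumptions, not part of the paper's method:

```python
import numpy as np

def dynamic_regions(frames, thresh=10.0):
    """Flag dynamic pixels by thresholding the temporal gradient.

    frames: float array of shape (T, H, W) holding grayscale intensities.
    Returns a boolean (T-1, H, W) mask marking pixels whose intensity
    changed by more than `thresh` between consecutive frames.
    """
    diff = np.abs(np.diff(frames, axis=0))  # |I_{t+1} - I_t| per pixel
    return diff > thresh
```

Restricting further analysis to the pixels this mask flags is what lets a holistic pipeline concentrate on regions with dramatic motion.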

In fact, anomaly detection can arguably be defined as a binary classification problem, i.e., activities of the crowd are classified as either normal or abnormal. Recently, many works have demonstrated the power of CNNs [24] in a wide variety of computer vision tasks, such as object classification and detection [23], [36], text recognition [13], edge detection [38], and face recognition [40]. CNNs have also shown promise for video classification tasks. Ji et al. [17] propose a 3D CNN for human action recognition from video sequences. Wang et al. [42] develop a novel reconfigurable CNN for automatic 3D human activity recognition from RGB-D videos. Maturana and Scherer [28] employ a 3D CNN to detect safe landing zones for autonomous helicopters from LiDAR point clouds. Encouraged by these results, we turn to anomaly detection in crowded scenes. Since motion patterns of the crowd have both spatial and temporal characteristics, we develop a spatial–temporal CNN model that exploits not only the appearance information present in a single static frame, but also the complex motion information extracted from consecutive frames. To capture anomalous events appearing in a small part of the frame, the spatial–temporal CNN model is applied only to spatial–temporal volumes of interest (SVOIs), which both ensures robustness to noise and lowers the computational cost. Experimental results on available benchmark datasets show that our algorithm outperforms state-of-the-art methods. Compared with previous works, our method has two advantages: (1) it acts directly on the raw inputs (SVOIs) without any preprocessing, instead of relying on hand-crafted features; (2) it does not depend on foreground segmentation results, because only motion and appearance information is considered. As long as the crowd is moving, its motion patterns and texture are captured by the spatial–temporal CNN model and reasonable anomaly detection results are achieved.
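The core operation — a convolution that slides along time as well as space — can be sketched in plain NumPy over a small (T, H, W) volume. This is a minimal, unoptimized illustration of a single-channel spatial–temporal convolution, not the paper's implementation; a real model stacks many learned kernels with nonlinearities and pooling:

```python
import numpy as np

def conv3d(volume, kernel):
    """Valid 3D convolution of a spatial-temporal volume with one kernel.

    volume: (T, H, W) stack of frames (e.g. one SVOI); kernel: (t, h, w).
    Sliding the kernel along the temporal axis as well as the two spatial
    axes is what lets the response encode motion, not just appearance.
    """
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.empty((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # inner product of the kernel with one spatial-temporal patch
                out[i, j, k] = (volume[i:i+t, j:j+h, k:k+w] * kernel).sum()
    return out
```

A purely spatial convolution corresponds to t = 1; any t > 1 mixes information across consecutive frames, which is the sense in which the features are spatial–temporal.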

Section snippets

Related work

In this section, we review related work. A considerable amount of literature has been published on anomaly detection from static cameras [43], [10], [6]. However, most of these works are limited to sparse scenarios, where detailed visual information about each individual can be captured. Once the density of objects increases, such information can no longer be easily extracted by traditional methods. Hence, many algorithms have been proposed specifically for crowded scenes.

Methodology

In this work, we focus on the challenge of detecting anomalies in both time and space in video sequences with crowds of varying densities. Both motion and appearance features extracted by our spatial–temporal CNN model are used to effectively and robustly capture these anomalies for a wide range of scenarios.

Experiments

In this section, we evaluate the effectiveness of our method on four benchmark datasets, i.e. UCSD [27], UMN [29], Subway [1], and U-turn [3], which exhibit different kinds of anomalies, and compare its performance with state-of-the-art methods. The accuracy of our algorithm is computed at both the frame level and the pixel level. As mentioned in Section 3.1, the spatial size of the SVOI should be large enough to contain useful appearance information and, at the same time, small enough to capture local…

Conclusion

In this work, we develop a spatial–temporal CNN model for anomaly detection and localization in different scenes recorded from static cameras. In our method, spatial–temporal volumes that carry rich motion information are fed in to train the spatial–temporal CNN model for anomaly detection. The model is designed to robustly construct features from both the spatial and temporal dimensions by performing spatial–temporal convolutions; therefore, appearance features as well as…

References (48)

  • K.W. Cheng, Y.T. Chen, W.H. Fang, Video anomaly detection and localization using hierarchical feature representation...
  • Y. Cong, J. Yuan, J. Liu, Sparse reconstruction cost for abnormal event detection, in: 2011 IEEE Conference on Computer...
  • N.P. Cuntoor et al., Activity modeling using event probability sequences, IEEE Trans. Image Process. (2008)
  • T. Gandhi et al., Pedestrian protection systems: issues, survey, and challenges, IEEE Trans. Intell. Transp. Syst. (2007)
  • C. Garcia et al., Convolutional face finder: a neural architecture for fast and robust face detection, IEEE Trans. Pattern Anal. Mach. Intell. (2004)
  • I.J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, V. Shet, Multi-digit number recognition from street view imagery...
  • M.A. Hearst et al., Support vector machines, IEEE Intell. Syst. Appl. (1998)
  • D. Helbing et al., Social force model for pedestrian dynamics, Phys. Rev. E (1995)
  • T. Hospedales, S. Gong, T. Xiang, A Markov clustering topic model for mining behaviour in video, in: 2009 IEEE 12th...
  • S. Ji et al., 3D convolutional neural networks for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2013)
  • Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: convolutional...
  • V. Kaltsa et al., Swarm intelligence for detecting interesting events in crowded environments, IEEE Trans. Image Process. (2015)
  • J. Kim, K. Grauman, Observe locally, infer globally: a space–time MRF for detecting abnormal activities with...
  • L. Kratz, K. Nishino, Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models, in:...