Spatial–temporal convolutional neural networks for anomaly detection and localization in crowded scenes
Introduction
The widespread use of surveillance systems in railway stations, airports, roads, and malls has resulted in large volumes of video data. There is an increasing need not only to recognize objects and their behaviors, but in particular to detect abnormal behavior within the large body of ordinary data. Automatically detecting abnormal activities or events in long-duration video sequences is crucial for intelligent surveillance [39], behavior analysis [37], and security applications [11]. Abnormal behavior detection in crowded scenes is especially challenging because of the large number of pedestrians in close proximity, the variability of individual appearance, the frequent partial occlusions they produce, and the irregular motion patterns of the crowd. In addition, crowded environments harbor potentially dangerous activities, such as crowd panic, stampedes, and accidents involving many individuals, which make automated scene analysis all the more necessary.
One of the main challenges is to detect anomalies in both the time and space domains [9]. This means finding out in which frames anomalies occur (referred to as the frame level) and localizing the regions that generate the anomalies within those frames (referred to as the pixel level) [30]. Another fundamental limitation is that there is no commonly accepted definition of anomaly; the definition varies significantly with the given scenario. Conventionally, anomalies are identified as events that have a low probability of occurring, or that deviate significantly from earlier observations [16], [27], [44]. Existing approaches for detecting anomalies can be classified into two categories: (1) object-centric approaches [41], [33] and (2) holistic methods [29], [32]. A typical object-centric approach treats the crowd as a set of individuals: to understand the crowd's activities, the crowd of interest must first be segmented into objects. However, object-centric approaches face considerable complexity in detecting objects, tracking trajectories, and recognizing behaviors in crowded scenes. A common drawback of these methods is that they cannot handle dense crowds: as the density of objects increases, their performance degrades.
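As a toy illustration of the low-probability definition of anomaly above (not the model used in this paper), one can fit a simple Gaussian to feature values observed under normal conditions and flag test samples that fall far from it; the feature values below are made up:

```python
import numpy as np

# Toy illustration of the low-probability definition of anomaly: fit a
# Gaussian to features of normal training data, then flag test samples
# that lie far outside the fitted distribution.
normal = np.array([1.0, 1.1, 0.9, 1.05, 0.95])  # hypothetical normal features
mu, sigma = normal.mean(), normal.std() + 1e-8

def is_anomalous(x, k=3.0):
    """Flag samples more than k standard deviations from the normal mean."""
    return abs(x - mu) > k * sigma

print(is_anomalous(1.02), is_anomalous(5.0))  # False True
```

Real systems replace the scalar Gaussian with richer models over spatial–temporal features, but the thresholded-likelihood principle is the same.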
For the holistic methods, the aim is not to detect and track each individual. Instead, the crowd is considered as a whole entity. Typical features, such as spatial–temporal gradients or optical flow data, are used in these methods to localize regions with dramatic motion. Through modeling the normal/abnormal crowd motion patterns, anomaly detection is carried out by pre-trained classifiers.
In fact, anomaly detection can arguably be defined as a binary classification problem, i.e., activities of the crowd are classified as either normal or abnormal. Recently, many works have demonstrated the power of CNNs [24] in a wide variety of computer vision tasks, such as object classification and detection [23], [36], text recognition [13], edge detection [38], and face recognition [40]. CNNs have also shown promise for video classification tasks. Ji et al. [17] propose a 3D CNN for human action recognition from video sequences. Wang et al. [42] develop a novel reconfigurable CNN for automatic 3D human activity recognition from RGB-D videos. Maturana and Scherer [28] employ a 3D CNN to detect safe landing zones for autonomous helicopters from LiDAR point clouds. Encouraged by these results, we turn to anomaly detection in crowded scenes. Motion patterns of a crowd have both spatial and temporal character, so to detect and localize anomalous events in video sequences of crowded scenes we develop a spatial–temporal CNN model, which exploits not only the appearance information present in a single static image but also the complex motion information extracted from consecutive frames. To capture anomalous events appearing in a small part of the frame, the spatial–temporal CNN model is applied only to spatial–temporal volumes of interest (SVOIs), which both ensures robustness to noise and achieves a lower computational cost. Experimental results on available benchmark datasets show that our algorithm outperforms state-of-the-art methods. Compared with previous works, our method has two advantages: (1) it acts directly on the raw inputs (SVOIs) without any preprocessing, rather than relying on hand-crafted features; (2) it does not rely on foreground segmentation results, because only motion and appearance information is considered.
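The exact network architecture is not given in this excerpt, but the core operation (a spatial–temporal convolution over an SVOI) can be sketched in plain NumPy; the SVOI dimensions and kernel below are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive valid-mode 3D convolution (cross-correlation) over a
    spatial-temporal volume of shape (T, H, W)."""
    t, h, w = kernel.shape
    T, H, W = volume.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # Each output value mixes appearance (spatial extent)
                # and motion (temporal extent) in one weighted sum.
                out[i, j, k] = np.sum(volume[i:i+t, j:j+h, k:k+w] * kernel)
    return out

# A hypothetical SVOI: 7 consecutive 15x15 grayscale patches.
svoi = np.random.rand(7, 15, 15)
feat = conv3d_valid(svoi, np.ones((3, 3, 3)) / 27.0)  # 3x3x3 averaging kernel
print(feat.shape)  # (5, 13, 13)
```

Because the kernel extends along the temporal axis, the learned filters respond to motion patterns across consecutive frames, not just static appearance, which is what distinguishes this from an ordinary 2D convolution.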
Once the crowd is moving, its motion patterns and texture are captured by the spatial–temporal CNN model, yielding reasonable anomaly detection results.
Section snippets
Related work
In this section, we review related work. A considerable amount of literature has been published on anomaly detection from static cameras [43], [10], [6]. However, most of these works are limited to sparse scenarios, where detailed visual information about each individual can be captured. Once the density of objects increases, such information can no longer be easily extracted by traditional methods. Hence, many algorithms have been proposed to address crowded scenes.
Methodology
In this work, we focus on the challenge of detecting anomalies in both time and space in video sequences with crowds of varying densities. Both motion and appearance features extracted by our spatial–temporal CNN model are used to effectively and robustly capture these anomalies for a wide range of scenarios.
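The excerpt does not specify how SVOIs are selected; one plausible sketch, consistent with the stated goal of skipping static regions to save computation, keeps only volumes where the temporal frame difference indicates sufficient motion (the cell size, temporal depth, and threshold below are assumed values):

```python
import numpy as np

def select_svois(frames, cell=15, depth=7, motion_thresh=0.05):
    """Tile each clip into cell x cell x depth volumes and keep only those
    whose mean absolute frame difference exceeds a motion threshold."""
    T, H, W = frames.shape
    svois = []
    for t0 in range(0, T - depth + 1, depth):
        clip = frames[t0:t0 + depth]
        diff = np.abs(np.diff(clip, axis=0))        # temporal gradient proxy
        for y in range(0, H - cell + 1, cell):
            for x in range(0, W - cell + 1, cell):
                if diff[:, y:y+cell, x:x+cell].mean() > motion_thresh:
                    svois.append((t0, y, x))        # top-left corner of volume
    return svois

frames = np.zeros((7, 30, 30))
frames[1:, 5:15, 5:15] = 1.0          # a synthetic moving blob
print(select_svois(frames))           # [(0, 0, 0)]
```

Only the single cell containing the moving blob survives the filter, so the classifier never runs on the three static cells.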
Experiments
In this section, we evaluate the effectiveness of our method on four benchmark datasets, i.e., UCSD [27], UMN [29], Subway [1], and U-turn [3], in which different kinds of anomalies appear, and compare its performance with state-of-the-art methods. The accuracy of our algorithm is computed at both the frame level and the pixel level. As mentioned in Section 3.1, the spatial size of the SVOI should be large enough to contain useful appearance information and, at the same time, small enough to capture local
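A minimal sketch of the two evaluation criteria, assuming the common conventions that a frame is anomalous if any region score exceeds a threshold, and that a pixel-level detection counts only with sufficient overlap of the ground-truth mask (the 40% figure is an assumption based on the widely used UCSD protocol, not stated in this excerpt):

```python
def frame_level_labels(region_scores, thresh):
    """Frame-level criterion: a frame is flagged anomalous if any of its
    region scores exceeds the threshold."""
    return [max(scores, default=0.0) > thresh for scores in region_scores]

def pixel_level_hit(pred_mask, gt_mask, overlap=0.4):
    """Pixel-level criterion: a detection counts only if the predicted mask
    covers at least `overlap` of the ground-truth anomalous pixels."""
    gt = sum(gt_mask)
    return gt > 0 and sum(p and g for p, g in zip(pred_mask, gt_mask)) / gt >= overlap

# Per-frame region scores for three frames (last frame has no active SVOI).
scores = [[0.1, 0.2], [0.9, 0.3], []]
print(frame_level_labels(scores, 0.5))  # [False, True, False]
```

Sweeping the threshold over these criteria yields the frame-level and pixel-level ROC curves used to compare methods.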
Conclusion
In this work, we develop a spatial–temporal CNN model for anomaly detection and localization in different scenes recorded from static cameras. In our method, spatial–temporal volumes that carry rich motion information are used to train the spatial–temporal CNN model for anomaly detection. The spatial–temporal CNN model is designed to robustly construct features from both the spatial and temporal dimensions by performing spatial–temporal convolutions; therefore, appearance features as well as
References (48)
- et al., Multi-scale and real-time non-parametric approach for anomaly detection and localization, Comput. Vis. Image Underst. (2012)
- et al., Abnormal event detection in crowded scenes using sparse representation, Pattern Recognit. (2013)
- et al., Spatio–temporal context analysis within video volumes for anomalous-event detection and localization, Neurocomputing (2015)
- et al., Human action segmentation and recognition via motion and shape analysis, Pattern Recognit. Lett. (2012)
- et al., Context-aware local abnormality detection in crowded scene, Sci. China Inf. Sci. (2015)
- et al., Robust real-time unusual event detection using multiple fixed-location monitors, IEEE Trans. Pattern Anal. Mach. Intell. (2008)
- S. Ali, M. Shah, A Lagrangian particle dynamics approach for crowd flow segmentation and stability analysis, in: IEEE...
- Y. Benezeth, P.M. Jodoin, V. Saligrama, C. Rosenberger, Abnormal events detection based on spatio–temporal...
- et al., Latent Dirichlet allocation, J. Mach. Learn. Res. (2003)
- et al., Anomaly detection: a survey, ACM Comput. Surv. (CSUR) (2009)
- Activity modeling using event probability sequences, IEEE Trans. Image Process.
- Pedestrian protection systems: issues, survey, and challenges, IEEE Trans. Intell. Transp. Syst.
- Convolutional face finder: a neural architecture for fast and robust face detection, IEEE Trans. Pattern Anal. Mach. Intell.
- Support vector machines, IEEE Intell. Syst. Appl.
- Social force model for pedestrian dynamics, Phys. Rev. E
- 3D convolutional neural networks for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell.
- Swarm intelligence for detecting interesting events in crowded environments, IEEE Trans. Image Process.