Pattern Recognition, Volume 41, Issue 7, July 2008, Pages 2309-2326

Activity based surveillance video content modelling

https://doi.org/10.1016/j.patcog.2007.11.024

Abstract

This paper tackles the problem of surveillance video content modelling. Given a set of surveillance videos, the aims of our work are twofold: firstly, a continuous video is segmented according to the activities captured in the video; secondly, a model is constructed for the video content, based on which an unseen activity pattern can be recognised and any unusual activities can be detected. To segment a video based on activity, we propose a semantically meaningful video content representation method and two segmentation algorithms, one being offline and offering high accuracy in segmentation, the other being online and enabling real-time performance. Our video content representation method is based on automatically detected visual events (i.e. ‘what is happening in the scene’). This is in contrast to most previous approaches, which represent video content at the signal level using image features such as colour, motion and texture. Our segmentation algorithms are based on detecting breakpoints on a high-dimensional video content trajectory. This differs from most previous approaches, which are based on shot change detection and shot grouping. Having segmented continuous surveillance videos based on activity, the activity patterns contained in the video segments are grouped into activity classes and a composite video content model is constructed which is capable of generalising from a small training set to accommodate variations in unseen activity patterns. A run-time accumulative unusual activity measure is introduced to detect unusual behaviour, while usual activity patterns are recognised based on an online likelihood ratio test (LRT) method. This ensures robust and reliable activity recognition and unusual activity detection in the shortest possible time once sufficient visual evidence has become available. Comparative experiments have been carried out using over 10 h of challenging outdoor surveillance video footage to evaluate the proposed segmentation algorithms and modelling approach.

Introduction

The rapid increase in the amount of CCTV surveillance video data generated has led to an urgent demand for automated analysis of video content. Content analysis for surveillance videos is more challenging than that for broadcast videos such as news and sports programmes because the latter are more constrained, better structured and of better quality. This paper aims to address a number of key issues in surveillance video content analysis:

  • 1.

    How to construct a representation of video content which is informative, concise and able to bridge the gap between the low level visual features embedded in the video data and the high level semantic concepts used by humans to describe the video content?

  • 2.

    How to segment a continuous surveillance video temporally into activity patterns according to changes in video content?

  • 3.

    Given a training video dataset, how to construct a model for the video content which can accommodate the variations in the unseen activity patterns both in terms of duration and temporal ordering?

  • 4.

    Given a video content model and an unseen video, how to perform online activity recognition and unusual activity detection?

To this end, we first propose in this paper a semantically meaningful representation based on automatically detected discrete visual events in a video, together with two segmentation algorithms. One of the proposed algorithms is offline, offering better segmentation accuracy but being computationally more demanding, while the other is online, enabling real-time performance. We then develop a generative video content model based on unsupervised learning. Using this model, an unseen activity pattern can be recognised as belonging to one of the learned classes if similar patterns are included in the training dataset, and unusual activities can also be detected.

A suitable representation is crucial for video segmentation. For activity based video segmentation, we propose to represent surveillance video content holistically in space and over time based on visual events detected automatically in the scene. This is in contrast to most previous approaches, which represent video content at the signal level using image features such as colour, motion and texture [1], [2], [3], [4], [5], [6]. The essence of our method is to represent video content based on ‘what is happening’ rather than ‘what is present’ in the scene. ‘What is happening’ in the scene is reflected through activities captured in the video, which are most likely to involve multiple objects interacting or co-existing in a shared common space. An activity is composed of groups of co-occurring events, which are defined as significant visual changes detected in image frames over time. Events are detected and classified by unsupervised clustering using a Gaussian mixture model (GMM) with automatic model selection based on Schwarz's Bayesian information criterion (BIC) [7], [8]. Video content is then represented as temporally correlated events automatically labelled into different classes. By doing so, changes in the presence of, and temporal correlations among, different classes of events can indicate video content changes, therefore providing vital cues for activity based video segmentation.
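
As a concrete illustration of this event clustering step, the following is a minimal sketch using scikit-learn: GMMs with an increasing number of components are fitted to the event features and the model minimising BIC is kept. The event feature matrix, its dimensionality and the candidate component range are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: unsupervised event clustering with automatic model
# selection via Schwarz's BIC. The feature matrix (one row per detected
# visual event) and the component range are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_events(event_features, max_components=10, seed=0):
    """Fit GMMs with 1..max_components and keep the one minimising BIC."""
    best_gmm, best_bic = None, np.inf
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, covariance_type='full',
                              random_state=seed).fit(event_features)
        bic = gmm.bic(event_features)
        if bic < best_bic:
            best_gmm, best_bic = gmm, bic
    return best_gmm, best_gmm.predict(event_features)

# Example: 200 events, each described by a 7-D feature vector
# (e.g. location, shape and motion of the detected visual change).
events = np.random.rand(200, 7)
model, labels = cluster_events(events)
print(model.n_components, labels[:10])
```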

Video segmentation has been studied extensively in the past two decades. Traditionally, a four-layer hierarchical structure is adopted for video structure analysis which consists of a frame layer, a shot layer, a scene layer and a video layer [4]. At the bottom of the structure, continuous image frames taken by a single camera are grouped into shots. A series of related shots are then grouped into a scene. Shot change detection is performed as the first step for video segmentation by most previous approaches [1], [2], [3], [4]. The segmentation of shots and scenes heavily relies on a well-defined feature space usually dominated by colour and motion. For example, in Ref. [1], image frames were represented using a holistic colour histogram and the frame difference was exploited to detect shots. This structure is in general valid for constrained, well-structured broadcast videos of news and sports programmes. However, for a surveillance video which is taken continuously by a fixed camera without script-driven panning and zooming, global colour and motion information is either highly unreliable or unavailable [9]. More importantly, there is only one shot in a surveillance video and any shot-change detection based segmentation approach would be unsuitable.

Recently, DeMenthon et al. [10] proposed to represent a video as a high-dimensional temporal trajectory based on colour histograms and to treat video segmentation as a trajectory breakpoint detection problem. Compared to the thresholding based segmentation algorithms adopted by most previous video segmentation approaches [1], [2], a trajectory breakpoint detection based approach is more robust to local noise at individual frames because segmentation is performed holistically over the whole duration of the video. Various approaches have been proposed to segment a continuous trajectory into segments through breakpoint detection for time-series data segmentation [11], [12], [13]. Most of them are based on piecewise linear approximation (PLA) or probabilistic graphical models such as hidden Markov models (HMMs) [14]. PLA refers to finding the best approximation of a trajectory using straight lines by either linear interpolation or linear regression. However, the computational cost of PLA is nontrivial, especially when the dimensionality of the trajectory space is high [13]; as a result, most existing PLA segmentation algorithms have only been applied to trajectories in a space with a dimensionality no greater than three. On the other hand, HMMs have the potential to be robust to noise and are capable of dynamic time warping (DTW). However, a large number of parameters are needed to describe an HMM when the dimensionality of the trajectory space is high. This makes an HMM vulnerable to over-fitting when training data are insufficient. To solve this problem, we propose a multi-observation hidden Markov model (MOHMM) which requires fewer parameters than a conventional HMM. It is thus more suitable for high-dimensional video content trajectory segmentation.
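
To make the parameter argument concrete, the following back-of-the-envelope count compares free parameters for a K-state model with D-dimensional Gaussian observations. The MOHMM figure assumes the factorisation treats observation variables as conditionally independent given the hidden state (i.e. a diagonal rather than full observation covariance); this is an illustrative reading, not the paper's exact parameterisation.

```python
# Free-parameter counts for a K-state hidden Markov model emitting
# D-dimensional Gaussian observations.
def hmm_params(K, D):
    transition = K * (K - 1)               # each transition row sums to 1
    initial = K - 1                        # initial state distribution
    emission = K * (D + D * (D + 1) // 2)  # mean + full covariance per state
    return transition + initial + emission

def mohmm_params(K, D):
    # Assumed factorisation: each observation variable is modelled
    # independently given the state, so only a mean and a variance
    # per variable per state.
    transition = K * (K - 1)
    initial = K - 1
    emission = K * (2 * D)
    return transition + initial + emission

K, D = 4, 50                               # e.g. a 50-D content trajectory
print(hmm_params(K, D), mohmm_params(K, D))  # 5315 vs 415
```

The gap grows quadratically with D, which is why a full-covariance HMM over-fits on a high-dimensional video content trajectory while a factorised model remains trainable from limited data.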

Most existing segmentation algorithms are offline. For the purpose of video segmentation, an online algorithm has a distinctive advantage due to the huge amount of surveillance video data to be processed and, more importantly, the real-time nature of some surveillance applications. For instance, if a video content change is detected online and in real-time, it can be used to alert CCTV control room operators to act accordingly. One of the most popular online segmentation algorithms for temporal trajectories is the sliding window (SW) algorithm, based on the SW principle [15] and local PLA. However, the SW algorithm tends to over-segment [11], [15], [16]. One possible explanation is that it lacks a global view of the data since it only ‘looks backward’ without ‘looking forward’. Keogh et al. [11] attempted to solve the problem by combining the bottom-up offline algorithm [15] with the SW principle. Nevertheless, their algorithm works only on trajectories with very short segments. It is thus impossible for the algorithm to run in real-time on a typical surveillance video sequence, which comprises segments lasting for hours. In this paper, we propose a novel forward–backward relevance (FBR) algorithm. Compared to a conventional SW algorithm, FBR is less sensitive to noise and, more importantly, can be run in real-time.
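
For reference, here is a minimal sketch of the SW principle with local linear fitting: the window grows until the least-squares residual exceeds a threshold, at which point a breakpoint is declared and the window restarts. The threshold and toy trajectory are illustrative assumptions; the proposed FBR algorithm, which additionally ‘looks forward’, is not reproduced here.

```python
# Sliding-window segmentation sketch: grow a window until a straight-line
# fit no longer explains it, then declare a breakpoint and restart.
import numpy as np

def fit_error(segment):
    """Total squared residual of a least-squares line through the segment."""
    t = np.arange(len(segment), dtype=float)
    A = np.vstack([t, np.ones_like(t)]).T
    _, residuals, _, _ = np.linalg.lstsq(A, segment, rcond=None)
    return residuals.sum() if residuals.size else 0.0

def sliding_window_segment(traj, max_error=5.0):
    breakpoints, start = [], 0
    for end in range(2, len(traj)):
        if fit_error(traj[start:end + 1]) > max_error:
            breakpoints.append(end - 1)  # close the segment before the overshoot
            start = end - 1
    return breakpoints

# Toy 1-D trajectory with a change of direction at t = 50.
rng = np.random.default_rng(0)
traj = np.concatenate([np.linspace(0, 5, 50), np.linspace(5, 0, 50)])
traj = traj[:, None] + 0.05 * rng.standard_normal((100, 1))
print(sliding_window_segment(traj))
```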

Using the proposed segmentation algorithms, continuously recorded video or online CCTV input can be segmented into activity patterns. Given these activity patterns, the goal of video content modelling is to learn a model that is capable of detecting unusual activity patterns whilst recognising novel instances of expected usual activity patterns. In this context, we define an unusual activity as an atypical activity pattern that is not represented by sufficient samples in a training dataset but, critically, is specific in the way it deviates from usual patterns. This constraint matters because one of the main challenges for the model is to differentiate unusual activities from outliers caused by the noisy visual features used for activity representation. The effectiveness of a video content modelling approach should be measured by (1) how well unusual activities can be detected (i.e. measuring specificity to expected patterns of activity) and (2) how accurately and robustly different classes of usual activity patterns can be recognised (i.e. maximising between-class discrimination).

To solve the problem, we develop a novel framework for fully unsupervised video content modelling and online unusual activity detection. Our framework has the following key components:

  • 1.

    Discovering natural groupings of activity patterns using an activity affinity matrix. A number of affinity matrix based clustering techniques have been proposed recently [17], [18], [19]. However, these approaches require the number of clusters to be known. Given an unlabelled dataset, the number of activity classes is unknown in our case. To determine the number of clusters automatically, a recently proposed spectral clustering algorithm [20] is deployed.

  • 2.

    A composite generative video content model using a mixture of dynamic Bayesian networks (DBNs). The advantages of such a generative video content model are twofold: (a) it can accommodate well the variations in unseen, usual activity patterns, both in terms of duration and temporal ordering, by generalising from a training set with a limited number of samples. This is important because in reality the same usual activity can be executed in many different usual ways. These variations cannot possibly be captured in a limited training dataset and need to be dealt with by a learned video content model. (b) Such a model is robust to errors in activity representation. This is because a mixture of DBNs can cope with errors occurring at individual frames and is also able to distinguish an error-corrupted usual activity pattern from an unusual one.

  • 3.

    Online unusual activity detection using a run-time accumulative unusual activity measure, and usual activity recognition using an online LRT method. A run-time accumulative measure is introduced to determine how usual/unusual an unseen activity pattern is on-the-fly. The activity pattern is then recognised as one of the usual activity classes if detected as being usual. Recognition of usual activities is carried out using an online LRT method which holds the decision on recognition until sufficient visual evidence has become available (a sketch of this accumulate-then-decide scheme is given after this list). This is in order to overcome any ambiguity among different activity classes observed online due to insufficient visual evidence at a given time instance. By doing so, robust activity recognition and unusual activity detection are ensured in the shortest possible time, as opposed to previous work such as [9], [21], [22] which requires complete activity patterns to be observed. Our online LRT based activity recognition approach is also advantageous over previous ones based on the maximum likelihood (ML) method [9], [22], [23]. An ML based approach makes a forced decision on activity recognition at each time instance without considering the reliability and sufficiency of the accumulated visual evidence. Consequently, it can be error prone.
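
The following is a minimal sketch of such an accumulate-then-decide scheme, in the spirit of a sequential likelihood ratio test: per-frame log-likelihoods under each learned class model are accumulated, and a decision is held back until the best class beats the runner-up by a margin. The per-frame likelihood functions and the threshold are placeholder assumptions, not the paper's exact formulation.

```python
# Online recognition sketch: accumulate log-likelihood evidence per class
# and decide only once the evidence is sufficient (cf. Wald's sequential
# test). `loglik` is a list of per-frame log-likelihood functions, one per
# learned usual activity class (at least two classes assumed).
import numpy as np

def online_lrt(observations, loglik, threshold=5.0):
    """Return (class index, decision frame) or None if undecided."""
    cum = np.zeros(len(loglik))            # accumulated log-likelihoods
    for t, obs in enumerate(observations):
        for k, f in enumerate(loglik):
            cum[k] += f(obs)
        order = np.argsort(cum)[::-1]
        # Hold the decision until the log-likelihood ratio between the
        # best and second-best class exceeds the threshold.
        if cum[order[0]] - cum[order[1]] > threshold:
            return order[0], t
    return None                            # evidence never became sufficient
```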

The rest of the paper is structured as follows: in Section 2, we describe an event based surveillance video content representation approach. In Section 3, we address the problem of surveillance video segmentation. Two novel segmentation algorithms, MOHMM and FBR, are introduced in Sections 3.1 and 3.2, respectively. Comparative experiments are conducted using over 10 h of challenging outdoor surveillance video footage and the results are presented in Section 3.3. The pros and cons of both algorithms are analysed. The advantage of our event based video content representation over the traditional image feature based representation is also made clear by our experimental results. The proposed video content modelling approach is described in Section 4, where experiments are also presented to evaluate its effectiveness and robustness. A conclusion is drawn in Section 5.

Section snippets

Semantic video content representation

We consider an activity based video content representation. Visual events are detected and classified automatically in the scene. The semantics of video content are considered to be best encoded in the occurrence of such events and the temporal correlations among them.

Temporal segmentation of surveillance videos

It has been shown in the preceding section that a cumulative scene vector can represent the video content at the semantic level over time and is capable of capturing video content changes despite variations in activity durations and occurrences of inactivity break-ups within activities. After mapping a video sequence into a cumulative scene vector trajectory, the breakpoints on the trajectory correspond to changes in video content, i.e. the boundaries between activities.
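
As an illustration of how such a trajectory can be built, the sketch below accumulates counts of each detected event class over time, yielding a trajectory that drifts steadily within an activity and bends at activity boundaries. The simple counting form is an assumption for illustration; the paper's exact scene vector construction is given in Section 2.

```python
# Illustrative cumulative scene vector trajectory: the running count of
# each event class observed up to frame t (assumed simple counting form).
import numpy as np

def cumulative_scene_trajectory(frame_events, num_classes):
    """frame_events: per-frame lists of event-class labels."""
    traj = np.zeros((len(frame_events), num_classes))
    counts = np.zeros(num_classes)
    for t, events in enumerate(frame_events):
        for e in events:
            counts[e] += 1
        traj[t] = counts                   # a T x K trajectory over time
    return traj

# Example: 5 frames, 3 event classes; breakpoints on the resulting
# trajectory are what the segmentation algorithms look for.
print(cumulative_scene_trajectory([[0], [], [1, 1], [2], [0]], 3))
```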

Video content modelling for activity recognition and unusual activity detection

Using the approaches described in the preceding section, continuous surveillance videos in a training dataset are segmented into N video segments, ideally with each video segment now containing a single activity. Taking an unsupervised approach, the task now is to discover the natural grouping of the activity patterns (i.e. clustering), upon which a video content model is built for the N video segments. Given an unseen activity, the model is then employed to detect whether the activity pattern is usual or unusual.

Conclusion

In conclusion, we have presented a novel approach for representing, segmenting and modelling the content of CCTV surveillance videos according to the activities captured in the scene. The video content is represented by constructing a cumulative multi-event scene vector over time. An offline MOHMM based segmentation algorithm was introduced to deal with noise in video content representation. An online FBR algorithm was also developed to detect breakpoints in the video content and segment a continuous video in real-time.

References (44)

  • G. Schwarz, Estimating the dimension of a model, Ann. Statist. (1978).

  • T. Xiang et al., Autonomous visual events detection and classification without explicit object-centred segmentation and tracking.

  • S. Gong et al., Recognition of group activities using dynamic probabilistic networks.

  • D. DeMenthon et al., Relevance ranking of video data using hidden Markov model distances and polygon simplification.

  • E. Keogh et al., An online algorithm for segmenting time series.

  • P. Heckbert et al., Survey of polygonal surface simplification algorithms.

  • X. Ge et al., Segmental semi-Markov models for endpoint detection in plasma etching.

  • E. Keogh et al., On the need for time series data mining benchmarks: a survey and empirical demonstration, Data Min. Knowl. Discovery (2003).

  • H. Shatkay et al., Approximate queries and representations for large data sequences.

  • Y. Weiss, Segmentation using eigenvectors: a unifying view.

  • J. Shi et al., Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. (2000).

  • S. Yu et al., Multiclass spectral clustering.

    About the Author—Dr. TAO XIANG is a Lecturer at the Department of Computer Science, Queen Mary, University of London. Dr. Xiang was awarded his Ph.D. in Electrical and Computer Engineering from the National University of Singapore in 2002, which involved research into 3-D Computer Vision and Visual Perception. He also received his B.Sc. degree in Electrical Engineering from Xi’an Jiaotong University in 1995, and his M.Sc. degree in Electronic Engineering from the Communication University of China (CUC) in 1998. His research interests include computer vision, image processing, statistical learning, pattern recognition, machine learning and data mining. He has been working recently on topics such as spectral clustering, video based behaviour analysis and recognition and model order selection for dynamic Bayesian networks.

About the Author—Professor SHAOGANG GONG is Professor of Visual Computation at the Department of Computer Science, Queen Mary, University of London and a Member of the UK Computing Research Committee. He heads the Queen Mary Computer Vision Group and has worked in computer vision and pattern recognition for over 20 years, publishing over 170 papers and a monograph. He twice won the Best Science Prize (1999 and 2001) of the British Machine Vision Conference, the Best Paper Award (2001) of the IEEE International Workshop on Recognition and Tracking of Faces and Gestures, and the Best Paper Award (2005) of the IEE International Symposium on Imaging for Crime Detection and Prevention. He was a recipient of a Queen's Research Scientist Award (1987), was a Royal Society Research Fellow (1987 and 1988) and a GEC-Oxford Fellow (1989), and was a visiting scientist at Microsoft Research (2001) and Samsung (2003).
