Intelligent multi-camera video surveillance: A review

https://doi.org/10.1016/j.patrec.2012.07.005

Abstract

Intelligent multi-camera video surveillance is a multidisciplinary field related to computer vision, pattern recognition, signal processing, communication, embedded computing and image sensors. This paper reviews the recent development of relevant technologies from the perspectives of computer vision and pattern recognition. The covered topics include multi-camera calibration, computing the topology of camera networks, multi-camera tracking, object re-identification, multi-camera activity analysis and cooperative video surveillance with both active and static cameras. Detailed descriptions of their technical challenges and comparisons of different solutions are provided. The paper emphasizes the connection and integration of different modules in various environments and application scenarios. According to the most recent works, some problems can be jointly solved in order to improve efficiency and accuracy. With the fast development of surveillance systems, the scales and complexities of camera networks are increasing and the monitored environments are becoming more complicated and crowded. This paper discusses how to face these emerging challenges.

Highlights

► Review major modules and research topics on multi-camera video surveillance. ► Review technologies from the perspective of computer vision and pattern recognition. ► Detailed descriptions of technical challenges and comparison of different solutions. ► Emphasizes the connection and integration of different modules. ► Some problems can be jointly solved to improve efficiency and accuracy.

Introduction

Intelligent video surveillance has been one of the most active research areas in computer vision. The goal is to efficiently extract useful information from the huge amount of video collected by surveillance cameras by automatically detecting, tracking and recognizing objects of interest, and understanding and analyzing their activities. Video surveillance has a wide variety of applications in both public and private environments, such as homeland security, crime prevention, traffic control, accident prediction and detection, and monitoring patients, the elderly and children at home. These applications require monitoring indoor and outdoor scenes of airports, train stations, highways, parking lots, stores, shopping malls and offices. There is an increasing interest in video surveillance due to the growing availability of cheap sensors and processors, and also a growing need for safety and security from the public. Nowadays there are tens of thousands of cameras in a city collecting a huge amount of data on a daily basis. Researchers are urged to develop intelligent systems to efficiently extract information from large scale data.

The view of a single camera is finite and limited by scene structures. In order to monitor a wide area, such as tracking a vehicle traveling through the road network of a city or analyzing the global activities happening in a large train station, video streams from multiple cameras have to be used. Many intelligent multi-camera video surveillance systems have been developed (Collins et al., 2001; Aghajan and Cavallaro, 2009; Valera and Velastin, 2004). Multi-camera surveillance is a multidisciplinary field related to computer vision, pattern recognition, signal processing, communication, embedded computing and image sensors. This paper reviews the recent development of relevant technologies from the perspective of computer vision. Some key computer vision technologies used in multi-camera surveillance systems are shown in Fig. 1.

  1. Multi-camera calibration maps different camera views to a single coordinate system. In many surveillance systems, it is a key pre-step for other multi-camera based analysis.

  2. The topology of a camera network identifies whether camera views are overlapped or spatially adjacent and describes the transition time of objects between camera views.

  3. Object re-identification matches two image regions observed in different camera views and recognizes whether they belong to the same object, purely based on appearance information without spatio-temporal reasoning.

  4. Multi-camera tracking tracks objects across camera views.

  5. Multi-camera activity analysis automatically recognizes activities of different categories and detects abnormal activities in a large area by fusing information from multiple camera views.

Different modules support one another and the arrows in Fig. 1 show the information flow between them.

While some existing reviews (Valera and Velastin, 2004; Aghajan and Cavallaro, 2009) tried to cover all the aspects of architectures, technologies and applications, this paper emphasizes the connection and integration of these key computer vision and pattern recognition technologies in various environments and application scenarios and reviews their most recent development. Many existing surveillance systems solve these problems sequentially according to a pipeline. However, recent research shows that some of these problems can be jointly solved or even be skipped in order to overcome the challenges posed by certain application scenarios. For example, while it is easy to compute the topology of a camera network after cameras are well calibrated, some approaches compute the topology without camera calibration, because existing calibration methods have various limitations and may not be efficient or accurate enough in certain scenarios. On the other hand, the topology information can help with calibration. If it is known that two camera views overlap, the homography between them can be computed automatically. Therefore, these two problems are jointly solved in some approaches.

Multi-camera tracking requires matching tracks obtained from different camera views according to their visual and spatio-temporal similarities. Matching the appearance of image regions is studied in object re-identification. The spatio-temporal reasoning requires camera calibration and knowledge of the topology. Some studies show that the complete trajectories across camera views can be used to calibrate cameras and to compute the topology. Therefore, multi-camera tracking can be jointly solved with camera calibration and inference of the topology. Multi-camera tracking is often a pre-step for multi-camera activity analysis, which uses the complete tracks of objects over the camera network as features. It is also possible to directly model activities in multiple camera views without tracking objects across camera views. Once the models of activities are learned, they can provide useful information for multi-camera tracking: if two tracks are classified as the same activity category, they are more likely to be the same object. A good understanding of the relationship of these modules helps to design optimal multi-camera video surveillance systems that meet the requirements of different applications.
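The automatic computation of a homography between two overlapping views, mentioned above, can be sketched in a few lines. The following pure-Python example (function names and the Gaussian-elimination helper are illustrative, not from the paper) uses the standard direct linear transformation (DLT): four ground-plane point correspondences between the two views determine the eight unknowns of H (with h33 fixed to 1), after which any point in one view can be mapped into the other.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def estimate_homography(src, dst):
    """DLT: recover the 3x3 homography H (with h33 = 1) mapping src -> dst
    from four point correspondences, no three of which are collinear."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = solve_linear(A, b)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]

def apply_homography(H, p):
    """Map point p from the first view into the second view."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)
```

In practice the correspondences would come from automatically matched features in the overlap region and a robust estimator; this sketch only shows the geometric core.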

Intelligent multi-camera video surveillance faces many challenges with the fast growth of camera networks. A few of them are briefly mentioned below. More detailed discussions can be found in later sections.

  • A multi-camera video surveillance system may be applied to many different scenes and have various configurations. As the scales of camera networks increase, it is expected that multi-camera surveillance systems can self-adapt to a variety of scenes with less human intervention. For example, it is very time consuming to manually calibrate all the cameras in a large network, and the human effort has to be repeated when the configuration of the camera network changes. Therefore, automatic calibration is preferred. Object re-identification and multi-camera activity analysis prefer unsupervised approaches in order to avoid manually labeling new training samples when scenes and camera views change.

  • The topology of a large camera network could be complex and the fields of view of cameras are limited by scene structures. Some camera views are disjoint and may cover multiple ground planes. These factors bring great challenges for camera calibration, inference of topology and multi-camera tracking.

  • There are often large changes of viewpoints, illumination conditions and camera settings between different camera views. It is difficult to match the appearance of objects across camera views.

  • Many scenes of high security interest, such as airports, train stations, shopping malls and street intersections, are very crowded. It is difficult to track objects over long distances without failures because of frequent occlusions among objects in such scenes. Although some existing surveillance systems work well in sparse scenes, many challenges remain unsolved when they are applied to crowded environments.

  • In order to monitor a wide area with a small number of cameras and to acquire high resolution images from optimal viewpoints, some surveillance systems employ both static cameras and active cameras, whose panning, tilting and zooming (PTZ) parameters are automatically and dynamically controlled by the system. Calibration, motion detection, object tracking and activity analysis with hybrid cameras face many new challenges compared with only using static cameras.

This paper reviews the five key computer vision and pattern recognition technologies (i.e., multi-camera calibration, computing the topology of camera views, multi-camera tracking, object re-identification and multi-camera activity analysis) in Sections 2–6. Cooperative video surveillance with both static and active cameras is discussed in Section 7. Detailed descriptions of their technical challenges and comparisons of different solutions are provided under each topic. Finally, some unsolved challenges and future research directions are discussed in Section 8.

Section snippets

Camera calibration

Camera calibration is a fundamental problem in computer vision and is indispensable in many video surveillance applications. There has been a huge literature on calibrating camera views with respect to a 3D world coordinate system (Faugeras, 1993, Triggs, 1999, Jones et al., 2002, Hartley and Zisserman, 2004). They estimate both the intrinsic parameters (such as focal length, principal point, skew coefficients and distortion coefficients) and extrinsic parameters (such as the position of the
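To make the distinction between intrinsic and extrinsic parameters concrete, the following minimal sketch (pure Python; all parameter values are illustrative, not from any cited system) projects a 3D world point into pixel coordinates through the standard pinhole model p ~ K [R | t] X, which is what calibration recovers.

```python
def project(point_w, K, R, t):
    """Project a 3D world point into pixels via p ~ K [R | t] X."""
    # Extrinsic parameters: world -> camera coordinates, X_c = R X_w + t.
    xc = [sum(R[i][j] * point_w[j] for j in range(3)) + t[i] for i in range(3)]
    # Intrinsic parameters: focal lengths, skew and principal point in K.
    u = (K[0][0] * xc[0] + K[0][1] * xc[1] + K[0][2] * xc[2]) / xc[2]
    v = (K[1][1] * xc[1] + K[1][2] * xc[2]) / xc[2]
    return u, v

# Illustrative values: 1000-pixel focal length, principal point at the
# centre of a 640x480 image, zero skew, camera 5 units from the origin.
K = [[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity rotation
t = [0.0, 0.0, 5.0]
```

A world point on the optical axis lands on the principal point; moving it one unit sideways shifts the projection by focal_length/depth pixels, which is the geometric content of the intrinsic parameters listed above.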

Computing the topology of camera views

Topology identifies camera views that are overlapped or spatially adjacent. Spatial adjacency means that there is no other viewfield between the two camera views and hence there may potentially exist an inter-connecting pathway directly connecting tracks of objects observed in the two camera views. When an object leaves a camera view, it may reappear in some of the other adjacent camera views with certain probabilities. Due to the constraints of scene structures and the configurations of camera
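A common unsupervised cue for inferring adjacency between disjoint views is the correlation of departure and arrival times: a pronounced peak in the delay histogram suggests the two views are connected, and its location estimates the typical transition time. A minimal sketch with synthetic event times (all names and numbers illustrative):

```python
from collections import Counter

def transition_histogram(departures, arrivals, max_delay):
    """Histogram of arrival-minus-departure delays within a search window.
    A pronounced peak suggests the two camera views are adjacent; the peak
    location estimates the typical transition time between them."""
    hist = Counter()
    for d in departures:
        for a in arrivals:
            delay = a - d
            if 0 < delay <= max_delay:
                hist[delay] += 1
    return hist

# Synthetic example: objects leave camera A and appear in camera B ~5 s later.
dep = [0, 10, 20, 30]
arr = [5, 15, 25, 35]
hist = transition_histogram(dep, arr, max_delay=12)
peak = max(hist, key=hist.get)  # estimated transition time
```

Real systems accumulate these statistics over long observation periods so that accidental co-occurrences average out while the true transition delay stands out as a peak.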

Object tracking across camera views

Multi-camera tracking consists of two parts: (1) intra-camera tracking, i.e. tracking objects within a camera view; and (2) inter-camera tracking, i.e. associating the tracks of objects observed in different camera views. There is a huge literature on intra-camera tracking and a comprehensive survey can be found in (Yilmaz et al., 2006). This section focuses on inter-camera tracking, which is more challenging because (1) the prediction of the spatio-temporal information of objects across camera
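The inter-camera association step can be illustrated by a simple cost model that combines the two similarity cues discussed in this section. The sketch below (a greedy matcher with hypothetical inputs; real systems often use global assignment instead) scores each exit/entry pair by appearance distance plus a spatio-temporal penalty on how far the observed delay deviates from the expected transition time:

```python
def associate(exits, entries, mean_delay, tol, appearance_dist, max_cost):
    """Greedily match exit tracks to entry tracks across two camera views.
    exits/entries: {track_id: (timestamp, appearance_feature)}.
    Cost mixes appearance distance with a penalty on the deviation of the
    observed delay from the expected transition time."""
    pairs = []
    for i, (t_exit, feat_a) in exits.items():
        for j, (t_entry, feat_b) in entries.items():
            delay = t_entry - t_exit
            if delay <= 0 or abs(delay - mean_delay) > tol:
                continue  # spatio-temporally implausible, prune
            cost = appearance_dist(feat_a, feat_b) + abs(delay - mean_delay) / tol
            pairs.append((cost, i, j))
    pairs.sort()
    used_i, used_j, matches = set(), set(), {}
    for cost, i, j in pairs:
        if cost <= max_cost and i not in used_i and j not in used_j:
            matches[i] = j
            used_i.add(i); used_j.add(j)
    return matches
```

The pruning step is where calibration and topology knowledge enter: without an expected transition time, every exit would have to be compared against every entry on appearance alone.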

Object re-identification

In some application scenarios, the topology of a camera network and tracking information are not available, especially when the cameras are far apart and the environments are crowded. For example, only the snapshots of objects instead of tracks captured by different cameras are available. In this case spatio-temporal reasoning is not feasible or accurate for inter-camera tracking. In recent years, a lot of research work (Nakajima et al., 2003, Bird et al., 2005, Javed et al., 2005, Shan
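The appearance-only matching at the core of re-identification can be sketched with one of its simplest descriptors, a normalized color histogram compared by Bhattacharyya distance. This is only a baseline illustration (function names and the threshold are assumptions, not the paper's method); published approaches use far more discriminative features and learned metrics:

```python
import math

def normalized_hist(values, bins, lo, hi):
    """Quantize values (e.g. pixel hues) into a normalized histogram,
    a crude appearance descriptor for an image region."""
    h = [0.0] * bins
    for v in values:
        idx = min(bins - 1, int((v - lo) / (hi - lo) * bins))
        h[idx] += 1.0
    total = sum(h) or 1.0
    return [x / total for x in h]

def bhattacharyya_dist(p, q):
    """0 for identical distributions; larger means less similar."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - bc))

def reidentify(query, gallery, threshold=0.3):
    """Return the gallery id whose descriptor is closest to the query,
    or None if nothing is similar enough."""
    best_id, best_d = None, threshold
    for gid, h in gallery.items():
        d = bhattacharyya_dist(query, h)
        if d < best_d:
            best_id, best_d = gid, d
    return best_id
```

The difficulty emphasized in this section is precisely that such raw color statistics drift badly under the viewpoint, illumination and camera-setting changes between views, which motivates illumination-invariant features and cross-view metric learning.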

Multi-camera activity analysis

Activity analysis is a key task in video surveillance. It classifies activities into different categories and discovers typical and abnormal activities. The proposed approaches fall into two categories. The supervised approaches (Murata, 1989, Bobick and Ivanov, 1998, Oliver et al., 2000, Smith et al., 2005) require manually labeling training samples. However, since the observations of activities change dramatically in different camera views, it often requires relabeling training
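The unsupervised alternative can be illustrated by a toy abnormality detector: learn the frequencies of discretized trajectory transitions from unlabeled data, then score new trajectories by how unlikely their transitions are. This sketch (all names and the scoring scheme are illustrative simplifications of the statistical models cited in this section) needs no manual labels, which is why such approaches transfer more easily when scenes and camera views change:

```python
import math
from collections import Counter

def train_activity_model(trajectories):
    """Learn transition frequencies from trajectories, each a sequence of
    discretized scene cells; no manual labels are required."""
    counts, total = Counter(), 0
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            counts[(a, b)] += 1
            total += 1
    return counts, total

def abnormality_score(traj, counts, total, floor=1e-6):
    """Average negative log-likelihood of a trajectory's transitions;
    rare or unseen transitions yield high scores."""
    score = 0.0
    steps = max(1, len(traj) - 1)
    for a, b in zip(traj, traj[1:]):
        p = counts.get((a, b), 0) / total if total else 0.0
        score += -math.log(max(p, floor))
    return score / steps
```

Trajectories that follow frequently observed transitions score low, while a trajectory moving against the learned flow scores high and can be flagged as abnormal.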

Cooperative video surveillance with static and active cameras

Many techniques discussed above are applied to static cameras. With a limited number of static cameras to monitor a large area, the observed objects are often small in size and there exist gaps between camera views. By including active cameras, whose panning, tilting and zooming (PTZ) parameters are automatically and dynamically controlled by the systems, the performance of video surveillance can be significantly improved (Collins et al., 2001, Collins et al., 2002, Matsuyama and Ukita, 2002,
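A basic building block of such active control is steering a PTZ camera so that a target detected in the image is brought to the image centre for a high-resolution view. Under an idealized pinhole model (an assumption for illustration; real PTZ control must handle calibration error, latency and mechanical limits), the required angular offsets follow directly from the pixel displacement and the focal length:

```python
import math

def pan_tilt_to_center(u, v, width, height, focal_px):
    """Angular offsets (radians) an idealized PTZ camera would add to its
    current pan/tilt to bring pixel (u, v) to the image centre. Assumes a
    pinhole model with the optical axis through the image centre."""
    dx = u - width / 2.0
    dy = v - height / 2.0
    pan = math.atan2(dx, focal_px)   # positive: rotate right
    tilt = math.atan2(dy, focal_px)  # positive: rotate down
    return pan, tilt
```

In a cooperative system, a static master camera detecting a target would convert the detection into such offsets for the active camera, which then zooms in to acquire a close-up from the optimal viewpoint.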

Discussion and conclusions

By employing distributed camera networks, video surveillance systems substantially extend their capabilities and improve their robustness through data fusion and cooperative sensing. With multi-camera surveillance systems, activities in wide areas are analyzed, the accuracy and robustness of object tracking are improved by fusing data from multiple camera views, and one camera hands objects over to another camera to realize tracking over long distances without breaks. As the sizes and

Acknowledgements

This work is supported by the General Research Fund sponsored by the Research Grants Council of Hong Kong (Projects Nos. CUHK417110 and CUHK417011) and National Natural Science Foundation of China (Project No. 61005057).

References (243)

  • Antone, M., Bosse, M., 2004. Calibration of outdoor cameras from cast shadows. In: Proc. IEEE Internat. Conf. Systems,...
  • Azzari, P., Stefano, D.L., Bevilacqua, A., 2005. An effective real-time mosaicing algorithm apt to detect motion...
  • Bajcsy, R., 1985. Active perception vs. passive perception. In: Proc. IEEE Workshop on Computer Vision: Representation...
  • Baker, P., Aloimonos, Y., 2003. Calibration of a multicamera network. In: Proc. Omnivis 2003: Omnidirectional Vision...
  • Bakhtari, A., et al., 2007. An active vision system for multitarget surveillance in dynamic environments. IEEE Trans. Syst. Man Cybernet.
  • Bakhtari, A., et al., 2006. Active-vision-based multisensor surveillance – An implementation. IEEE Trans. Syst. Man Cybernet.
  • Bakhtari, A., et al., 2009. Active-vision for the autonomous surveillance of dynamic, multi-object environments. J. Intell. Robot Syst.
  • Bartoli, A., Dalal, N., Bose, B., Horaud, R., 2002. From video sequences to motion panoramas. In: Proc. IEEE Workshop...
  • Bay, H., Tuytelaars, T., Gool, L.V., 2006. SURF: Speeded up robust features. In: Proc. European Conf. Computer...
  • Beardsley, P., Murray, D., 1992. Camera calibration using vanishing points. In: Proc. British Machine Vision...
  • Belongie, S., et al., 2002. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Machine Intell.
  • Berclaz, J., Fleuret, F., Fua, P., 2008. Multi-camera tracking and atypical motion detection with behavioral maps. In:...
  • Bevilacqua, A., Azzari, P., 2006. High-quality real time motion detection using PTZ cameras. In: Proc. Advanced Video...
  • Bevilacqua, A., Azzari, P., 2007. A fast and reliable image mosaicing technique with application to wide area motion...
  • Bevilacqua, A., Stefano, L.D., Azzari, P., 2005. An effective real-time mosaicing algorithm apt to detect motion...
  • Bhat, K.S., Saptharishi, M., Khosla, P.K., 2000. Motion detection and segmentation using image mosaics. In: Proc. IEEE...
  • Bird, N., et al., 2005. Detection of loitering individuals in public transportation areas. IEEE Trans. Intell. Transport. Syst.
  • Black, J., Ellis, T.J., Rosin, P., 2002. Multi view image surveillance and tracking. In: Proc. IEEE Workshop on Motion...
  • Blake, A., Yuille, A., 1993. Active Vision. MIT...
  • Blei, D.M., et al., 2003. Latent Dirichlet allocation. J. Machine Learn. Res.
  • Bobick, A.F., Ivanov, Y.A., 1998. Action recognition using probabilistic parsing. In: Proc. IEEE Internat. Conf....
  • Bose, B., Grimson, E., 2003. Ground plane rectification by tracking moving objects. In: Proc. Workshop on Visual...
  • Brand, M., et al., 2000. Discovery and segmentation of activities in video. IEEE Trans. Pattern Anal. Machine Intell.
  • Brown, M., Lowe, D., 2003. Recognising panoramas. In: Proc. IEEE Internat. Conf. Computer...
  • Cai, Q., et al., 1996. Tracking human motion in structured environments using a distributed-camera system. IEEE Trans. Pattern Anal. Machine Intell.
  • Cao, X., Foroosh, H., 2006. Camera calibration and light source orientation from solar shadows. Journal of Computer...
  • Capel, D.P., 2001. Image Mosaicing and Super-resolution, Ph.D. Thesis. University of...
  • Caprile, B., et al., 1990. Using vanishing points for camera calibration. Internat. J. Comput. Vision.
  • Carneiro, G., Lowe, D., 2006. Sparse flexible models of local features. In: Proc. European Conf. Computer...
  • Caspi, Y., Irani, M., 2000. A step towards sequence-to-sequence alignment. In: Proc. IEEE Internat. Conf. Computer...
  • Caspi, Y., et al., 2006. Feature-based sequence-to-sequence matching. Internat. J. Comput. Vision.
  • Chang, T.H., Gong, S., 2001. Tracking multiple people with a multi-camera system. In: Proc. IEEE Internat. Conf....
  • Chaumette, F., et al., 2006. Visual servo control, Part I: Basic approaches. IEEE Robot. Automat. Mag.
  • Chaumette, F., et al., 2007. Visual servo control, Part II: Advanced approaches. IEEE Robot. Automat. Mag.
  • Chen, K., Lai, C., Hung, Y., Chen, C., 2008. An adaptive learning method for target tracking across multiple cameras....
  • Chen, C., Yao, Y., Dira, A., Koschan, A., Abidi, M., 2009. Cooperative mapping of multiple PTZ cameras in automated...
  • Cheng, E.D., Piccardi, M., 2006. Matching of objects moving across disjoint cameras. In: Proc. IEEE Internat. Conf....
  • Cipolla, R., Drummond, T., Robertson, D.P., 1999. Camera calibration from vanishing points in images of architectural...
  • Collins, R.T., et al., 2001. Algorithms for cooperative multisensor surveillance. Proc. IEEE.
  • Collins, R., Amidi, O., Kanade, T., 2002. An active camera system for acquiring multi-view video. In: Proc. IEEE...