Intelligent multi-camera video surveillance: A review

https://doi.org/10.1016/j.patrec.2012.07.005

Abstract

Intelligent multi-camera video surveillance is a multidisciplinary field related to computer vision, pattern recognition, signal processing, communication, embedded computing and image sensors. This paper reviews the recent development of relevant technologies from the perspectives of computer vision and pattern recognition. The covered topics include multi-camera calibration, computing the topology of camera networks, multi-camera tracking, object re-identification, multi-camera activity analysis and cooperative video surveillance with both active and static cameras. Detailed descriptions of their technical challenges and comparisons of different solutions are provided. The paper emphasizes the connection and integration of different modules in various environments and application scenarios. According to the most recent works, some problems can be jointly solved in order to improve efficiency and accuracy. With the fast development of surveillance systems, the scales and complexities of camera networks are increasing and the monitored environments are becoming more complicated and crowded. This paper discusses how to face these emerging challenges.

Highlights

► Review major modules and research topics on multi-camera video surveillance. ► Review technologies from the perspective of computer vision and pattern recognition. ► Detailed descriptions of technical challenges and comparison of different solutions. ► Emphasizes the connection and integration of different modules. ► Some problems can be jointly solved to improve efficiency and accuracy.

Introduction

Intelligent video surveillance has been one of the most active research areas in computer vision. The goal is to efficiently extract useful information from the huge amount of video collected by surveillance cameras by automatically detecting, tracking and recognizing objects of interest, and understanding and analyzing their activities. Video surveillance has a wide variety of applications in both public and private environments, such as homeland security, crime prevention, traffic control, accident prediction and detection, and monitoring patients, the elderly and children at home. These applications require monitoring indoor and outdoor scenes of airports, train stations, highways, parking lots, stores, shopping malls and offices. There is an increasing interest in video surveillance due to the growing availability of cheap sensors and processors, and also a growing need for safety and security from the public. Nowadays there are tens of thousands of cameras in a city collecting a huge amount of data on a daily basis. Researchers are urged to develop intelligent systems to efficiently extract information from large scale data.

The view of a single camera is finite and limited by scene structures. In order to monitor a wide area, such as tracking a vehicle traveling through the road network of a city or analyzing the global activities happening in a large train station, video streams from multiple cameras have to be used. Many intelligent multi-camera video surveillance systems have been developed (Collins et al., 2001; Aghajan and Cavallaro, 2009; Valera and Velastin, 2004). Multi-camera surveillance is a multidisciplinary field related to computer vision, pattern recognition, signal processing, communication, embedded computing and image sensors. This paper reviews the recent development of relevant technologies from the perspective of computer vision. Some key computer vision technologies used in multi-camera surveillance systems are shown in Fig. 1.

  1. Multi-camera calibration maps different camera views to a single coordinate system. In many surveillance systems, it is a key pre-step for other multi-camera based analysis.

  2. The topology of a camera network identifies whether camera views are overlapped or spatially adjacent and describes the transition time of objects between camera views.

  3. Object re-identification matches two image regions observed in different camera views and recognizes whether they belong to the same object, purely based on appearance information without spatio-temporal reasoning.

  4. Multi-camera tracking tracks objects across camera views.

  5. Multi-camera activity analysis automatically recognizes activities of different categories and detects abnormal activities in a large area by fusing information from multiple camera views.

Different modules support one another and the arrows in Fig. 1 show the information flow between them.

While some existing reviews (Valera and Velastin, 2004; Aghajan and Cavallaro, 2009) tried to cover all the aspects of architectures, technologies and applications, this paper emphasizes the connection and integration of these key computer vision and pattern recognition technologies in various environments and application scenarios and reviews their most recent development. Many existing surveillance systems solve these problems sequentially according to a pipeline. However, recent research shows that some of these problems can be jointly solved or even be skipped in order to overcome the challenges posed by certain application scenarios. For example, while it is easy to compute the topology of a camera network after cameras are well calibrated, some approaches compute the topology without camera calibration, because existing calibration methods have various limitations and may not be efficient or accurate enough in certain scenarios. On the other hand, the topology information can help with calibration. If it is known that two camera views overlap, the homography between them can be computed automatically. Therefore, these two problems are jointly solved in some approaches.

Multi-camera tracking requires matching tracks obtained from different camera views according to their visual and spatio-temporal similarities. Matching the appearance of image regions is studied in object re-identification. The spatio-temporal reasoning requires camera calibration and knowledge of the topology. Some studies show that the complete trajectories across camera views can be used to calibrate cameras and to compute the topology. Therefore, multi-camera tracking can be jointly solved with camera calibration and inference of the topology. Multi-camera tracking is often a pre-step for multi-camera activity analysis, which uses the complete tracks of objects over the camera network as features. It is also possible to directly model activities in multiple camera views without tracking objects across camera views. Once the models of activities are learned, they can provide useful information for multi-camera tracking: if two tracks are classified as the same activity category, they are more likely to be the same object. A good understanding of the relationship of these modules helps to design optimal multi-camera video surveillance systems that meet the requirements of different applications.
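The automatic computation of a homography between two overlapping views, mentioned above, can be sketched in a few lines. The following pure-Python example (function names and the Gaussian-elimination helper are illustrative, not from the paper) uses the standard direct linear transformation (DLT): four ground-plane point correspondences between the two views determine the eight unknowns of H (with h33 fixed to 1), after which any point in one view can be mapped into the other.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def estimate_homography(src, dst):
    """DLT: recover the 3x3 homography H (with h33 = 1) mapping src -> dst
    from four point correspondences, no three of which are collinear."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = solve_linear(A, b)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]

def apply_homography(H, p):
    """Map point p from the first view into the second view."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)
```

In practice the correspondences would come from automatically matched features in the overlap region and a robust estimator; this sketch only shows the geometric core.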

Intelligent multi-camera video surveillance faces many challenges with the fast growth of camera networks. A few of them are briefly mentioned below. More detailed discussions can be found in later sections.

  • A multi-camera video surveillance system may be applied to many different scenes and have various configurations. As the scales of camera networks increase, it is expected that multi-camera surveillance systems can self-adapt to a variety of scenes with less human intervention. For example, it is very time consuming to manually calibrate all the cameras in a large network, and the human effort has to be repeated when the configuration of the camera network changes. Therefore, automatic calibration is preferred. Object re-identification and multi-camera activity analysis prefer unsupervised approaches in order to avoid manually labeling new training samples when scenes and camera views change.

  • The topology of a large camera network could be complex and the fields of view of cameras are limited by scene structures. Some camera views are disjoint and may cover multiple ground planes. These factors bring great challenges for camera calibration, inference of topology and multi-camera tracking.

  • There are often large changes of viewpoints, illumination conditions and camera settings between different camera views. It is difficult to match the appearance of objects across camera views.

  • Many scenes of high security interest, such as airports, train stations, shopping malls and street intersections, are very crowded. It is difficult to track objects over long distances without failures because of frequent occlusions among objects in such scenes. Although some existing surveillance systems work well in sparse scenes, many challenges remain unsolved when they are applied to crowded environments.

  • In order to monitor a wide area with a small number of cameras and to acquire high resolution images from optimal viewpoints, some surveillance systems employ both static cameras and active cameras, whose panning, tilting and zooming (PTZ) parameters are automatically and dynamically controlled by the system. Calibration, motion detection, object tracking and activity analysis with hybrid cameras face many new challenges compared with only using static cameras.

This paper reviews the five key computer vision and pattern recognition technologies (i.e., multi-camera calibration, computing the topology of camera views, multi-camera tracking, object re-identification and multi-camera activity analysis) in Sections 2–6. Cooperative video surveillance with both static and active cameras is discussed in Section 7. Detailed descriptions of their technical challenges and comparisons of different solutions are provided under each topic. Finally, some unsolved challenges and future research directions are discussed in Section 8.

Section snippets

Camera calibration

Camera calibration is a fundamental problem in computer vision and is indispensable in many video surveillance applications. There has been a huge literature on calibrating camera views with respect to a 3D world coordinate system (Faugeras, 1993, Triggs, 1999, Jones et al., 2002, Hartley and Zisserman, 2004). They estimate both the intrinsic parameters (such as focal length, principal point, skew coefficients and distortion coefficients) and extrinsic parameters (such as the position of the
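To make the distinction between intrinsic and extrinsic parameters concrete, the following minimal sketch (pure Python; all parameter values are illustrative, not from any cited system) projects a 3D world point into pixel coordinates through the standard pinhole model p ~ K [R | t] X, which is what calibration recovers.

```python
def project(point_w, K, R, t):
    """Project a 3D world point into pixels via p ~ K [R | t] X."""
    # Extrinsic parameters: world -> camera coordinates, X_c = R X_w + t.
    xc = [sum(R[i][j] * point_w[j] for j in range(3)) + t[i] for i in range(3)]
    # Intrinsic parameters: focal lengths, skew and principal point in K.
    u = (K[0][0] * xc[0] + K[0][1] * xc[1] + K[0][2] * xc[2]) / xc[2]
    v = (K[1][1] * xc[1] + K[1][2] * xc[2]) / xc[2]
    return u, v

# Illustrative values: 1000-pixel focal length, principal point at the
# centre of a 640x480 image, zero skew, camera 5 units from the origin.
K = [[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity rotation
t = [0.0, 0.0, 5.0]
```

A world point on the optical axis lands on the principal point; moving it one unit sideways shifts the projection by focal_length/depth pixels, which is the geometric content of the intrinsic parameters listed above.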

Computing the topology of camera views

Topology identifies camera views that are overlapped or spatially adjacent. Spatial adjacency means that there is no other viewfield between the two camera views and hence there may potentially exist an inter-connecting pathway directly connecting tracks of objects observed in the two camera views. When an object leaves a camera view, it may reappear in some of the other adjacent camera views with certain probabilities. Due to the constraints of scene structures and the configurations of camera
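A common unsupervised cue for inferring adjacency between disjoint views is the correlation of departure and arrival times: a pronounced peak in the delay histogram suggests the two views are connected, and its location estimates the typical transition time. A minimal sketch with synthetic event times (all names and numbers illustrative):

```python
from collections import Counter

def transition_histogram(departures, arrivals, max_delay):
    """Histogram of arrival-minus-departure delays within a search window.
    A pronounced peak suggests the two camera views are adjacent; the peak
    location estimates the typical transition time between them."""
    hist = Counter()
    for d in departures:
        for a in arrivals:
            delay = a - d
            if 0 < delay <= max_delay:
                hist[delay] += 1
    return hist

# Synthetic example: objects leave camera A and appear in camera B ~5 s later.
dep = [0, 10, 20, 30]
arr = [5, 15, 25, 35]
hist = transition_histogram(dep, arr, max_delay=12)
peak = max(hist, key=hist.get)  # estimated transition time
```

Real systems accumulate these statistics over long observation periods so that accidental co-occurrences average out while the true transition delay stands out as a peak.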

Object tracking across camera views

Multi-camera tracking consists of two parts: (1) intra-camera tracking, i.e. tracking objects within a camera view; and (2) inter-camera tracking, i.e. associating the tracks of objects observed in different camera views. There is a huge literature on intra-camera tracking and a comprehensive survey can be found in (Yilmaz et al., 2006). This section focuses on inter-camera tracking, which is more challenging because (1) the prediction of the spatio-temporal information of objects across camera
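The inter-camera association step can be illustrated by a simple cost model that combines the two similarity cues discussed in this section. The sketch below (a greedy matcher with hypothetical inputs; real systems often use global assignment instead) scores each exit/entry pair by appearance distance plus a spatio-temporal penalty on how far the observed delay deviates from the expected transition time:

```python
def associate(exits, entries, mean_delay, tol, appearance_dist, max_cost):
    """Greedily match exit tracks to entry tracks across two camera views.
    exits/entries: {track_id: (timestamp, appearance_feature)}.
    Cost mixes appearance distance with a penalty on the deviation of the
    observed delay from the expected transition time."""
    pairs = []
    for i, (t_exit, feat_a) in exits.items():
        for j, (t_entry, feat_b) in entries.items():
            delay = t_entry - t_exit
            if delay <= 0 or abs(delay - mean_delay) > tol:
                continue  # spatio-temporally implausible, prune
            cost = appearance_dist(feat_a, feat_b) + abs(delay - mean_delay) / tol
            pairs.append((cost, i, j))
    pairs.sort()
    used_i, used_j, matches = set(), set(), {}
    for cost, i, j in pairs:
        if cost <= max_cost and i not in used_i and j not in used_j:
            matches[i] = j
            used_i.add(i); used_j.add(j)
    return matches
```

The pruning step is where calibration and topology knowledge enter: without an expected transition time, every exit would have to be compared against every entry on appearance alone.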

Object re-identification

In some application scenarios, the topology of a camera network and tracking information are not available, especially when the cameras are far apart and the environments are crowded. For example, only the snapshots of objects instead of tracks captured by different cameras are available. In this case spatio-temporal reasoning is not feasible or accurate for inter-camera tracking. In recent years, a lot of research work (Nakajima et al., 2003, Bird et al., 2005, Javed et al., 2005, Shan
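The appearance-only matching at the core of re-identification can be sketched with one of its simplest descriptors, a normalized color histogram compared by Bhattacharyya distance. This is only a baseline illustration (function names and the threshold are assumptions, not the paper's method); published approaches use far more discriminative features and learned metrics:

```python
import math

def normalized_hist(values, bins, lo, hi):
    """Quantize values (e.g. pixel hues) into a normalized histogram,
    a crude appearance descriptor for an image region."""
    h = [0.0] * bins
    for v in values:
        idx = min(bins - 1, int((v - lo) / (hi - lo) * bins))
        h[idx] += 1.0
    total = sum(h) or 1.0
    return [x / total for x in h]

def bhattacharyya_dist(p, q):
    """0 for identical distributions; larger means less similar."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - bc))

def reidentify(query, gallery, threshold=0.3):
    """Return the gallery id whose descriptor is closest to the query,
    or None if nothing is similar enough."""
    best_id, best_d = None, threshold
    for gid, h in gallery.items():
        d = bhattacharyya_dist(query, h)
        if d < best_d:
            best_id, best_d = gid, d
    return best_id
```

The difficulty emphasized in this section is precisely that such raw color statistics drift badly under the viewpoint, illumination and camera-setting changes between views, which motivates illumination-invariant features and cross-view metric learning.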

Multi-camera activity analysis

Activity analysis is a key task in video surveillance. It classifies activities into different categories and discovers typical and abnormal activities. The proposed approaches fall into two categories. The supervised approaches (Murata, 1989, Bobick and Ivanov, 1998, Oliver et al., 2000, Smith et al., 2005) require manually labeling training samples. However, since the observations of activities change dramatically in different camera views, it often requires relabeling training
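The unsupervised alternative can be illustrated by a toy abnormality detector: learn the frequencies of discretized trajectory transitions from unlabeled data, then score new trajectories by how unlikely their transitions are. This sketch (all names and the scoring scheme are illustrative simplifications of the statistical models cited in this section) needs no manual labels, which is why such approaches transfer more easily when scenes and camera views change:

```python
import math
from collections import Counter

def train_activity_model(trajectories):
    """Learn transition frequencies from trajectories, each a sequence of
    discretized scene cells; no manual labels are required."""
    counts, total = Counter(), 0
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            counts[(a, b)] += 1
            total += 1
    return counts, total

def abnormality_score(traj, counts, total, floor=1e-6):
    """Average negative log-likelihood of a trajectory's transitions;
    rare or unseen transitions yield high scores."""
    score = 0.0
    steps = max(1, len(traj) - 1)
    for a, b in zip(traj, traj[1:]):
        p = counts.get((a, b), 0) / total if total else 0.0
        score += -math.log(max(p, floor))
    return score / steps
```

Trajectories that follow frequently observed transitions score low, while a trajectory moving against the learned flow scores high and can be flagged as abnormal.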

Cooperative video surveillance with static and active cameras

Many techniques discussed above are applied to static cameras. With a limited number of static cameras to monitor a large area, the observed objects are often small in size and there exist gaps between camera views. By including active cameras, whose panning, tilting and zooming (PTZ) parameters are automatically and dynamically controlled by the systems, the performance of video surveillance can be significantly improved (Collins et al., 2001, Collins et al., 2002, Matsuyama and Ukita, 2002,
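A basic building block of such active control is steering a PTZ camera so that a target detected in the image is brought to the image centre for a high-resolution view. Under an idealized pinhole model (an assumption for illustration; real PTZ control must handle calibration error, latency and mechanical limits), the required angular offsets follow directly from the pixel displacement and the focal length:

```python
import math

def pan_tilt_to_center(u, v, width, height, focal_px):
    """Angular offsets (radians) an idealized PTZ camera would add to its
    current pan/tilt to bring pixel (u, v) to the image centre. Assumes a
    pinhole model with the optical axis through the image centre."""
    dx = u - width / 2.0
    dy = v - height / 2.0
    pan = math.atan2(dx, focal_px)   # positive: rotate right
    tilt = math.atan2(dy, focal_px)  # positive: rotate down
    return pan, tilt
```

In a cooperative system, a static master camera detecting a target would convert the detection into such offsets for the active camera, which then zooms in to acquire a close-up from the optimal viewpoint.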

Discussion and conclusions

By employing distributed camera networks, video surveillance systems substantially extend their capabilities and improve their robustness through data fusion and cooperative sensing. With multi-camera surveillance systems, activities in wide areas are analyzed, the accuracy and robustness of object tracking are improved by fusing data from multiple camera views, and one camera hands objects over to another camera to realize tracking over long distances without breaks. As the sizes and

Acknowledgements

This work is supported by the General Research Fund sponsored by the Research Grants Council of Hong Kong (Projects Nos. CUHK417110 and CUHK417011) and National Natural Science Foundation of China (Project No. 61005057).

References (243)

  • Antone, M., Bosse, M., 2004. Calibration of outdoor cameras from cast shadows. In: Proc. IEEE Internat. Conf. Systems,...
  • Azzari, P., Stefano, D.L., Bevilacqua, A., 2005. An effective real-time mosaicing algorithm apt to detect motion...
  • Bajcsy, R., 1985. Active perception vs. passive perception. In: Proc. IEEE Workshop on Computer Vision: Representation...
  • Baker, P., Aloimonos, Y., 2003. Calibration of a multicamera network. In: Proc. Omnivis 2003: Omnidirectional Vision...
  • Bakhtari, A., et al., 2007. An active vision system for multitarget surveillance in dynamic environments. IEEE Trans. Syst. Man Cybernet.
  • Bakhtari, A., et al., 2006. Active-vision-based multisensor surveillance – An implementation. IEEE Trans. Syst. Man Cybernet.
  • Bakhtari, A., et al., 2009. Active-vision for the autonomous surveillance of dynamic, multi-object environments. J. Intell. Robot Syst.
  • Bartoli, A., Dalal, N., Bose, B., Horaud, R., 2002. From video sequences to motion panoramas. In: Proc. IEEE Workshop...
  • Bay, H., Tuytelaars, T., Gool, L.V., 2006. SURF: Speeded up robust features. In: Proc. European Conf. Computer...
  • Beardsley, P., Murray, D., 1992. Camera calibration using vanishing points. In: Proc. British Machine Vision...
  • Belongie, S., et al., 2002. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Machine Intell.
  • Berclaz, J., Fleuret, F., Fua, P., 2008. Multi-camera tracking and atypical motion detection with behavioral maps. In:...
  • Bevilacqua, A., Azzari, P., 2006. High-quality real time motion detection using PTZ cameras. In: Proc. Advanced Video...
  • Bevilacqua, A., Azzari, P., 2007. A fast and reliable image mosaicing technique with application to wide area motion...
  • Bevilacqua, A., Stefano, L.D., Azzari, P., 2005. An effective real-time mosaicing algorithm apt to detect motion...
  • Bhat, K.S., Saptharishi, M., Khosla, P.K., 2000. Motion detection and segmentation using image mosaics. In: Proc. IEEE...
  • Bird, N., et al., 2005. Detection of loitering individuals in public transportation areas. IEEE Trans. Intell. Transport. Syst.
  • Black, J., Ellis, T.J., Rosin, P., 2002. Multi view image surveillance and tracking. In: Proc. IEEE Workshop on Motion...
  • Blake, A., Yuille, A., 1993. Active Vision. MIT...
  • Blei, D.M., et al., 2003. Latent Dirichlet allocation. J. Machine Learn. Res.
  • Bobick, A.F., Ivanov, Y.A., 1998. Action recognition using probabilistic parsing. In: Proc. IEEE Internat. Conf....
  • Bose, B., Grimson, E., 2003. Ground plane rectification by tracking moving objects. In: Proc. Workshop on Visual...
  • Brand, M., et al., 2000. Discovery and segmentation of activities in video. IEEE Trans. Pattern Anal. Machine Intell.
  • Brown, M., Lowe, D., 2003. Recognising panoramas. In: Proc. IEEE Internat. Conf. Computer...
  • Cai, Q., et al., 1996. Tracking human motion in structured environments using a distributed-camera system. IEEE Trans. Pattern Anal. Machine Intell.
  • Cao, X., Foroosh, H., 2006. Camera calibration and light source orientation from solar shadows. Journal of Computer...
  • Capel, D.P., 2001. Image Mosaicing and Super-resolution, Ph.D. Thesis. University of...
  • Caprile, B., et al., 1990. Using vanishing points for camera calibration. Internat. J. Comput. Vision.
  • Carneiro, G., Lowe, D., 2006. Sparse flexible models of local features. In: Proc. European Conf. Computer...
  • Caspi, Y., Irani, M., 2000. A step towards sequence-to-sequence alignment. In: Proc. IEEE Internat. Conf. Computer...
  • Caspi, Y., et al., 2006. Feature-based sequence-to-sequence matching. Internat. J. Comput. Vision.
  • Chang, T.H., Gong, S., 2001. Tracking multiple people with a multi-camera system. In: Proc. IEEE Internat. Conf....
  • Chaumette, F., et al., 2006. Visual servo control, Part I: Basic approaches. IEEE Robot. Automat. Mag.
  • Chaumette, F., et al., 2007. Visual servo control, Part II: Advanced approaches. IEEE Robot. Automat. Mag.
  • Chen, K., Lai, C., Hung, Y., Chen, C., 2008. An adaptive learning method for target tracking across multiple cameras....
  • Chen, C., Yao, Y., Dira, A., Koschan, A., Abidi, M., 2009. Cooperative mapping of multiple PTZ cameras in automated...
  • Cheng, E.D., Piccardi, M., 2006. Matching of objects moving across disjoint cameras. In: Proc. IEEE Internat. Conf....
  • Cipolla, R., Drummond, T., Robertson, D.P., 1999. Camera calibration from vanishing points in images of architectural...
  • Collins, R.T., et al., 2001. Algorithms for cooperative multisensor surveillance. Proc. IEEE.
  • Collins, R., Amidi, O., Kanade, T., 2002. An active camera system for acquiring multi-view video. In: Proc. IEEE...