1 Introduction
2 Related Work
2.1 Definition of the operational area
2.2 Semantic segmentation
2.3 Monocular people detection
2.4 Projection of per-camera detections
2.5 Fusion and refinement of per-camera detections
2.6 Improving detection localization
3 Proposed pedestrian detection method
3.1 Preliminaries
3.2 Pedestrian semantic filtering
3.3 Automatic definition of the \({\mathcal {AOI}}\)
3.3.1 Detection filtering
3.4 Fusion of multi-camera detections
3.5 Semantic-driven back-projection
3.5.1 The problem of back-projecting 3D detections
3.5.2 Method overview
3.5.3 Iterative steepest ascent algorithm
4 Results
4.1 Evaluation framework
4.1.1 Datasets
- EPFL Wildtrack [4, 5]: a challenging multi-camera dataset explicitly designed to evaluate deep learning approaches. It was recorded with 7 HD cameras with overlapping fields of view. Pedestrian annotations are provided for 400 frames, all of which define the evaluation set used in this paper.
- PETS 2009 [12]: the most widely used video sequences from this benchmark dataset have been chosen:
  - PETS 2009 S2 L1, which contains 795 frames of a medium-density crowd recorded by eight cameras. In this evaluation, we select only 4 of them: view 1 (far-field view) and views 5, 6 and 8 (close-up views).
  - PETS 2009 City Center (CC), recorded using only two far-field cameras, with around 1 minute of annotated footage (400 frames per camera).
Dataset | Resolution | Cameras | Frames | Pedestrian Density | Occlusion Level | Camera PoV | GT Annotations (Cameras)
---|---|---|---|---|---|---|---
PETS 2009 CC [12] | 768\(\times \)576 | 2 | 795 | + | + | High | Frame (1)
PETS 2009 S2 L1 [12] | 768\(\times \)576 | 4 | 795 | + | + | High | Frame (1)
EPFL Terrace | 366\(\times \)288 | 4 | 5008 | ++ | ++ | Low | Frame (4)
EPFL RLC | 480\(\times \)270 | 3 | 1197 | ++ | ++ | Low | Frame (1)
EPFL Wildtrack [4, 5] | 1920\(\times \)1080 | 7 | 401 | +++ | +++ | Medium | Ground Plane (NA)
4.1.2 Performance indicators
4.2 System setup
Ablation results per dataset. Column groups, left to right: PETS 2009 CC, PETS 2009 S2 L1, EPFL Terrace and EPFL RLC; each group reports AUC, F-Score (F-S), N-MODA (NA) and N-MODP (NP).

Detector | Filt | Fus & BP | AUC | F-S | NA | NP | AUC | F-S | NA | NP | AUC | F-S | NA | NP | AUC | F-S | NA | NP
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Faster-RCNN | | | 0.90 | 0.91 | 0.85 | 0.76 | 0.90 | 0.91 | 0.85 | 0.76 | 0.82 | 0.84 | 0.71 | 0.74 | 0.77 | 0.78 | 0.58 | 0.69
 | \(\checkmark \) | | 0.90 | 0.91 | 0.85 | 0.76 | 0.90 | 0.91 | 0.85 | 0.76 | 0.84 | 0.85 | 0.73 | 0.74 | 0.80 | 0.82 | 0.68 | 0.70
 | \(\checkmark \) | \(\checkmark \) | 0.94 | 0.94 | 0.88 | 0.79 | 0.92 | 0.93 | 0.89 | 0.79 | 0.87 | 0.90 | 0.83 | 0.77 | 0.81 | 0.81 | 0.68 | 0.70
YOLOv3 | | | 0.92 | 0.92 | 0.87 | 0.79 | 0.96 | 0.96 | 0.92 | 0.67 | 0.83 | 0.87 | 0.76 | 0.73 | 0.80 | 0.78 | 0.59 | 0.72
 | \(\checkmark \) | | 0.92 | 0.92 | 0.87 | 0.79 | 0.96 | 0.96 | 0.92 | 0.67 | 0.84 | 0.87 | 0.77 | 0.73 | 0.85 | 0.83 | 0.66 | 0.72
 | \(\checkmark \) | \(\checkmark \) | 0.94 | 0.94 | 0.88 | 0.79 | 0.93 | 0.92 | 0.89 | 0.67 | 0.86 | 0.89 | 0.85 | 0.76 | 0.85 | 0.83 | 0.68 | 0.72
EfficientDet | | | 0.97 | 0.97 | 0.94 | 0.66 | 0.97 | 0.97 | 0.94 | 0.67 | 0.82 | 0.84 | 0.71 | 0.78 | 0.82 | 0.80 | 0.61 | 0.72
 | \(\checkmark \) | | 0.97 | 0.97 | 0.94 | 0.66 | 0.97 | 0.97 | 0.94 | 0.66 | 0.82 | 0.87 | 0.76 | 0.78 | 0.85 | 0.83 | 0.68 | 0.72
 | \(\checkmark \) | \(\checkmark \) | 0.95 | 0.94 | 0.88 | 0.75 | 0.94 | 0.94 | 0.88 | 0.67 | 0.86 | 0.89 | 0.83 | 0.76 | 0.83 | 0.84 | 0.71 | 0.72
4.3 Results overview
- Ablation studies, conducted on four of the described datasets (Terrace, PETS 2009 S2 L1, PETS 2009 CC and RLC), aim to gauge the impact of the different stages on the performance of the proposed approach. To this end, the following versions of the proposed method are compared (see the configuration sketch after this list):
  1. "Baseline" (Faster-RCNN, YOLOv3 and EfficientDet-D7) provides reference results of monocamera pedestrian detectors.
  2. "Baseline + Filtering (Filt)" is a simplified version of our method which independently evaluates the effect of the proposed automatic \(\mathcal {AOI}\) computation obtained by the "Pedestrian Semantic Filtering" stage.
  3. "Baseline + Filtering (Filt) + Fusion (Fus) + Back-Projection (BP)" is the full version of the proposed method, which additionally evaluates the "Fusion of Multi-Camera Detections" and "Semantic-Driven Back-Projection" stages.
- State-of-the-art comparison results analyze the proposed method against several non-deep-learning state-of-the-art multi-camera pedestrian detectors on the same four scenarios used in the ablation studies. Additionally, the method is compared with recent deep-learning methods on the Wildtrack dataset.
4.4 Ablation studies
4.4.1 Evaluation criterion
4.4.2 Results
4.4.3 Discussion
4.5 State-of-the-art comparison
4.5.1 Evaluation criterion
4.5.2 State-of-the-art algorithms
- POM [14]: estimates the marginal probabilities of pedestrian presence at every location inside an \(\mathcal {AOI}\). It is based on a preliminary background-subtraction stage.
- POM-CNN [14]: an upgraded version of POM in which background subtraction is performed by an encoder–decoder CNN architecture.
- MvBN+HAP [29]: relies on a multi-view Bayesian network (MvBN) to obtain pedestrian locations on the ground plane. Detections are then refined by a height-adaptive projection (HAP) method based on an optimization framework similar to the one proposed in this paper, but driven by background-subtraction cues.
- RCNN-Projected [39]: the bottoms of bounding boxes obtained through per-camera CNN detectors are projected onto the ground plane, where 3D proximity is used to cluster detections (see the projection-and-clustering sketch after this list).
- Deep-Occlusion [3]: a hybrid method which combines a CNN trained on the Wildtrack dataset with a conditional random field (CRF) to incorporate information on the geometry and calibration of the scene.
- DeepMCD [7]: an end-to-end deep learning approach based on different architectures and training scenarios:
  - Pre-DeepMCD: a GoogLeNet [34] architecture trained on the PETS dataset.
  - Top-DeepMCD: a GoogLeNet [34] architecture trained on the Wildtrack dataset.
  - ResNet-DeepMCD: a ResNet-18 [18] architecture trained on the Wildtrack dataset.
  - DenseNet-DeepMCD: a DenseNet-121 [19] architecture trained on the Wildtrack dataset.
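For reference, here is a minimal sketch of RCNN-Projected-style fusion: bounding-box bottom centers are mapped onto the ground plane through a calibration homography and then greedily clustered by proximity. The homography and the 0.5 m radius are illustrative assumptions, not values from [39]:

```python
import numpy as np

def project_to_ground(bboxes: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Project bounding-box bottom-center points onto the ground plane.

    bboxes: (N, 4) array of [x1, y1, x2, y2] in image coordinates.
    H: (3, 3) image-to-ground homography from the camera calibration.
    """
    feet = np.stack([(bboxes[:, 0] + bboxes[:, 2]) / 2,   # bottom-center x
                     bboxes[:, 3],                         # bottom edge y
                     np.ones(len(bboxes))], axis=1)        # homogeneous 1s
    ground = feet @ H.T
    return ground[:, :2] / ground[:, 2:3]                  # dehomogenize -> (N, 2)

def cluster_by_proximity(points: np.ndarray, radius: float = 0.5) -> list:
    """Greedily merge ground-plane points closer than `radius` (in meters)."""
    clusters: list = []
    for p in points:
        for c in clusters:
            if np.linalg.norm(np.mean(c, axis=0) - p) < radius:
                c.append(p)
                break
        else:
            clusters.append([p])
    return [np.mean(c, axis=0) for c in clusters]  # one location per pedestrian
```

Projected this way, detections of the same pedestrian seen from different cameras fall close together on the ground plane and collapse into a single cluster.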
Comparison with the state of the art. Column groups, left to right: PETS CC, PETS S2 L1, EPFL Terrace and EPFL RLC; each group reports F-S, N-A and N-P.

Algorithm | F-S | N-A | N-P | F-S | N-A | N-P | F-S | N-A | N-P | F-S | N-A | N-P
---|---|---|---|---|---|---|---|---|---|---|---|---
Faster-RCNN [32] | 0.91 | 0.85 | 0.76 | 0.91 | 0.85 | 0.76 | 0.84 | 0.71 | 0.74 | 0.78 | 0.58 | 0.69
YOLOv3 [31] | 0.92 | 0.87 | 0.79 | 0.96 | 0.92 | 0.67 | 0.87 | 0.76 | 0.73 | 0.78 | 0.59 | 0.72
EfficientDet [35] | 0.97 | 0.94 | 0.66 | 0.97 | 0.94 | 0.67 | 0.84 | 0.71 | 0.78 | 0.80 | 0.61 | 0.72
POM [14] | - | 0.70 | 0.55 | - | 0.65 | 0.67 | - | 0.19 | 0.56 | - | - | -
MvBN + HAP [29] | - | 0.87 | 0.78 | - | 0.87 | 0.76 | - | 0.82 | 0.73 | - | - | -
Ours (Faster-RCNN) | 0.93 | 0.88 | 0.79 | 0.93 | 0.89 | 0.79 | 0.90 | 0.83 | 0.77 | 0.81 | 0.68 | 0.70
Ours (YOLOv3) | 0.94 | 0.88 | 0.79 | 0.92 | 0.89 | 0.67 | 0.89 | 0.85 | 0.76 | 0.83 | 0.68 | 0.72
Ours (EfficientDet) | 0.94 | 0.88 | 0.75 | 0.94 | 0.88 | 0.67 | 0.89 | 0.83 | 0.76 | 0.84 | 0.71 | 0.72
Comparison on the EPFL Wildtrack dataset.

Algorithm | Authors' \(\mathcal {AOI}\) | Fine-Tuned | F-Score | N-MODA | N-MODP
---|---|---|---|---|---
Deep-Occlusion [3] | \(\checkmark \) | \(\checkmark \) | 0.86 | 0.74 | 0.53
ResNet-DeepMCD [5] | \(\checkmark \) | \(\checkmark \) | 0.83 | 0.67 | 0.64
Ours* (EfficientDet) | \(\checkmark \) | | 0.81 | 0.65 | 0.63
DenseNet-DeepMCD [5] | \(\checkmark \) | \(\checkmark \) | 0.79 | 0.63 | 0.66
Top-DeepMCD [7] | \(\checkmark \) | \(\checkmark \) | 0.79 | 0.60 | 0.64
GMC-3D [22] | \(\checkmark \) | | 0.78 | 0.56 | 0.67
Ours (EfficientDet) | | | 0.74 | 0.48 | 0.63
Ours (YOLOv3) | | | 0.71 | 0.42 | 0.60
Ours (Faster-RCNN) | | | 0.69 | 0.39 | 0.55
Pre-DeepMCD [7] | \(\checkmark \) | | 0.51 | 0.33 | 0.52
POM-CNN [14] | \(\checkmark \) | | 0.63 | 0.23 | 0.30
RCNN-Projected [39] | \(\checkmark \) | | 0.52 | 0.11 | 0.18
4.5.3 Results
4.5.4 Discussion
- First, the qualitative results presented in Fig. 10 suggest that the results in Table 4 are highly biased by the authors' manually annotated area. The proposed method obtains a broader \(\mathcal {AOI}\) (Fig. 10, green area) than the one provided by the authors (Fig. 10, red area). Although the automatically obtained \(\mathcal {AOI}\) fits the actual ground area of the scene better than the manually annotated one, the performance of our method decreases because ground-truth data is reported only for the manually annotated area. Consequently, our true-positive detections outside this area are counted as false positives in the statistics (see Fig. 10, cameras 1 and 4; a worked example follows this list).
- Second, the top-performing methods learn their occlusion models and their ground-occupancy inference probabilistic models specifically on the Wildtrack scenario, using samples from the dataset. This training, like any fine-tuning procedure in deep neural networks, is highly effective, as indicated by the performance increase obtained when the same architecture is adapted to the Wildtrack scenario (compare the results of Pre-DeepMCD and Top-DeepMCD). However, such training requires human-annotated detections in each scenario, hindering the scalability of these solutions and their application to the real world. The proposed approach, in contrast, performs consistently without needing to be adapted to each target scenario reported in this paper.
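A worked example of the first point, using hypothetical numbers and the standard normalized MODA definition: suppose the evaluation contains \(g = 20\) annotated pedestrians, all correctly detected (\(fn = 0\)), plus 3 correct detections falling outside the manually annotated area, which are counted as false positives (\(fp = 3\)). Then

\[ \text{N-MODA} = 1 - \frac{fn + fp}{g} = 1 - \frac{0 + 3}{20} = 0.85, \]

even though the detector made no actual mistakes inside the broader, automatically computed \(\mathcal {AOI}\).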