Introduction
Sensors | Advantages | Disadvantages |
---|---|---|
Point sensors [3] | 1. Steady data sources for 24/7 monitoring | 1. Expensive and difficult for massive installation and maintenance |
VANET [4] | 1. No additional hardware required 2. Potential data sources cover a large-scale traffic network | 1. Limited accuracy when the penetration rate of PVs is low 2. Rare pilot studies have been conducted |
UAV [5] | 1. High flexibility and instant deployment 2. High-fidelity data sources | 1. Challenging for long-duration estimation over a large perspective 2. Expensive for massive deployment |
Traffic surveillance cameras (this paper) | 1. Widespread in many cities 2. Steady data sources for 24/7 monitoring | 1. Low data quality leading to potentially inaccurate results 2. Owing to privacy concerns, sometimes only images (and not videos) can be acquired |
- Camera calibration: aims to estimate the road length L from camera images, in which the core problem is to measure the distance between the real-world coordinates corresponding to the image pixels.
- Vehicle detection: focuses on counting the vehicle number N, and it can be formulated as an object detection problem.
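To make the geometry concrete, here is a minimal sketch (not the paper's implementation) of how a calibrated camera turns pixel measurements into real-world distances: assuming a planar road, a 3×3 homography maps image pixels to ground-plane coordinates, and density then follows as k = N / L. The matrix `H` below is hypothetical.

```python
# Minimal sketch: pixel-to-ground mapping with a homography (hypothetical H),
# then traffic density k = N / L. Not the paper's implementation.

def apply_homography(H, pixel):
    """Map an image pixel (u, v) to ground-plane coordinates (x, y)."""
    u, v = pixel
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w  # perspective division

def ground_distance(H, p1, p2):
    """Real-world distance (meters) between two pixels on the road plane."""
    (x1, y1), (x2, y2) = apply_homography(H, p1), apply_homography(H, p2)
    return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5

# Hypothetical homography for an already-calibrated camera (pixels -> meters).
H = [[0.05, 0.0, -10.0],
     [0.0, 0.08, -5.0],
     [0.0, 0.0, 1.0]]

L = ground_distance(H, (200, 100), (200, 300))  # road length in meters
N = 8                                           # detected vehicle count
density = N / (L / 1000)                        # vehicles per kilometer
```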
A multi-vehicle camera calibration method (MVCalib) is developed to utilize the key point information of multiple vehicles simultaneously. The actual road length can be estimated from the pixel distance in images once the camera is calibrated. For vehicle detection, we develop a linear-program-based approach that hybridizes various public vehicle datasets, originally built for different purposes, to balance images taken during daytime and nighttime under various conditions. A deep learning network (YOLO-v5) is trained on the proposed hybrid dataset. The trained network achieves decent detection accuracy during both daytime and nighttime in various surveillance camera systems without extra training on those cameras, which spares the effort of annotating labels on images.

The contributions of this paper are summarized as follows:

- It provides a holistic framework for 24/7 traffic density estimation using traffic surveillance cameras with 4L characteristics: Low frame rate, Low resolution, Lack of annotated data, and Located in complex road environments.
- It develops, for the first time, a robust multi-vehicle camera calibration method MVCalib that collectively utilizes the spatial relationships among key points from multiple vehicles. The proposed method can be used to calibrate surveillance cameras under the 4L characteristics.
- It systematically designs a linear-program-based data mixing strategy to synergize image datasets from different cameras and to balance the performance of the deep-learning-based vehicle detection models under different traffic scenarios.
- It validates the proposed framework in two traffic surveillance camera systems in Hong Kong and Sacramento, and the research outcomes create portals for rapid and massive deployment of the proposed framework in different cities.
Methods
The overall framework
Camera calibration
This section introduces the proposed camera calibration method MVCalib. The background of camera calibration is first reviewed, and the proposed camera calibration model is then elaborated.

Overview of camera calibration problems

The MVCalib method

This subsection elaborates the proposed method MVCalib. The pipeline of \(\texttt {MVCalib}\) is shown in Fig. 3.
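As a rough illustration of the underlying idea (a simplified stand-in, not MVCalib itself): calibration can be cast as searching for camera parameters that minimize the reprojection error between projected 3D vehicle key points and their annotated pixels. The toy sketch below uses a pinhole model and a 1-D grid search over the focal length; the paper instead optimizes the full parameter set in stages with CMA-ES. All key points and the "true" focal length are synthetic.

```python
# Toy pinhole model: project 3D key points and search for the focal length
# that minimizes reprojection error. A simplified stand-in for the staged
# CMA-ES optimization described in the paper, not the actual MVCalib code.

def project(point3d, f, cx=160.0, cy=120.0):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    X, Y, Z = point3d
    return (f * X / Z + cx, f * Y / Z + cy)

def reprojection_loss(f, points3d, pixels):
    """Sum of squared pixel errors between projections and annotations."""
    loss = 0.0
    for p3, (u, v) in zip(points3d, pixels):
        pu, pv = project(p3, f)
        loss += (pu - u) ** 2 + (pv - v) ** 2
    return loss

# Hypothetical vehicle key points (meters, camera frame) and their pixel
# annotations, synthesized with a "true" focal length of 350 px.
points3d = [(1.0, 0.5, 10.0), (-1.0, 0.5, 12.0), (0.5, -0.3, 8.0)]
pixels = [project(p, 350.0) for p in points3d]

# Grid search over candidate focal lengths (stand-in for CMA-ES).
best_f = min((f * 0.5 for f in range(200, 1200)),
             key=lambda f: reprojection_loss(f, points3d, pixels))
```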
Vehicle detection
Name | Size | Resolution | Camera angle | Original usage |
---|---|---|---|---|
BDD100K | 100,000 | \(1280 \times 720\) | Front | Autonomous driving |
BIT Vehicle | 9,850 | Multiple | Inclined top | Vehicle reID |
CityCam | 60,000 | \(352 \times 240\) | Inclined top | Vehicle detection |
COCO | 17,684 | Multiple | Multiple | Object detection & segmentation |
MIO-TCD-L | 137,743 | \(720 \times 480\) | Inclined top | Vehicle detection & classification |
UA-DETRAC | 138,252 | \(960 \times 540\) | Inclined top | Vehicle detection |
Numerical experiments
Experimental settings
The proposed framework is evaluated on surveillance cameras in Hong Kong (HK) and Sacramento, California (Sac), where the ground-truth data can be obtained at both sites. A comparison of these two cameras is shown in Table 3.

Attributes | HK | Sac |
---|---|---|
Resolution | \(320 \times 240\) pixels | \(720 \times 480\) pixels |
Update rate | 2 min | 1/30 s |
Orientation | Vehicle head | Vehicle tail |
Road type | Urban road | Highway |
Speed limit | 50 km/h | 105.3 km/h |
- HK: Camera images in Hong Kong are obtained from HKeMobility3 at Chatham Road South, Kowloon, Hong Kong SAR, with the camera code K109F. Images containing seven vehicles are selected from June 22nd to June 25th, 2020.
- Sac: Camera images in Sacramento are obtained from the Caltrans system4 at the Capital City Freeway at E Street, Sacramento, CA, US. Images containing seven vehicles are selected from February 17th to December 18th, 2022.
Experimental results
Camera calibration
We calibrate the surveillance cameras in HK and Sac. Based on the calibration results, we estimate the road length from the camera images, and the length estimated by each model is compared with the actual length.
For MVCalib, Fig. 6 plots the fine-tuning loss defined in Eq. (9) for the three stages: candidate generation, vehicle model matching, and parameter fine-tuning. In particular, Fig. 6a includes the losses of all the vehicle index and vehicle model pairs for the first two stages, and Fig. 6b plots the loss based on the matched vehicle model with the minimal fine-tuning loss. One can see that the fine-tuning loss defined in Eq. (8) decreases after each stage, which indicates that CMA-ES successfully reduces the loss in each stage.

The calibration results for the HK and Sac studies are shown in Fig. 7. For Sac, we likewise use the actual lengths of the lane markings on the Capital City Freeway as the ground truth. According to the Manual on Uniform Traffic Control Devices (MUTCD) [57], the length of a white line is 10 feet (approximately 3.05 m) and the interval is 30 feet (approximately 9.14 m), as shown in Fig. 7c. On the camera images, we annotate 14 points resulting in 12 line segments (shown in Fig. 7d), each spanning 40 feet (approximately 12.19 m).

Calibration results for HK and Sac (unit for RMSE and MAE: meter)

Method | HK RMSE | HK MAE | HK MAPE | Sac RMSE | Sac MAE | Sac MAPE |
---|---|---|---|---|---|---|
UPnP | 25.80 | 22.21 | 370.03% | 9.66 | 8.86 | 72.66% |
UPnP+GN | 2.02 | 0.62 | 10.36% | 9.40 | 9.34 | 76.67% |
GPnP | 3.14 | 2.76 | 46.15% | 2.12 | 1.71 | 14.09% |
GPnP+GN | 2.24 | 1.98 | 33.15% | 2.03 | 1.64 | 13.48% |
MVCalib CG | 1.68 | 1.49 | 24.91% | 4.48 | 4.46 | 36.61% |
MVCalib VM | 0.98 | 0.77 | 12.83% | 2.27 | 1.88 | 15.45% |
MVCalib | 0.49 | 0.43 | 7.22% | 1.28 | 1.05 | 8.62% |
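For reference, the three error metrics reported above can be computed with their standard definitions (the example lengths below are hypothetical, not values from the paper):

```python
# Standard definitions of RMSE, MAE, and MAPE for benchmarking estimated
# lengths y_hat against ground-truth lengths y.

def rmse(y, y_hat):
    """Root Mean Square Error."""
    return (sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)) ** 0.5

def mae(y, y_hat):
    """Mean Absolute Error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def mape(y, y_hat):
    """Mean Absolute Percentage Error (as a fraction)."""
    return sum(abs(a - b) / abs(a) for a, b in zip(y, y_hat)) / len(y)

# Example: four six-meter road markings vs. hypothetical estimates.
y_true = [6.0, 6.0, 6.0, 6.0]
y_est = [5.5, 6.4, 6.2, 5.8]
```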
The baseline methods include EPnP [27], UPnP, UPnP+GN (UPnP fine-tuned with the Gauss-Newton method) [31], GPnP, and GPnP+GN (GPnP fine-tuned with the Gauss-Newton method) [32]. The calibration results are shown in Table 4. The road-marking lengths estimated from the camera images are compared with the actual lengths, and three metrics are employed for the benchmark comparison: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE). The calculation of MAE, RMSE, and MAPE is shown in Eq. (13).

To verify the effectiveness of MVCalib, we compare its results with the baseline methods in terms of their ability to solve the PnPf problem. To conduct an ablation study gauging the contribution of each stage, we run MVCalib with only the first stage (candidate generation), with the first two stages (up to vehicle model matching), and with all three stages. The three models are referred to as MVCalib CG, MVCalib VM, and MVCalib, respectively. In fact, MVCalib CG is equivalent to the EPnP method.

UPnP+GN and GPnP+GN yield unsatisfactory solutions owing to the low image quality. Because they take the focal length into account, the complexity of the problem is significantly increased; hence they require high-resolution images and more numerous and accurate annotation points.

We compare MVCalib CG, MVCalib VM, and MVCalib to evaluate the contribution of each stage. In the vehicle model matching stage, if we optimize the focal length together with the other parameters, the estimation results are greatly improved relative to MVCalib CG, demonstrating that the estimation of the focal length is necessary and important for calibrating traffic surveillance cameras. In the full MVCalib, we also incorporate the joint information of multiple vehicles under the same camera. MVCalib achieves the best result among all models. For the surveillance camera in HK, the average error is only approximately 40 cm when estimating the six-meter road markings, less than 10% in MAPE, while in Sac, the average error is about 1 m for the forty-foot road markings, also less than 10% in MAPE.

MVCalib outperforms the other models in terms of all three metrics, which means that the calibration results are close to the ground truth. Snapshots of the calibration results of the surveillance cameras in HK and Sac are shown in Fig. 8, where the distance between any two red dots is one meter.

Dataset | # images in daytime | # images at nighttime | Total # images |
---|---|---|---|
BDD100K | 8319 | 8398 | 16,717 |
BITVehicle | 7325 | 0 | 7325 |
CityCam | 8459 | 0 | 8459 |
COCO | 7111 | 7,619 | 14,730 |
MIO-TCD-L | 8892 | 7413 | 16,305 |
UA-DETRAC | 7955 | 5407 | 13,362 |
Total | 48,061 | 28,837 | 76,898 |
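The linear-program-based mixing can be illustrated with one plausible formulation (illustrative only; the symbols \(x_d\), \(p_d\), \(n_d\), \(B\), and \(t\) are not taken from the paper): let \(x_d\) be the number of images drawn from dataset \(d\), \(p_d\) its daytime fraction, \(n_d\) its size, and \(B\) the total image budget. Balancing daytime and nighttime images under the budget can then be written as

\[ \begin{aligned} \min_{x,\,t}\quad & t \\ \text{s.t.}\quad & -t \le \sum_{d} p_d x_d - \sum_{d} (1 - p_d) x_d \le t, \\ & \sum_{d} x_d \le B, \qquad 0 \le x_d \le n_d \quad \forall d, \end{aligned} \]

where the auxiliary variable \(t\) linearizes the absolute day-night imbalance. The availability bounds \(0 \le x_d \le n_d\) explain why the resulting hybrid set need not be perfectly balanced (several source datasets contain no nighttime images at all); the paper's actual strategy may include further constraints, so this should be read as a sketch.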
Name | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 | Dataset size |
---|---|---|---|---|---|
BDD-100K | 0.361 | 0.364 | 0.326 | 0.144 | 100,000 |
BITVehicle | 0.255 | 0.009 | 0.062 | 0.035 | 9,850 |
CityCam | 0.412 | 0.938 | 0.881 | 0.538 | 60,000 |
COCO | 0.978 | 0.017 | 0.556 | 0.340 | 17,684 |
MIO-TCD-L | 0.737 | 0.885 | 0.899 | 0.578 | 137,743 |
Pretrained | 0.455 | 0.899 | 0.838 | 0.552 | 0 |
UA-DETRAC | 0.775 | 0.693 | 0.758 | 0.488 | 138,252 |
Spaghetti | 0.605 | 0.948 | 0.927 | 0.608 | 434,993 |
Random | 0.588 | 0.942 | 0.919 | 0.595 | 76,898 |
LP hybrid | 0.583 | 0.949 | 0.921 | 0.594 | 76,898 |
Further results of MVCalib are presented in Appendix B, and the choice of \(\tau \) is discussed in Appendix D.

Vehicle detection
Name | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 | Dataset size |
---|---|---|---|---|---|
BDD-100K | 0.443 | 0.316 | 0.302 | 0.124 | 100,000 |
BITVehicle | 0.058 | 0.001 | 0.018 | 0.010 | 9,850 |
CityCam | 0.402 | 0.793 | 0.713 | 0.412 | 60,000 |
COCO | 0.949 | 0.003 | 0.397 | 0.223 | 17,684 |
MIO-TCD-L | 0.805 | 0.746 | 0.817 | 0.511 | 137,743 |
Pretrained | 0.387 | 0.862 | 0.781 | 0.471 | 0 |
UA-DETRAC | 0.708 | 0.573 | 0.629 | 0.365 | 138,252 |
Spaghetti | 0.689 | 0.872 | 0.882 | 0.546 | 434,993 |
Random | 0.674 | 0.864 | 0.871 | 0.536 | 76,898 |
LP hybrid | 0.653 | 0.89 | 0.886 | 0.545 | 76,898 |
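As a reminder of how the detection metrics in the tables above are obtained, the sketch below computes precision and recall from IoU-based matching of predicted boxes to ground truths (mAP additionally averages precision over recall levels and, for mAP@0.5:0.95, over IoU thresholds). The boxes are hypothetical, and real evaluators also rank predictions by confidence.

```python
# Minimal sketch of IoU-based matching for detection precision/recall.
# Boxes are (x1, y1, x2, y2); all values below are hypothetical.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(preds, gts, thr=0.5):
    """Greedy one-to-one matching of predictions to ground truths."""
    matched, tp = set(), 0
    for p in preds:
        for i, g in enumerate(gts):
            if i not in matched and iou(p, g) >= thr:
                matched.add(i)
                tp += 1
                break
    return tp / len(preds), tp / len(gts)

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]     # ground-truth vehicles
preds = [(1, 1, 11, 11), (50, 50, 60, 60)]   # one hit, one false alarm
prec, rec = precision_recall(preds, gts)
```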
Case study I: surveillance cameras in Hong Kong
Density estimation errors for HK (unit for RMSE and MAE: veh/km/lane)

Lane ID | RMSE | MAE | MAPE |
---|---|---|---|
Lane #1 | 16.94 | 12.65 | 19.60% |
Lane #2 | 12.98 | 9.23 | 27.48% |
Lane #3 | 11.11 | 7.77 | 41.24% |
Lane #4 | 8.70 | 6.53 | 50.44% |
Average | 12.43 | 9.04 | 34.69% |
Case study II: surveillance cameras in Sacramento
Similar to HK, key points on vehicles are annotated manually for camera calibration. The ground-truth density data are obtained from a double-loop detector at the same location (shown in Fig. 12, center) within the same time period. The detector data are obtained from the PeMS system, which includes the average traffic speed, density, and flow data. Given the study region, we can also divide the roads into three lanes (numbered along the x-axis) and define vehicle locations along the y-axis, as shown in Fig. 12, right.

Density estimation errors for Sac (unit for RMSE and MAE: veh/km/lane)

Lane ID | Transition RMSE | Transition MAE | Transition MAPE | Non-transition RMSE | Non-transition MAE | Non-transition MAPE |
---|---|---|---|---|---|---|
Lane #1 | 17.38 | 11.60 | 17.80% | 8.84 | 5.76 | 24.53% |
Lane #2 | 16.62 | 12.04 | 19.23% | 9.75 | 5.85 | 16.48% |
Lane #3 | 31.33 | 22.62 | 24.85% | 14.03 | 7.44 | 14.23% |
Average | 21.78 | 15.42 | 20.63% | 10.87 | 6.35 | 18.31% |
Conclusions
A multi-vehicle camera calibration method, MVCalib, is developed to estimate the actual length of roads from camera images. For vehicle detection, a transfer learning scheme is adopted to fine-tune the parameters of the deep-learning-based model. A linear-program-based data mixing strategy that incorporates multiple datasets is proposed to synergize the performance of the vehicle detection model in different traffic scenarios.