Open Access 03.01.2023 | Original Article

CenterLoc3D: monocular 3D vehicle localization network for roadside surveillance cameras

Authors: Xinyao Tang, Wei Wang, Huansheng Song, Chunhui Zhao

Published in: Complex & Intelligent Systems | Issue 4/2023


Abstract

Monocular 3D vehicle localization is an important task for vehicle behaviour analysis, traffic flow parameter estimation and autonomous driving in Intelligent Transportation System (ITS) and Cooperative Vehicle Infrastructure System (CVIS), which is usually achieved by monocular 3D vehicle detection. However, monocular cameras cannot obtain depth information directly due to the inherent imaging mechanism, resulting in more challenging monocular 3D tasks. Currently, most of the monocular 3D vehicle detection methods still rely on 2D detectors and additional geometric constraint modules to recover 3D vehicle information, which reduces the efficiency. At the same time, most of the research is based on datasets of onboard scenes, instead of roadside perspective, which is limited in large-scale 3D perception. Therefore, we focus on 3D vehicle detection without 2D detectors in roadside scenes. We propose a 3D vehicle localization network CenterLoc3D for roadside monocular cameras, which directly predicts centroid and eight vertexes in image space, and the dimension of 3D bounding boxes without 2D detectors. To improve the precision of 3D vehicle localization, we propose a multi-scale weighted-fusion module and a loss with spatial constraints embedded in CenterLoc3D. Firstly, the transformation matrix between 2D image space and 3D world space is solved by camera calibration. Secondly, vehicle type, centroid, eight vertexes, and the dimension of 3D vehicle bounding boxes are obtained by CenterLoc3D. Finally, centroid in 3D world space can be obtained by camera calibration and CenterLoc3D for 3D vehicle localization. To the best of our knowledge, this is the first application of 3D vehicle localization for roadside monocular cameras. Hence, we also propose a benchmark for this application including a dataset (SVLD-3D), an annotation tool (LabelImg-3D), and evaluation metrics. Through experimental validation, the proposed method achieves high accuracy with \(A{P_{3D}}\) of 51.30%, average 3D localization precision of 98%, average 3D dimension precision of 85% and real-time performance with FPS of 41.18.

Introduction

In recent years, 3D vehicle localization has become an essential component of CVIS and ITS. Vehicle behaviours [1, 2] and traffic flow statistics [3] can be analyzed and used for traffic state estimation [4] and traffic management [5], which is of great research significance and practical value. In practical applications, obtaining vehicle location and dimension in 3D space is more important than vehicle information in 2D images. Different from sensors on autonomous vehicles, roadside sensors are usually installed at higher positions and can therefore observe the full road easily. In CVIS, estimated 3D vehicle locations can be sent to each vehicle for accurate path planning to avoid collisions.
The most common sensors currently used for 3D vehicle localization are lidar [6–9] and stereo vision systems represented by binocular and RGB-D depth cameras [10–12]. They provide point cloud data from which accurate 3D locations can be obtained directly. However, they are usually expensive and environmentally demanding. In contrast, monocular RGB cameras [13–18] are cost-effective due to widespread deployment and fast data processing speed, which makes them more suitable for large-scale traffic scenes. Currently, monocular 3D vehicle localization is achieved by monocular 3D vehicle detection methods, which focus on the onboard view. In contrast, the roadside view is fixed and offers more geometric priors. At the same time, the roadside view is higher and wider than the onboard one, which is more suitable for large-area perception. However, due to missing depth information, challenges like occlusion and congestion still exist in monocular roadside surveillance scenes.
Monocular 3D vehicle detection methods can be divided into two categories: (1) geometric constraint-based methods [19–22] and (2) 3D feature estimation-based methods [23–28]. Methods of the first category are built on mature 2D detectors and represent vehicles with 3D bounding boxes or CAD models. The 3D vehicle models are then solved by enforcing the constraint that 3D bounding boxes fit closely to 2D boxes. However, this geometric constraint is not enough to produce unambiguous results, and the additional sub-networks reduce processing efficiency. In the second category, 3D vehicle information such as keypoints, orientation, and dimension is extracted directly without 2D detectors. Due to missing depth information in monocular images, additional geometric inference modules are needed, which also increases processing time to some extent.
To solve the above problems, we propose a monocular 3D vehicle localization method for surveillance cameras in traffic scenes. First, the transformation matrix between 2D image space and 3D world space is solved by camera calibration, which is further used for 3D vehicle localization. Second, a one-stage 3D vehicle localization network, CenterLoc3D, is proposed, which contains three modules: backbone, multi-scale feature fusion, and multi-task detection head. In multi-scale feature fusion, we propose a weighted-fusion module that fuses five feature maps containing multi-scale information with different weights for multi-scale vehicle detection. In the multi-task detection head, the fused multi-scale feature map is used as input to produce four output branches: vehicle type, centroid, eight vertexes, and the dimension of 3D bounding boxes. To improve the precision of 3D vehicle localization without sacrificing efficiency, a loss with spatial constraints embedding is proposed, including a reprojection constraint of the 2D-3D transformation obtained by camera calibration and an IoU constraint of the 3D box projection. Finally, we also propose a benchmark including a dataset, an annotation tool, and evaluation metrics for 3D vehicle localization in roadside monocular traffic scenes. Experimental validation proves that our method is efficient and robust.
The main contributions of this paper are summarized as follows:
  • A monocular 3D vehicle localization network CenterLoc3D for roadside surveillance cameras in traffic scenes is proposed, which directly predicts accurate 3D vehicle projection vertexes and dimensions.
  • A weighted-fusion module is proposed in multi-scale feature fusion, which further enhances feature extraction capability.
  • A loss with spatial constraints embedding is proposed, which can effectively improve the accuracy of 3D vehicle localization.
  • A benchmark including a dataset, an annotation tool, and evaluation metrics is proposed for experimental validation, which is helpful for the development of monocular 3D vehicle localization in roadside monocular traffic scenes.

Related work

Deep learning-based monocular 3D vehicle detection methods can be divided into two categories: geometric constraint-based and 3D feature estimation-based.
Geometric constraint-based methods. Due to the continuous development of convolutional neural networks (CNNs), many excellent 2D object detection methods [29–33] have emerged. Some methods [34, 35] directly apply the region proposal scheme of 2D object detection to 3D object detection. Mono3D [34] introduces sliding windows for 3D vehicle detection. First, a series of candidate bounding boxes in 3D space are generated and projected onto the image plane by camera calibration. Second, prior information such as vehicle segmentation contours is combined to further obtain vehicle areas and 3D bounding boxes. The best results are finally selected by non-maximum suppression (NMS). However, the search range in 3D space is much larger than that in 2D space, and obtaining prior information is time-consuming, which greatly reduces efficiency. Currently, 2D object detection methods are still used in most 3D vehicle detection methods [13, 14, 19–22] to obtain vehicle location, dimension and orientation. To improve efficiency, generating candidate boxes in 3D space is replaced by geometric constraints that 3D bounding boxes fit closely to 2D boxes. Deep3DBox [19] obtains 3D vehicle bounding boxes and orientation by constructing 2D-3D box constraints and a multi-bin loss function. Deep MANTA [20], a multi-task network, uses 3D CAD models and a region proposal network (RPN) to obtain vehicle type, 2D bounding boxes, location, visibility, and similarity to CAD models, which is used to select the best 3D CAD model. CubeSLAM [21] obtains three orthogonal vanishing points by extracting straight line segments of vehicles and constructs 3D vehicle bounding boxes from geometric constraints between vanishing points and 2D bounding boxes. GS3D [22] is able to correct 3D bounding boxes and contains two sub-networks: a 2D+O and a 3D sub-network. The 2D+O sub-network adds a vehicle orientation regression branch to Faster R-CNN [29], which is used to obtain 2D vehicle bounding boxes and orientation simultaneously. With camera calibration and the above information, a coarse 3D bounding box called guidance can be obtained. The 3D sub-network extracts visible 3D features of the guidance and performs a correction to obtain more accurate detection results. The above geometric constraint-based methods usually require 2D bounding boxes and time-consuming prior information extraction. The constraints that 2D bounding boxes can provide are limited, which leads to ambiguous results. At the same time, sub-networks additionally increase the processing time.
3D feature estimation-based methods. To avoid ambiguous results from geometric constraints and further improve accuracy and efficiency, many new methods [15, 17, 18, 23–28] focus on direct 3D information extraction in images, which can be used for 3D bounding box regression by CNNs. Mono3DBox [23] proposes an end-to-end detector that directly predicts the center point of the vehicle bottom in the image. The point is then converted into 3D space by a look-up table to realize 3D vehicle detection and localization. MonoGRK [24] proposes an end-to-end keypoint-based 3D vehicle detection and localization network, using ResNet-101 [36] as the backbone. The detection head contains three sub-networks: 2D vehicle detection, 2D keypoint regression, and 3D vehicle dimension regression. Besides, 3D CAD models are also used, as in Deep MANTA [20], but only 14 keypoints are labelled for each model, which describes the CAD model coarsely. 2D and 3D space are linked in a geometric inference module, which enables end-to-end training of the model. Transformer3D [26] proposes a roadside 3D vehicle detection method using CNNs. Automatic camera calibration [37] is first used to obtain three orthogonal vanishing points and scale factors for perspective transformation. Then, the transformed image is fed into RetinaNet [32] to obtain 2D bounding boxes and conversion parameters for 3D boxes. Finally, 3D bounding boxes are recovered from the camera calibration results and the conversion parameters for speed measurement. The pipeline is simple, but 3D bounding boxes are not directly evaluated in the experimental results. RTM3D [25] views 3D vehicle detection as keypoint detection without leveraging 2D detection results. The network is constructed based on ResNet-18 [36] and DLA-34 [38] and directly predicts the center points and eight vertexes of 3D bounding boxes in the image. Then, vehicle location, dimension, and orientation can be solved by minimizing the reprojection error of 2D-3D bounding boxes. Lite-FPN [27] proposes a light-weight keypoint-based feature pyramid network, which contains three parts: backbone, detection head and post-processing. A top-k operation is used in the detection head to connect the keypoint and regression sub-networks. An attention module is added to the keypoint detection loss to effectively solve the mismatch between keypoint confidence and location. KM3D [28] proposes a 3D vehicle detection network that contains keypoint detection and geometric inference modules. Outputs include center point, eight vertexes, orientation, dimension, and confidence. In the network, keypoints are used to further guide vehicle orientation regression, and a 3D IoU strategy is used to help confidence branch training. The above methods based on 3D feature estimation improve the accuracy of 3D vehicle detection by designing a geometric inference module in the network. However, this module increases model complexity and inference time.
Among the reviewed literature, geometric constraint-based methods leverage 2D detectors and additional geometric constraint modules to obtain 3D vehicle information, which reduces the efficiency. 3D feature estimation-based methods embed geometric inference modules into networks without 2D detectors, which increases the complexity. Moreover, almost all the reviewed research uses datasets of onboard scenes, instead of roadside perspective, which is limited in large-scale 3D perception.
In this paper, we focus on an efficient 3D vehicle detection structure for roadside surveillance cameras to directly obtain 3D vehicle information without 2D detectors and geometry inference modules, to tackle the above-mentioned research gaps and weaknesses.

CenterLoc3D for 3D vehicle localization

Framework

The overall framework of the proposed method is shown in Fig. 1, which consists of two components: camera calibration and 3D vehicle localization. First, the transformation matrix between 2D image space and 3D world space is solved by camera calibration. Second, vehicle type, centroid, eight vertexes, and dimensions of 3D bounding boxes are obtained by CenterLoc3D. Finally, 3D vehicle centroids can be obtained by camera calibration for 3D vehicle localization.

Camera calibration

To complete 3D vehicle localization in traffic scenes, the transformation matrix between 2D image space and 3D world space must be derived through camera calibration. We refer to the previous work [39, 40] to define coordinate systems, establish the camera calibration model, and choose the single vanishing point-based calibration method VWL (One Vanishing Point, Known Width and Length) [39] to complete camera calibration.
A schematic diagram of the coordinate systems and the camera calibration model is shown in Fig. 2. Three coordinate systems are defined, all of which are right-handed. The world coordinate system is defined by the x, y, z axes. The origin \({O_w}\) is located at the projection point of the camera on the road plane, and z is perpendicular to the road plane pointing upwards. The camera coordinate system is defined by the \({x_{\textrm{c}}}, {y_{\textrm{c}}}, {z_{\textrm{c}}}\) axes. The origin \({O_c}\) is located at the camera optical center. \({x_{\textrm{c}}}\) is parallel to x. \({z_{\textrm{c}}}\) points to the ground along the camera optical axis. \({y_{\textrm{c}}}\) is perpendicular to the plane \({x_c}{O_c}{z_c}\). The image coordinate system is defined by the u, v axes. The origin \({O_i}\) is located at the image center. In the image coordinate system, u points horizontally right and v vertically downward. \({z_{\textrm{c}}}\) intersects the road plane at \(r = ({c_x},{c_y})\) in the image coordinate system, which is called the principal point; its default location is the image center. \({c_x}\) and \({c_y}\) represent half of the image width and height, respectively.
In addition to the above parameters, another parameter is the roll angle, which can be expressed by a simple image rotation and has no effect on the calibration results; therefore, it is not considered. In this paper, VWL and road marks [40] are used to solve and optimize the calibration parameters \(f,h,\phi ,\theta \). The vanishing point \(VP = ({u_0},{v_0})\) along the direction of traffic flow is extracted from the road edge lines. To illustrate the road space in a straightforward way, \({O_i}\) and the y axis are adjusted. First, \({O_i}\) is moved to the upper left corner of the image, corresponding to the intrinsic parameter matrix K:
$$\begin{aligned} K = \left[ {\begin{array}{*{20}{c}} f&{}\quad 0&{}\quad {{c_x}}\\ 0&{}\quad f&{}\quad {{c_y}}\\ 0&{}\quad 0&{}\quad 1 \end{array}} \right] \end{aligned}$$
(1)
Then, the y axis is adjusted to the direction along the traffic flow. The rotation matrix R consists of a rotation of \(\phi + {\pi / 2}\) about the x axis and \(\theta \) about the z axis, which can be defined as:
$$\begin{aligned} \begin{aligned} R&= {R_x}(\phi + {\pi / 2}){R_z}(\theta )\\&= \left[ {\begin{array}{*{20}{r}} {\cos \theta }&{}{ - \sin \theta }&{}0\\ { - \sin \phi \sin \theta }&{}{ - \sin \phi \cos \theta }&{}{ - \cos \phi }\\ {\cos \phi \sin \theta }&{}{\cos \phi \cos \theta }&{}{ - \sin \phi } \end{array}} \right] \end{aligned} \end{aligned}$$
(2)
The translation matrix T is:
$$\begin{aligned} T = \left[ {\begin{array}{*{20}{c}} 1&{}\quad 0&{}\quad 0&{}\quad 0\\ 0&{}\quad 1&{}\quad 0&{}\quad 0\\ 0&{}\quad 0&{}\quad 1&{}\quad { - h} \end{array}} \right] \end{aligned}$$
(3)
The adjusted transformation formula from a world point \((x, y, z)\) to an image point \((u, v)\) is:
$$\begin{aligned} s\left[ {\begin{array}{*{20}{c}} u\\ v\\ 1 \end{array}} \right] = KRT\left[ {\begin{array}{*{20}{c}} x\\ y\\ z\\ 1 \end{array}} \right] = H\left[ {\begin{array}{*{20}{c}} x\\ y\\ z\\ 1 \end{array}} \right] \end{aligned}$$
(4)
where \(H = \left[ {{h_{ij}}} \right] ,i = 1,2,3;j = 1,2,3,4\) is the \(3 \times 4\) projection matrix from world to image coordinate. s is the scale factor.
Finally, according to the above derivation, the adjusted transformation between the world and image can be described as follows:
$$\begin{aligned} \left\{ \begin{array}{l} u = \dfrac{h_{11}x + h_{12}y + h_{13}z + h_{14}}{h_{31}x + h_{32}y + h_{33}z + h_{34}}\\[2ex] v = \dfrac{h_{21}x + h_{22}y + h_{23}z + h_{24}}{h_{31}x + h_{32}y + h_{33}z + h_{34}} \end{array} \right. \end{aligned}$$
(5)
$$\begin{aligned} \left\{ \begin{array}{l} x = \dfrac{b_1(h_{22} - h_{32}v) - b_2(h_{12} - h_{32}u)}{(h_{11} - h_{31}u)(h_{22} - h_{32}v) - (h_{12} - h_{32}u)(h_{21} - h_{31}v)}\\[2ex] y = \dfrac{-b_1(h_{21} - h_{31}v) + b_2(h_{11} - h_{31}u)}{(h_{11} - h_{31}u)(h_{22} - h_{32}v) - (h_{12} - h_{32}u)(h_{21} - h_{31}v)} \end{array} \right. \end{aligned}$$
(6)
where \(\left\{ \begin{array}{lll} {b_1} = u({h_{33}}z + {h_{34}}) - ({h_{13}}z + {h_{14}})\\ {b_2} = v({h_{33}}z + {h_{34}}) - ({h_{23}}z + {h_{24}}) \end{array} \right. \).
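For readers who want to reproduce the 2D-3D transformation, the following Python sketch assembles \(H = KRT\) from Eqs. 1–4 and applies Eqs. 5 and 6; the function names and the example values (the Scene A parameters from Table 2 with the principal point of a 1920 × 1080 image) are illustrative assumptions, not part of the released code, and consistent world units are assumed (h in Table 2 is given in millimetres).

```python
# Minimal sketch of the calibration-based 2D<->3D transform (Eqs. 1-6).
import numpy as np

def projection_matrix(f, cx, cy, phi, theta, h):
    K = np.array([[f, 0, cx],
                  [0, f, cy],
                  [0, 0, 1.0]])                                   # Eq. 1
    R = np.array([                                                # Eq. 2
        [np.cos(theta),                -np.sin(theta),                0.0],
        [-np.sin(phi) * np.sin(theta), -np.sin(phi) * np.cos(theta), -np.cos(phi)],
        [ np.cos(phi) * np.sin(theta),  np.cos(phi) * np.cos(theta), -np.sin(phi)],
    ])
    T = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 1, -h]], dtype=float)                    # Eq. 3
    return K @ R @ T                                              # 3x4 matrix H (Eq. 4)

def world_to_image(H, x, y, z):
    u, v, s = H @ np.array([x, y, z, 1.0])
    return u / s, v / s                                           # Eq. 5

def image_to_world(H, u, v, z):
    # Eq. 6: recover (x, y) on a plane of known height z.
    b1 = u * (H[2, 2] * z + H[2, 3]) - (H[0, 2] * z + H[0, 3])
    b2 = v * (H[2, 2] * z + H[2, 3]) - (H[1, 2] * z + H[1, 3])
    a11, a12 = H[0, 0] - H[2, 0] * u, H[0, 1] - H[2, 1] * u
    a21, a22 = H[1, 0] - H[2, 0] * v, H[1, 1] - H[2, 1] * v
    det = a11 * a22 - a12 * a21
    return (b1 * a22 - b2 * a12) / det, (-b1 * a21 + b2 * a11) / det

# Example (Scene A from Table 2; world units in millimetres because h is in mm).
H = projection_matrix(f=2878.13, cx=960, cy=540, phi=0.17874, theta=0.26604, h=10119.08)
u, v = world_to_image(H, 0.0, 50000.0, 0.0)   # point 50 m down the road on the ground plane
```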

CenterLoc3D

In the roadside camera view, 3D vehicle localization can be described in terms of 3D vehicle detection. We propose a 3D vehicle localization network CenterLoc3D for roadside cameras, which uses a single RGB image as the input and the outputs include vehicle type, centroid, vertexes, and dimensions of 3D bounding boxes. Combined with camera calibration in Sect. “Camera calibration”, 3D vehicle localization results can be calculated.

Network architecture

The overall architecture of the network is shown in Fig. 3, which contains three parts: backbone, multi-scale feature fusion and multi-task detection head. First, the RGB image is scaled to \(I \in { \mathbb {R} ^{H \times W \times 3}}\) as input, where \(H = W = 512\). Then, the image is downsampled with a factor of \(S = 4\).
Backbone. To make a trade-off between accuracy and efficiency, we use ResNet-50 [36] as our backbone, which contains residual bottleneck structures with good feature extraction capability. We extract the last three feature maps, with 512, 1024, and 2048 channels, for feature fusion.
Multi-scale feature fusion. The series of feature maps obtained by the backbone is hierarchical. High-resolution feature maps retain more accurate local features and are suitable for small object detection, whereas low-resolution feature maps contain higher-level semantic information and are suitable for large object detection. In monocular roadside traffic scenes, vehicles of different sizes are distributed across different locations. To make the network adaptive to vehicles of different sizes, a weighted-fusion module is proposed in multi-scale feature fusion based on RetinaNet [32]. Figure 4 shows the schematic diagram of the weighted-fusion module. The feature maps \({C_3},{C_4},{C_5}\) of size \(64 \times 64\), \(32 \times 32\), and \(16 \times 16\) are extracted by the backbone and used to construct the feature pyramid \({F_p} = \left\{ {{P_3},{P_4},{P_5},{P_6},{P_7}} \right\} \). Different from YOLOv4 [31], which directly retains the feature pyramid, we use deconvolution and upsampling [41] to unify the five feature maps of different sizes in \({F_p}\) to the same size, \(\overline{{F_p}} = \left\{ {\overline{{P_3}},\overline{{P_4}},\overline{{P_5}},\overline{{P_6}},\overline{{P_7}} } \right\} \). Then, the weighted-fusion module combines them with the fusion strategy shown in Eq. 7. Finally, a fused feature map \(F \in {\mathbb {R}^{\frac{H}{S} \times \frac{W}{S} \times 64}}\) containing multi-scale information of vehicles is obtained. This feature fusion module not only generates multi-scale feature maps, but also reduces computational effort in the prediction process and improves overall efficiency.
$$\begin{aligned} F = \sum \limits _{i = 3}^7 {{w_i} \times \overline{{P_i}} } \end{aligned}$$
(7)
where \({w_i}\) denotes the weight of each feature map with \(\sum \nolimits _{i = 3}^7 {{w_i}} = 1\), which can be set according to the importance of each level. We set them to 0.5, 0.2, 0.1, 0.1 and 0.1, respectively, in the experiment.
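A minimal PyTorch sketch of the fusion step in Eq. 7 is given below; it assumes the five pyramid levels have already been reduced to 64 channels and uses bilinear upsampling in place of the paper's deconvolution layers, so it only illustrates the weighted sum itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    def __init__(self, out_size=(128, 128), weights=(0.5, 0.2, 0.1, 0.1, 0.1)):
        super().__init__()
        self.out_size = out_size
        # Fixed weights from the paper; they could also be made learnable.
        self.register_buffer("weights", torch.tensor(weights))

    def forward(self, pyramid):            # pyramid: list of 5 tensors, N x 64 x h_i x w_i
        resized = [F.interpolate(p, size=self.out_size, mode="bilinear",
                                 align_corners=False) for p in pyramid]
        return sum(w * p for w, p in zip(self.weights, resized))   # Eq. 7, N x 64 x H/S x W/S

# Example: five pyramid levels (sizes 64, 32, 16, 8, 4) fused to 128 x 128.
levels = [torch.randn(1, 64, s, s) for s in (64, 32, 16, 8, 4)]
fused = WeightedFusion()(levels)           # -> torch.Size([1, 64, 128, 128])
```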
Multi-task detection head. Based on the multi-scale fusion feature map, a multi-task detection head is designed according to the actual demand for 3D vehicle localization in roadside scenes. The detection head comprises four branches: vehicle type, centroid, eight vertexes, and dimensions of 3D bounding boxes. To further improve the ability to distinguish different types of vehicles, the type branch is implemented by fully convolutional layers and an attention module [42]. The remaining branches are implemented by fully convolutional layers only. Inspired by CenterNet [33], vehicle type is defined as a centroid heatmap \({M_c} \in {[0,1]^{\frac{H}{S} \times \frac{W}{S} \times C}}\), where C denotes the number of vehicle types. The remaining branches are defined as centroid offset \({M_{co}} \in {[0,1]^{\frac{H}{S} \times \frac{W}{S} \times 2}}\), vertexes \({M_v} \in {\mathbb {R}^{\frac{H}{S} \times \frac{W}{S} \times 16}}\) and dimension \({M_s} \in {\mathbb {R}^{\frac{H}{S} \times \frac{W}{S} \times 3}}\). To ensure stability during training, we normalize centroids and vertexes to the fusion feature map of size \(H/S \times W/S\).
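The following sketch illustrates how such a four-branch head can be organized on the 64-channel fused map; the hidden channel width and the omission of the attention module in the type branch are simplifying assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

def branch(out_channels):
    # A small fully convolutional branch: 3x3 conv + ReLU, then 1x1 projection.
    return nn.Sequential(
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(64, out_channels, 1))

class CenterLoc3DHead(nn.Module):
    def __init__(self, num_classes=3):          # car, truck, bus
        super().__init__()
        self.heatmap = branch(num_classes)       # M_c: per-class centroid heatmap
        self.offset = branch(2)                  # M_co: sub-pixel centroid offset
        self.vertexes = branch(16)               # M_v: 8 projected vertexes (u, v)
        self.dimension = branch(3)               # M_s: length, width, height

    def forward(self, fused):                    # fused: N x 64 x H/S x W/S
        return {
            "heatmap": torch.sigmoid(self.heatmap(fused)),
            "offset": torch.sigmoid(self.offset(fused)),   # M_co lies in [0, 1]
            "vertexes": self.vertexes(fused),
            "dimension": self.dimension(fused),
        }
```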

Loss function

In the training process, the loss function contains six components: vehicle classification, centroid offset, vertex, and dimension losses, plus two losses with spatial constraints embedding for improving the precision of 3D vehicle localization. The spatial constraints include a reprojection constraint of the 2D-3D transformation obtained by camera calibration and an IoU constraint of the 3D box projection.
Vehicle classification loss. To solve the problem of imbalanced positive and negative samples, we use focal loss [32] as vehicle classification loss:
$$\begin{aligned} L_c = - \frac{1}{N}\sum \limits _{k = 1}^{C} \sum \limits _{i = 1}^{W/S} \sum \limits _{j = 1}^{H/S} \left\{ \begin{array}{ll} (1 - \hat{p}_{cij})^{\alpha }\log (\hat{p}_{cij}) &{} \text {if } p_{cij} = 1\\ \hat{p}_{cij}^{\alpha }(1 - p_{cij})^{\beta }\log (1 - \hat{p}_{cij}) &{} \text {if } p_{cij} < 1 \end{array} \right. \end{aligned}$$
(8)
where N is the number of positive samples, \(\alpha \) and \(\beta \) are hyper-parameters used to adjust the loss weights of positive and negative samples, which are usually set to 2 and 4. \({p_{cij}}\) is the response value of each ground truth vehicle in the heatmap described by the Gaussian kernel function \({e^{ - \frac{{{{(x - {p_{cijx}})}^2} + {{(y - {p_{cijy}})}^2}}}{{2{\sigma ^2}}}}}\). \(\sigma \) is the standard deviation calculated from the ground truth vehicle dimension in image space [33].
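A compact sketch of Eq. 8 in the CenterNet style is shown below, assuming pred is the sigmoid heatmap and gt the Gaussian-splatted ground-truth heatmap described above.

```python
import torch

def heatmap_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-7):
    pred = pred.clamp(eps, 1 - eps)
    pos = gt.eq(1).float()                    # positive: exact centroid pixels (p_cij = 1)
    neg = 1.0 - pos
    pos_loss = ((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = (pred ** alpha) * ((1 - gt) ** beta) * torch.log(1 - pred) * neg
    num_pos = pos.sum().clamp(min=1)          # N in Eq. 8
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```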
3D Vehicle information regression loss. We use L1 regression loss for vehicle centroid offset, vertex, and dimension loss:
$$\begin{aligned} L_{co} = \frac{1}{N}\sum \limits _{i = 1}^{W/S} \sum \limits _{j = 1}^{H/S} \mathbbm {1}_{ij}^{\text {obj}} \left| M_{co}^{ij} - \left( p_{\text {center}}/S - \tilde{p}_{\text {center}} \right) \right| \end{aligned}$$
(9)
$$\begin{aligned} L_{v} = \frac{1}{N}\sum \limits _{i = 1}^{W/S} \sum \limits _{j = 1}^{H/S} \mathbbm {1}_{ij}^{\text {obj}} \left| M_{v}^{ij} - p_{\text {vertex}}/S \right| \end{aligned}$$
(10)
$$\begin{aligned} L_{s} = \frac{1}{N}\sum \limits _{i = 1}^{W/S} \sum \limits _{j = 1}^{H/S} \mathbbm {1}_{ij}^{\text {obj}} \left| M_{s}^{ij} - \tilde{M}_{s}^{ij} \right| \end{aligned}$$
(11)
where \(\mathbbm {1}_{ij}^\textrm{obj}\) denotes whether a centroid appears at location (i, j). \({p_\textrm{center}}\) and \({p_\textrm{vertex}}\) denote the ground truth centroids and vertexes of 3D bounding boxes in the input image of size \(H \times W\). \({{\tilde{p}}_\textrm{center}}\) denotes the ground truth centroids in the fusion feature map of size \(H/S \times W/S\). \({{\tilde{M}}_s}\) denotes the ground truth feature map of 3D vehicle dimensions.
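Equations 9–11 share the same masked L1 form, which can be sketched as follows; the mask tensor plays the role of the indicator \(\mathbbm {1}_{ij}^\textrm{obj}\) and is an assumed representation of the ground-truth centroid locations.

```python
import torch

def masked_l1(pred_map, target_map, mask):
    # pred_map/target_map: N x C x H/S x W/S; mask: N x 1 x H/S x W/S with 1 at centroid cells.
    num_obj = mask.sum().clamp(min=1)                      # N in Eqs. 9-11
    return (torch.abs(pred_map - target_map) * mask).sum() / num_obj
```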
Loss with spatial constraints embedding. To further improve the precision of 3D vehicle localization, loss functions with spatial constraints embedding are designed, including reprojection constraints using camera calibration and vehicle IoU constraints.
The schematic diagram of reprojection constraint and vehicle IoU constraint is shown in Fig. 5. We define the predicted and ground truth vertexes in image as \(p_i^\textrm{pred} = (u_i^\textrm{pred},v_i^\textrm{pred})\) and \(p_i^{gt} = (u_i^{gt},v_i^{gt})\). The corresponding vertexes in world are \(P_i^\textrm{pred} = (x_i^\textrm{pred},y_i^\textrm{pred},z_i^\textrm{pred})\) and \(P_i^{gt} = (x_i^{gt},y_i^{gt},z_i^{gt})\). The projection of predicted vertexes in world and image are defined as \(P_i^\textrm{proj} = (x_i^\textrm{proj},y_i^\textrm{proj},z_i^\textrm{proj})\) and \(p_i^\textrm{proj} = (u_i^\textrm{proj},v_i^\textrm{proj})\), \(i = 1,2, \cdots ,8\). The predicted dimension is \(D_v^\textrm{pred} = (l_v^\textrm{pred},w_v^\textrm{pred},h_v^\textrm{pred})\). Based on \(p_i^\textrm{pred}\) from CenterLoc3D, \(P_i^\textrm{proj}\) can be obtained by Eq. 6 and Table 1. Then, \(p_i^\textrm{proj}\) can be obtained by Eq. 5. As shown in Fig. 5a, \(p_i^\textrm{proj}\) and \(p_i^\textrm{pred}\) constitute the projection constraints, where predicted, ground truth, and projection boxes are represented in blue, red, and green, respectively.
The minimum external rectangles of predicted and ground truth vertexes are \(v_{rec}^\textrm{pred}\) and \(v_{rec}^{gt}\). As shown in Fig. 5b, blue indicates the predicted box while red is the ground truth. The green area indicates the overlap between the two boxes.
Table 1 Calculation of 3D bounding box projection vertexes in world space

| Vertex | World coordinate |
| --- | --- |
| \(P_1^\textrm{proj}\) | \((x_2^{gt} + w_v^\textrm{pred},y_2^{gt},z_2^{gt})\) |
| \(P_2^\textrm{proj}\) | \((x_2^{gt},y_2^{gt},z_2^{gt})\) |
| \(P_3^\textrm{proj}\) | \((x_2^{gt},y_2^{gt} + l_v^\textrm{pred},z_2^{gt})\) |
| \(P_4^\textrm{proj}\) | \((x_2^{gt} + w_v^\textrm{pred},y_2^{gt} + l_v^\textrm{pred},z_2^{gt})\) |
| \(P_5^\textrm{proj}\) | \((x_2^{gt} + w_v^\textrm{pred},y_2^{gt},z_2^{gt} + h_v^\textrm{pred})\) |
| \(P_6^\textrm{proj}\) | \((x_2^{gt},y_2^{gt},z_2^{gt} + h_v^\textrm{pred})\) |
| \(P_7^\textrm{proj}\) | \((x_2^{gt},y_2^{gt} + l_v^\textrm{pred},z_2^{gt} + h_v^\textrm{pred})\) |
| \(P_8^\textrm{proj}\) | \((x_2^{gt} + w_v^\textrm{pred},y_2^{gt} + l_v^\textrm{pred},z_2^{gt} + h_v^\textrm{pred})\) |
The loss comprised of two constraints shown in Fig. 5 is defined as follows:
$$\begin{aligned} L_{\text {proj}} = \frac{1}{N}\sum \limits _{i = 1}^{W/S} \sum \limits _{j = 1}^{H/S} \mathbbm {1}_{ij}^{\text {obj}} \left| M_{\text {proj}}^{ij}(H, M_{s}^{ij}) - \bar{M}_{v}^{ij} \right| \end{aligned}$$
(12)
$$\begin{aligned} L_{\text {iou}} = \frac{1}{N}\sum \limits _{i = 1}^{W/S} \sum \limits _{j = 1}^{H/S} \mathbbm {1}_{ij}^{\text {obj}} \cdot IoU\left( R_{\text {bbox}}^{ij}, \tilde{R}_{\text {bbox}}^{ij}\right) \end{aligned}$$
(13)
where \(M_\textrm{proj}^{ij}(H,M_s^{ij})\) denotes the feature map of projection vertexes calculated by the camera calibration matrix H in Sect. “Camera calibration” and the vehicle dimension feature map \({M_s}\) predicted by the network. \({\bar{M}_v}\) denotes the predicted 3D bounding box vertexes in the original image. \(R_\textrm{bbox}^{ij}\) and \({\tilde{R}}_\textrm{bbox}^{ij}\) denote the predicted and ground truth minimum external rectangles. IoU denotes the IoU loss strategy; we use CIoU loss [43] in the experiment.
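To make the reprojection constraint concrete, the per-vehicle sketch below builds the eight world-space vertexes of Table 1 from the ground-truth vertex \(P_2\) and the predicted dimensions, projects them with Eq. 5, and compares them with the predicted image vertexes; it is an illustrative, unbatched version of Eq. 12, not the network's internal implementation.

```python
import numpy as np

def reprojection_residual(H, p2_gt_world, dims_pred, vertexes_pred_2d):
    x2, y2, z2 = p2_gt_world                  # ground-truth vertex P2 = (x2, y2, z2)
    l, w, h = dims_pred                       # predicted length, width, height
    corners = np.array([                      # Table 1, vertexes P1..P8
        [x2 + w, y2,     z2],     [x2,     y2,     z2],
        [x2,     y2 + l, z2],     [x2 + w, y2 + l, z2],
        [x2 + w, y2,     z2 + h], [x2,     y2,     z2 + h],
        [x2,     y2 + l, z2 + h], [x2 + w, y2 + l, z2 + h],
    ])
    homo = np.hstack([corners, np.ones((8, 1))])
    proj = (H @ homo.T).T                     # Eq. 5: project into the image
    proj_2d = proj[:, :2] / proj[:, 2:3]
    # L1 residual against the 8 predicted image vertexes (shape 8 x 2).
    return np.abs(proj_2d - vertexes_pred_2d).mean()
```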
With the above six loss functions, the multi-task loss function can be defined as follows:
$$\begin{aligned} L = {\lambda _c}{L_c} + {\lambda _{co}}{L_{co}} + {\lambda _v}{L_v} + {\lambda _s}{L_s} + {\lambda _\textrm{proj}}{L_\textrm{proj}} + {\lambda _\textrm{iou}}{L_\textrm{iou}} \end{aligned}$$
(14)
where \(\lambda \) is the weight to balance the loss of each component. We set \({\lambda _c} = 1\), \({\lambda _{co}} = 1\), \({\lambda _v} = 0.1\), \({\lambda _s} = 0.1\), \({\lambda _\textrm{proj}} = 0.1\) and \({\lambda _\textrm{iou}} = 1\) in the experiment.

Dataset of 3D vehicle localization

Most currently available 3D vehicle localization datasets are based on onboard views [44–46] instead of roadside views, which makes large-scale 3D perception and the validation of roadside vehicle perception methods difficult. Therefore, we propose a 3D vehicle localization dataset (SVLD-3D) for roadside surveillance cameras and an annotation tool (LabelImg-3D) for experimental validation.

Dataset composition

Scenes in SVLD-3D dataset are from BrnoCompSpeed [47] and self-collected urban scenes with resolution of \({{1920}} \times {{1080}}\) and \({{1080}} \times {{720}}\), respectively. BrnoCompSpeed is public and provided by the Brno University of Technology, which contains six highway scenes with three views (left, center, and right). Each frame contains ground truth road marks and vanishing points (used for camera calibration (Sect. “Camera Calibration”)). Each vehicle contains ground truth location and speed collected by Lidar and GPS (used for vehicle dimension annotation (Sect. “Label process”) and training loss function (Sect. “Loss Function”)). However, only highway scenes with low traffic volumes and single-vehicle types exist in BrnoCompSpeed. To further expand dataset diversity, urban scenes with more vehicle types, congestion, and occlusion are also included in SVLD-3D.
SVLD-3D contains five typical scenes with three vehicle types (car, truck, and bus), with a total of 14,593 images in the training and validation dataset and 2273 images in the test dataset. Some samples in SVLD-3D are shown in Fig. 6. Table 2 shows detailed information of different scenes in the dataset, where the effective field of view \({D_r} = ({D_{ry}},{D_{rx}})\) is the maximum distance that the roadside camera can perceive along and perpendicular to the road direction.
Table 2 Details of different scenes in SVLD-3D (the last four columns are the camera calibration parameters)

| Scene | \({D_{ry}}\) | \({D_{rx}}\) | f | \(\phi \) (rad) | \(\theta \) (rad) | h (mm) |
| --- | --- | --- | --- | --- | --- | --- |
| A | 120 | 25 | 2878.13 | 0.17874 | 0.26604 | 10119.08 |
| B | 120 | 25 | 3994.17 | 0.15717 | 0.35346 | 8071.00 |
| C | 60 | 15 | 3384.25 | 0.26295 | -0.24869 | 8126.49 |
| D | 80 | 10 | 3743.78 | 0.11225 | -0.07516 | 7353.40 |
| E | 60 | 10 | 1142.26 | 0.33372 | 0.14387 | 7166.44 |

Label process

Since the original data in SVLD-3D only contains images and 3D vehicle location provided by Lidar, we develop an annotation tool LabelImg-3D and relabel vehicles to obtain ground truth vehicle type, centroid, vertexes, and dimensions of 3D bounding boxes. The ground truth 3D and 2D vehicle centroid, and dimension are denoted as \(P_\textrm{cen}^{gt} = (x_\textrm{cen}^{gt},y_\textrm{cen}^{gt},z_\textrm{cen}^{gt})\), \(p_\textrm{cen}^{gt} = (u_\textrm{cen}^{gt},v_\textrm{cen}^{gt})\) and \(D_v^{gt} = (l_v^{gt},w_v^{gt},h_v^{gt})\).
Compared with the variable views of autonomous vehicles, roadside cameras usually follow certain installation standards and the pan angle is relatively fixed. Therefore, only scenes with typical pan angles are selected. To reduce labelling effort, previous work [48] and 2D bounding boxes provided by YOLOv4 [31] are used as guidance. The specific labelling steps are as follows:
(1) Vehicle type. If the vehicle type obtained by YOLOv4 is consistent with the ground truth, label this type as ground truth. Otherwise, adjust to the correct type in LabelImg-3D.
(2) Vehicle dimension. First, YOLOv4 is used to obtain 2D vehicle bounding boxes as guidance. Second, the geometric constraint that 3D boxes fit closely to 2D boxes is used to adjust the vehicle dimension via camera calibration. At the same time, vehicle dimensions can also be obtained by observing vehicle models and referring to relevant documents during the labelling process. These two sub-steps help to obtain more accurate vehicle dimensions \(D_v^{gt}\).
(3) Vehicle centroid. \(p_\textrm{cen}^{gt}\) can be obtained by Eq. 5 when \(P_\textrm{cen}^{gt}\) is known from lidar.
(4) Vehicle vertexes. Based on \(P_\textrm{cen}^{gt}\) and \(D_v^{gt}\) in step (2), \(p_i^{gt}\) can be obtained by Table 3 and Eq. 5, as sketched after Table 3.
Table 3 Calculation of 3D bounding box ground truth vertexes in world space

| Vertex | World coordinate |
| --- | --- |
| \(P_1^{gt}\) | \((x_\textrm{cen}^{gt} + {{w_v^{gt}}/2},y_\textrm{cen}^{gt} - {{l_v^{gt}}/2},z_\textrm{cen}^{gt} - {{h_v^{gt}}/2})\) |
| \(P_2^{gt}\) | \((x_\textrm{cen}^{gt} - {{w_v^{gt}}/2},y_\textrm{cen}^{gt} - {{l_v^{gt}}/2},z_\textrm{cen}^{gt} - {{h_v^{gt}}/2})\) |
| \(P_3^{gt}\) | \((x_\textrm{cen}^{gt} - {{w_v^{gt}}/2},y_\textrm{cen}^{gt} + {{l_v^{gt}}/2},z_\textrm{cen}^{gt} - {{h_v^{gt}}/2})\) |
| \(P_4^{gt}\) | \((x_\textrm{cen}^{gt} + {{w_v^{gt}}/2},y_\textrm{cen}^{gt} + {{l_v^{gt}}/2},z_\textrm{cen}^{gt} - {{h_v^{gt}}/2})\) |
| \(P_5^{gt}\) | \((x_\textrm{cen}^{gt} + {{w_v^{gt}}/2},y_\textrm{cen}^{gt} - {{l_v^{gt}}/2},z_\textrm{cen}^{gt} + {{h_v^{gt}}/2})\) |
| \(P_6^{gt}\) | \((x_\textrm{cen}^{gt} - {{w_v^{gt}}/2},y_\textrm{cen}^{gt} - {{l_v^{gt}}/2},z_\textrm{cen}^{gt} + {{h_v^{gt}}/2})\) |
| \(P_7^{gt}\) | \((x_\textrm{cen}^{gt} - {{w_v^{gt}}/2},y_\textrm{cen}^{gt} + {{l_v^{gt}}/2},z_\textrm{cen}^{gt} + {{h_v^{gt}}/2})\) |
| \(P_8^{gt}\) | \((x_\textrm{cen}^{gt} + {{w_v^{gt}}/2},y_\textrm{cen}^{gt} + {{l_v^{gt}}/2},z_\textrm{cen}^{gt} + {{h_v^{gt}}/2})\) |
Finally, vehicle type, centroid, vertexes, and dimensions are recorded in the annotation file. Figure 7 shows the flowchart of the labelling process and a visualization of the annotations.
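A sketch of step (4) is given below: the eight ground-truth world-space vertexes of Table 3 are built from the lidar centroid and the annotated dimensions and can then be projected into the image with Eq. 5; the helper name is illustrative, not part of LabelImg-3D.

```python
import numpy as np

def gt_vertexes(centroid, dims):
    xc, yc, zc = centroid                     # P_cen^gt from lidar
    l, w, h = dims                            # annotated D_v^gt (length, width, height)
    signs = [(+1, -1, -1), (-1, -1, -1), (-1, +1, -1), (+1, +1, -1),
             (+1, -1, +1), (-1, -1, +1), (-1, +1, +1), (+1, +1, +1)]
    return np.array([[xc + sx * w / 2, yc + sy * l / 2, zc + sz * h / 2]
                     for sx, sy, sz in signs])   # P_1^gt ... P_8^gt per Table 3
```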

Experimental protocols

In this section, we introduce implementation details and evaluation metrics for our experiments.

Implementation details

We implement our network using the PyTorch platform with a Core i7-8700 CPU and one GTX 1080Ti GPU. The original image is scaled to \(512 \times 512\) for training and testing. We split the dataset into a training set and a validation set with a ratio of 9:1. We use the Adam optimizer with a base learning rate of 0.001 for 100 epochs. The learning rate is reduced by a factor of 10 when the validation loss does not decrease for three continuous epochs. A model pretrained on ImageNet is used for fine-tuning. We train our network with weight freezing in the first 60 epochs with a batch size of 16; the batch size drops by \(2 \times \) in the remaining epochs. When the validation loss does not decrease for 7 continuous epochs, training is stopped by early stopping.
We use random color jitter, horizontal flip, and perspective transformation for image augmentation. As shown in Fig. 8, the perspective transformation simulates roadside view changes according to the camera imaging principle. After data augmentation, the training and validation set can be expanded to 58,372 images, which quadruples the original volume.
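The perspective-transformation augmentation can be sketched as follows: the four image corners are jittered and the induced homography is applied to the image and to all annotated 2D points. The jitter range and this corner-based formulation are assumptions, since the paper derives its warps from the camera imaging principle rather than from random corner offsets.

```python
import cv2
import numpy as np

def random_perspective(image, points_2d, max_shift=0.05):
    h, w = image.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    jitter = np.random.uniform(-max_shift, max_shift, (4, 2)) * [w, h]
    dst = (src + jitter).astype(np.float32)
    M = cv2.getPerspectiveTransform(src, dst)              # 3x3 homography
    warped = cv2.warpPerspective(image, M, (w, h))
    pts = cv2.perspectiveTransform(points_2d.reshape(-1, 1, 2).astype(np.float32), M)
    return warped, pts.reshape(-1, 2)                       # warped image + warped annotations
```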

Evaluation metrics

Evaluation metrics include average precision (AP), frame per second (FPS), precision and error of 3D vehicle localization and dimension prediction.

Average precision and speed

Referring to evaluation metrics of existing 3D vehicle detection datasets, we use \(A{P_{3D}}\) [34] to evaluate 3D average precision and FPS to evaluate speed.
\(A{P_{3D}}\) is similar to the \(A{P_{2D}}\) used in 2D detection on the VOC dataset [49]; the difference is that 3D IoU is used instead of 2D IoU when calculating \(A{P_{3D}}\). The equation of AP can be expressed as:
$$\begin{aligned} \begin{array}{l} AP = \frac{1}{{11}}\sum \limits _{r \in \{ 0,0.1, \ldots ,1\} } {{p_{{\textrm{interp}}}}(r)} \\ {p_{{\textrm{interp}}}}(r) = \mathop {\max }\limits _{{\tilde{r}}:{\tilde{r}} \ge r} p({\tilde{r}}) \end{array} \end{aligned}$$
(15)
The precision at each recall level r is interpolated by taking the maximum precision measured for a method whose corresponding recall exceeds r, where \(p({\tilde{r}})\) is the measured precision at recall \({\tilde{r}}\).
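A small sketch of the 11-point interpolation in Eq. 15, assuming recalls and precisions are the points of the precision-recall curve obtained with the 3D IoU matching criterion:

```python
import numpy as np

def ap_11_point(recalls, precisions):
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):       # r in {0, 0.1, ..., 1}
        mask = recalls >= r
        p_interp = precisions[mask].max() if mask.any() else 0.0
        ap += p_interp / 11.0
    return ap
```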
The equation of FPS is defined as:
$$\begin{aligned} \textrm{FPS} = {1 / {{t_\textrm{proc}}}} \end{aligned}$$
(16)
where \(t_\textrm{proc}\) represents the processing time (in seconds) of the network for a single frame.

3D vehicle localization precision and error

Combined with camera calibration in Sect. “Camera calibration”, 3D vehicle centroids can be further obtained by Eq. 6 and used for 3D vehicle localization.
The predicted and ground-truth 3D vehicle centroid are denoted as \(P_\textrm{cen}^\textrm{pred} = (x_\textrm{cen}^\textrm{pred},y_\textrm{cen}^\textrm{pred},z_\textrm{cen}^\textrm{pred})\) and \(P_\textrm{cen}^{gt} = (x_\textrm{cen}^{gt},y_\textrm{cen}^{gt},z_\textrm{cen}^{gt})\).
3D vehicle localization precision and error can be defined as follows:
$$\begin{aligned} P_{\text {loc}} = \left( 1 - \sum \limits _{k \in \{ x,y\} } \frac{\left| k_{\text {cen}}^{\text {pred}} - k_{\text {cen}}^{gt} \right| }{D_{rk}/2} \right) \times 100\% \end{aligned}$$
(17)
$$\begin{aligned} E_{\text {loc}} = \sum \limits _{k \in \{ x,y\} } \left| k_{\text {cen}}^{\text {pred}} - k_{\text {cen}}^{gt} \right| \end{aligned}$$
(18)
where \({D_r}\) can be found in Table 2.
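For a single vehicle, Eqs. 17 and 18 reduce to the following sketch, where D_ry and D_rx are the effective field-of-view values of the scene from Table 2:

```python
def localization_metrics(pred_cen, gt_cen, D_ry, D_rx):
    ex = abs(pred_cen[0] - gt_cen[0])          # X-axis error
    ey = abs(pred_cen[1] - gt_cen[1])          # Y-axis error
    precision = (1.0 - (ex / (D_rx / 2.0) + ey / (D_ry / 2.0))) * 100.0   # Eq. 17
    error = ex + ey                                                        # Eq. 18
    return precision, error
```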

3D vehicle dimension precision and error

We design a 3D vehicle dimension regression branch in the network, which can be used for 3D vehicle dimension prediction, including length, width, and height in meters.
The predicted and ground-truth 3D dimension are denoted as \(D_v^\textrm{pred} = (l_v^\textrm{pred},w_v^\textrm{pred},h_v^\textrm{pred})\) and \(D_v^{gt} = (l_v^{gt},w_v^{gt},h_v^{gt})\).
3D vehicle dimension precision and error can be defined as follows:
$$\begin{aligned} P_{\text {dim}} = \left( 1 - \sum \limits _{k \in \{ l_v, w_v, h_v\} } \frac{\left| k^{\text {pred}} - k^{gt} \right| }{k^{gt}} \right) \times 100\% \end{aligned}$$
(19)
$$\begin{aligned} E_{\text {dim}} = \sum \limits _{k \in \{ l_v, w_v, h_v\} } \left| k^{\text {pred}} - k^{gt} \right| \end{aligned}$$
(20)
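Similarly, Eqs. 19 and 20 for a single vehicle can be sketched as:

```python
def dimension_metrics(pred_dim, gt_dim):
    # pred_dim, gt_dim: (length, width, height) in metres.
    rel = sum(abs(p - g) / g for p, g in zip(pred_dim, gt_dim))
    precision = (1.0 - rel) * 100.0                              # Eq. 19
    error = sum(abs(p - g) for p, g in zip(pred_dim, gt_dim))    # Eq. 20
    return precision, error
```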

Results and discussions

In this section, we provide experimental results, an ablation study and discussions to demonstrate the effectiveness of CenterLoc3D.

Average precision and speed of CenterLoc3D

\(A{P_{3D}}\) and FPS of different monocular 3D vehicle detection methods on the validation and test sets are compared in Table 4; they are calculated by Eqs. 15 and 16. In addition, FLOPs and parameter count are also important metrics: our network has 28.61 GFLOPs and 34.95 M parameters. The KITTI validation and test sets are used for the onboard scenes at the easy, moderate, and hard settings, which are determined by vehicle size in image space. We use the proposed SVLD-3D dataset for experimental validation. The IoU thresholds are 0.5 and 0.7, respectively. The SVLD-3D dataset has only one validation set without difficulty settings.
Table 4 Comparison of \(A{P_{3D}}\) and FPS of different monocular 3D vehicle detection methods (IoU > 0.5 entries are [val1/val2]; IoU > 0.7 entries are [val1/val2/test])

| Method | Scene | Backbone | GPU | IoU > 0.5 Easy | IoU > 0.5 Moderate | IoU > 0.5 Hard | IoU > 0.7 Easy | IoU > 0.7 Moderate | IoU > 0.7 Hard | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MonoGRNet [13] | Onboard | VGG-16 | GTX Titan X | 50.51/54.21 | 36.97/39.69 | 30.82/33.06 | 13.88/24.97/– | 10.19/19.44/– | 7.62/16.30/– | 16.7 |
| Deep3DBox [19] | Onboard | VGG-16 | – | 27.04/– | 20.55/– | 15.88/– | 5.85/–/– | 4.10/–/– | 3.84/–/– | – |
| GS3D [22] | Onboard | VGG-16 | – | 32.15/30.60 | 29.89/26.40 | 26.19/22.89 | 13.46/11.63/7.69 | 10.97/10.51/6.29 | 10.38/10.51/6.16 | 0.4 |
| RTM3D [25] | Onboard | ResNet-18 | GTX 1080Ti ×2 | 47.43/46.52 | 33.86/32.61 | 31.04/30.95 | 18.13/18.38/– | 14.14/14.66/– | 13.33/12.35/– | 28.6 |
| RTM3D [25] | Onboard | DLA-34 | GTX 1080Ti ×2 | 54.36/52.59 | 41.90/40.96 | 35.84/34.95 | 20.77/19.47/13.61 | 16.86/16.29/10.09 | 16.63/15.57/8.18 | 18.2 |
| SMOKE [15] | Onboard | DLA-34 | GTX Titan X ×4 | – | – | – | 14.76/19.99/14.03 | 12.85/15.61/9.76 | 11.50/15.28/7.84 | 33.3 |
| KM3D [28] | Onboard | ResNet-18 | GTX 1080Ti | 47.23/47.13 | 34.12/33.31 | 31.51/25.84 | 19.48/18.34/12.65 | 15.32/14.91/8.39 | 13.88/12.58/7.12 | 47.6 |
| KM3D [28] | Onboard | DLA-34 | GTX 1080Ti | 56.02/54.09 | 43.13/43.07 | 36.77/37.56 | 22.50/22.71/16.73 | 19.60/17.71/11.45 | 17.12/16.15/9.92 | 25.0 |
| Lite-FPN [27] | Onboard | ResNet-18 | GTX 2080Ti | – | – | – | 17.04/–/– | 14.02/–/– | 12.23/–/– | 88.57 |
| Lite-FPN [27] | Onboard | ResNet-34 | GTX 2080Ti | – | – | – | 18.01/–/15.32 | 15.29/–/10.64 | 14.28/–/8.59 | 71.32 |
| Lite-FPN [27] | Onboard | DLA-34 | GTX 2080Ti | – | – | – | 19.31/–/– | 16.19/–/– | 15.47/–/– | 42.37 |
| Ours | Roadside | ResNet-50 | GTX 1080Ti | 91.34/– | – | – | 79.36/–/51.30 | – | – | 41.18 |

For Ours, SVLD-3D has a single validation set without difficulty settings, so one value is reported per IoU threshold.
Figure 9 illustrates visualization results on the SVLD-3D test set. Different views and types of vehicles in the SVLD-3D test set are tested, with occlusion from the environment and other vehicles, and vehicles are widely distributed in the scenes. From Table 4 and Fig. 9, all detected vehicles respond with a red circle in the heatmaps. It can be seen that our network achieves real-time 3D vehicle detection and is adaptive to occlusion and small vehicles.
3D vehicle information which can be obtained by different methods is compared in Table 5. It can be seen that our method obtains not only 3D bounding boxes but also 3D centroids, dimensions, and 3D locations, with real-time performance.
Table 5 Comparison of 3D vehicle information obtained by different methods

| Method | BBox | Centroid | Dimension | Location |
| --- | --- | --- | --- | --- |
| MonoGRNet [13] | \(\checkmark \) |  | \(\checkmark \) | \(\checkmark \) |
| Deep3DBox [19] | \(\checkmark \) |  |  |  |
| GS3D [22] | \(\checkmark \) |  | \(\checkmark \) |  |
| RTM3D [25] | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) |  |
| SMOKE [15] | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) |  |
| KM3D [28] | \(\checkmark \) |  | \(\checkmark \) |  |
| Lite-FPN [27] | \(\checkmark \) |  | \(\checkmark \) |  |
| Ours | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) |

3D vehicle localization precision and error of CenterLoc3D

Table 6 and Fig. 11 show the 3D vehicle localization results and precision for different scenes and vehicle types in the SVLD-3D test set, calculated by Eq. 17. It can be seen that our network also achieves good results in 3D vehicle localization, with an average precision of 98%.
Top views of 3D vehicle localization for different frames in the SVLD-3D test set are shown in Fig. 10, which also shows that our network has high precision in 3D vehicle localization. Each sub-figure in Fig. 10 represents a different frame in a different scene. At the same time, vehicles far from the roadside camera can also be detected and localized.
The 3D vehicle localization error is calculated by Eq. 18. In Fig. 12, we can see that the 3D vehicle localization error increases as the distance between vehicles and the roadside camera grows in the different scenes of the SVLD-3D test set. The Y-axis error is the largest among the X-, Y-, and Z-axis errors, and the largest error occurs in scene B. This is because the camera pan angle in scene B is closer to \(0^{\circ }\) than in the other scenes, which leads to incomplete vehicle feature learning along the vehicle length direction.

3D vehicle dimension precision and error of CenterLoc3D

Table 7 and Fig. 13 show the 3D vehicle dimension prediction results and precision for different scenes and vehicle types in the SVLD-3D test set, calculated by Eq. 19. It can be seen that our network also achieves good results in 3D vehicle dimension prediction, with an average precision of 85%.
In Fig. 14, we can see that the 3D vehicle dimension error increases as the distance between vehicles and the roadside camera grows in the different scenes of the SVLD-3D test set. The vehicle length error is the largest among length, width, and height. Since the driving directions of vehicles are usually parallel to the road direction, partial feature occlusion exists along the vehicle length direction; therefore, the dimension prediction error in the length direction is larger than in the width and height directions.
The 3D vehicle dimension prediction error is calculated by Eq. 20. The errors of different methods are shown in Table 8, from which it can be seen that our network has certain advantages in 3D vehicle dimension prediction.
Table 6 3D vehicle localization results and precision on SVLD-3D test set

| Vehicle | Type | \({P_\textrm{centroid}}\) | \({{\widetilde{P}}_\textrm{centroid}}\) | Precision |
| --- | --- | --- | --- | --- |
| 1 | Car | 24.927, 88.430, 0.673 | 24.952, 88.519, 0.760 | 0.996 |
| 2 | Car | 8.611, 68.119, 0.717 | 8.585, 67.729, 0.770 | 0.991 |
| 3 | Car | -0.015, 82.595, 0.755 | -0.015, 82.595, 0.755 | 0.999 |
| 4 | Car | 8.188, 105.823, 0.780 | 8.217, 105.714, 0.825 | 0.996 |
| 5 | Car | 11.508, 72.225, 0.731 | 11.434, 71.608, 0.785 | 0.984 |
| 6 | Car | 18.829, 62.322, 0.727 | 18.791, 61.883, 0.790 | 0.990 |
| 7 | Truck | 21.315, 43.322, 0.958 | 21.219, 43.156, 0.875 | 0.990 |
| 8 | Car | 8.538, 38.425, 0.735 | 8.451, 38.100, 0.730 | 0.988 |
| 9 | Car | 18.144, 67.893, 0.674 | 18.165, 68.067, 0.700 | 0.995 |
| 10 | Car | 0.336, 43.784, 0.730 | 0.382, 43.959, 0.700 | 0.993 |
| 11 | Car | 3.812, 46.123, 0.697 | 3.794, 46.043, 0.710 | 0.997 |
| 12 | Car | 11.035, 65.584, 0.730 | 11.261, 65.935, 0.740 | 0.976 |
| 13 | Car | 0.142, 81.101, 0.721 | 0.100, 80.344, 0.770 | 0.984 |
| 14 | Car | -14.298, 59.759, 0.662 | -14.297, 59.758, 0.680 | 0.999 |
| 15 | Car | -6.232, 39.097, 0.672 | -6.193, 39.138, 0.750 | 0.993 |
| 16 | Car | -9.671, 38.371, 0.705 | -9.754, 38.703, 0.665 | 0.978 |
| 17 | Car | -6.249, 56.957, 0.681 | -6.275, 56.747, 0.690 | 0.989 |
| 18 | Car | -10.033, 64.324, 0.740 | -10.032, 64.321, 0.740 | 0.999 |
| 19 | Car | -1.770, 53.300, 0.756 | -1.820, 52.683, 0.860 | 0.975 |
| 20 | Bus | -5.174, 71.789, 1.452 | -5.341, 73.016, 1.410 | 0.936 |
| 21 | Car | -7.645, 57.673, 0.800 | -7.593, 57.247, 0.800 | 0.975 |
| 22 | Car | 0.356, 22.313, 0.739 | 0.340, 22.388, 0.670 | 0.994 |
| 23 | Car | 0.862, 37.990, 0.735 | 0.804, 37.053, 0.765 | 0.957 |
| 24 | Car | 1.376, 40.059, 0.825 | 1.412, 40.053, 0.860 | 0.993 |
Table 7 3D vehicle dimension prediction results and precision on SVLD-3D test set

| Vehicle | Type | \({D_v}\) | \({{\widetilde{D}}_v}\) | Precision |
| --- | --- | --- | --- | --- |
| 1 | Car | 3.60, 1.71, 1.37 | 3.79, 1.70, 1.27 | 0.860 |
| 2 | Car | 3.26, 1.67, 1.31 | 3.18, 1.61, 1.25 | 0.890 |
| 3 | Car | 4.05, 1.76, 1.40 | 3.92, 1.80, 1.40 | 0.942 |
| 4 | Car | 4.51, 1.81, 1.47 | 4.42, 1.88, 1.46 | 0.935 |
| 5 | Car | 4.43, 1.78, 1.37 | 4.40, 1.78, 1.48 | 0.915 |
| 6 | Car | 4.74, 1.80, 1.46 | 4.33, 1.77, 1.40 | 0.843 |
| 7 | Car | 4.58, 1.82, 1.45 | 4.96, 1.86, 1.54 | 0.845 |
| 8 | Car | 4.50, 1.79, 1.40 | 4.50, 1.70, 1.36 | 0.912 |
| 9 | Car | 3.74, 1.64, 1.27 | 4.07, 1.68, 1.30 | 0.872 |
| 10 | Car | 4.55, 1.80, 1.42 | 4.58, 1.68, 1.40 | 0.910 |
| 11 | Car | 3.57, 1.80, 1.35 | 4.11, 1.80, 1.38 | 0.844 |
| 12 | Car | 3.71, 1.76, 1.36 | 3.90, 1.80, 1.33 | 0.912 |
| 13 | Car | 3.34, 1.77, 1.32 | 3.70, 1.76, 1.25 | 0.838 |
| 14 | Bus | 12.83, 2.71, 2.75 | 12.00, 2.76, 2.82 | 0.886 |
| 15 | Car | 4.74, 1.87, 1.48 | 4.77, 1.83, 1.53 | 0.939 |
| 16 | Car | 5.00, 1.89, 1.48 | 4.75, 1.86, 1.56 | 0.880 |
| 17 | Car | 4.69, 1.84, 1.44 | 4.60, 1.81, 1.37 | 0.914 |
| 18 | Car | 4.68, 1.85, 1.43 | 4.56, 1.81, 1.34 | 0.885 |
| 19 | Car | 4.64, 1.84, 1.45 | 4.68, 1.82, 1.50 | 0.947 |
| 20 | Bus | 12.74, 2.68, 2.62 | 12.00, 2.76, 2.82 | 0.838 |

Ablation study of CenterLoc3D

To further validate the effect of the weighted-fusion module and the loss with spatial constraints embedding in our network, ablation experiments are conducted on the SVLD-3D validation and test sets. CenterNet [33] with ResNet-50 [36] is used as the baseline, and the proposed modules are added one by one for validation. Ablation results are shown in Table 9. In the improvement column, only \(A{P_{3D}}\) at the IoU threshold of 0.7 on the test set is listed for comparison. From Table 9, we can see that adding the three modules one by one outperforms the former model by 6.64%, 2.48%, and 5.98% in \(A{P_{3D}}\), respectively. Therefore, the conclusions can be summarized as follows: (1) The weighted-fusion module not only makes the network adaptive to different vehicle sizes but also increases its generalization capability. (2) The spatial constraints of camera calibration and vehicle IoU in the loss help accurate 3D bounding box learning.
Figure 15 shows 3D vehicle localization error of ablation study on SVLD-3D test set. Only three views (left, middle and right) of BrnoCompSpeed scenes in SVLD-3D test set are included in this study. In this figure, it can be seen that 3D vehicle localization error from \({M_\textrm{base}}\) to \({M_3}\) in the same scene decreases gradually, which indicates that the designed modules can effectively reduce 3D vehicle localization error.
Table 8 Comparison of 3D vehicle dimension prediction errors of different methods

| Method | Length/m | Width/m | Height/m |
| --- | --- | --- | --- |
| 3DOP [10] | 0.504 | 0.094 | 0.107 |
| Mono3D [34] | 0.582 | 0.103 | 0.172 |
| MonoGRNet [13] | 0.412 | 0.084 | 0.084 |
| MonoGRK [24] | 0.403 | 0.091 | 0.101 |
| Ours | 0.137 | 0.031 | 0.030 |
Table 9 Ablation study with different modules in CenterLoc3D on SVLD-3D validation and test set

| Model | Weighted-Fusion | Reprojection | IoU | \(A{P_{3D}}(IoU > 0.7)\) [val/test] | FPS | Improvement of \(A{P_{3D}}\) |
| --- | --- | --- | --- | --- | --- | --- |
| \({M_\textrm{base}}\) |  |  |  | 52.52 / 36.20 | 46.73 |  |
| \({M_1}\) | \(\checkmark \) |  |  | 57.07 / 42.84 | 43.23 | 6.64 |
| \({M_2}\) | \(\checkmark \) | \(\checkmark \) |  | 68.38 / 45.32 | 41.31 | 2.48 |
| \({M_3}\) | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | 79.36 / 51.30 | 41.18 | 5.98 |

Conclusion

Through experimental validation, CenterLoc3D achieves good performance on 3D vehicle detection, localization, and dimension prediction for roadside surveillance cameras, with \(A{P_{3D}}\) of 51.30%, average 3D localization precision of 98%, average 3D dimension precision of 85% and real-time performance with FPS of 41.18. Our contributions are as follows: (1) A 3D vehicle localization network CenterLoc3D for roadside monocular cameras is proposed, which can directly obtain 3D bounding boxes and 3D dimensions without leveraging 2D detectors. (2) A weighted-fusion strategy is proposed, which can effectively enhance feature extraction and improve generalization. (3) A loss function with constraints of camera calibration and vehicle IoU is embedded in CenterLoc3D, which reduces 3D vehicle localization error. In addition, we also propose a benchmark including a dataset, an annotation tool, and evaluation metrics, which provides a data basis for experimental validation.
However, CenterLoc3D still needs to be improved for practical and advanced applications. When the camera pan angle is close to \(0^{\circ }\), the features along the vehicle length direction are incomplete, leading to an increase in 3D vehicle localization error. In future work, the dataset needs to be further expanded to contain scenes with more views and more types of vehicles. Meanwhile, more effective feature extraction modules and loss functions need to be designed to further improve 3D vehicle localization precision in roadside monocular traffic scenes.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
8.
13. Qin Z, Wang J, Lu Y (2019) MonoGRNet: a geometric reasoning network for 3D object localization. In: The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)
15. Liu Z, Wu Z, Tóth R (2020) SMOKE: single-stage monocular 3D object detection via keypoint estimation. arXiv preprint arXiv:2002.10111
16. Ding M, Huo Y, Yi H, Wang Z, Shi J, Lu Z, Luo P (2020) Learning depth-guided convolutions for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 11672–11681
17. Zhang Y, Lu J, Zhou J (2021) Objects are different: flexible monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 3289–3298
18. Wang T, Zhu X, Pang J, Lin D (2021) FCOS3D: fully convolutional one-stage monocular 3D object detection. arXiv preprint arXiv:2104.10956
20. Chabot F, Chaouch M, Rabarisoa J, Teulière C, Chateau T (2017) Deep MANTA: a coarse-to-fine many-task network for joint 2D and 3D vehicle analysis from monocular image. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1827–1836. https://doi.org/10.1109/CVPR.2017.198
24. Barabanau I, Artemov A, Burnaev E, Murashkin V (2020) Monocular 3D object detection via geometric reasoning on keypoints. In: 15th International Conference on Computer Vision Theory and Applications
27.
45. Sun P, Kretzschmar H, Dotiwalla X, Chouard A, Patnaik V, Tsui P, Guo J, Zhou Y, Chai Y, Caine B, Vasudevan V, Han W, Ngiam J, Zhao H, Timofeev A, Ettinger S, Krivokon M, Gao A, Joshi A, Zhang Y, Shlens J, Chen Z, Anguelov D (2020) Scalability in perception for autonomous driving: Waymo open dataset. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 2443–2451. https://doi.org/10.1109/CVPR42600.2020.00252
46.
Metadata
Title: CenterLoc3D: monocular 3D vehicle localization network for roadside surveillance cameras
Authors: Xinyao Tang, Wei Wang, Huansheng Song, Chunhui Zhao
Publication date: 03.01.2023
Publisher: Springer International Publishing
Published in: Complex & Intelligent Systems / Issue 4/2023
Print ISSN: 2199-4536
Electronic ISSN: 2198-6053
DOI: https://doi.org/10.1007/s40747-022-00962-9
