1 Introduction
We perform our study on the YOLOv3 model. Our contributions are as follows:
- We review the concepts of meta classification and regression, meta fusion, and confidence calibration. We explain how they serve as a general benchmark for evaluating the predictive power of any uncertainty metric developed for object detection.
- We compare baseline uncertainty measures such as the DNN’s score, well-established ones like Monte-Carlo dropout [Gal17], output-based uncertainty metrics [SKR20], and gradient-based uncertainty metrics [RRSG21] from inside the DNN. We compare not only their standalone performance, but also analyze their mutual information and how much performance each adds to the prediction of the network itself.
2 Related Work
3 Methods
3.1 Uncertainty Quantification Protocols
3.2 Deep Object Detection Frameworks
- Four localization variables, e.g., \(\hat{\boldsymbol{\xi }}^j = (\hat{c}_{\min }^j, \hat{r}_{\min }^j, \hat{c}_{\max }^j, \hat{r}_{\max }^j) \in \mathbb {R}^4\) (top-left and bottom-right corner coordinates),
- Confidence score \(\hat{s}^j \in \mathcal {S} = (0, 1)\) indicating the probability of an object existing at \(\hat{\boldsymbol{\xi }}^j\), and
- Class probability distribution \(\hat{\mathbf {p}}^j = (\hat{p}^j_1, \ldots , \hat{p}^j_{N_\mathrm {C}}) \in \mathbb {I}^{N_\mathrm {C}}\).
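The per-box output structure above can be sketched as a small container with range checks; the names `BoxPrediction` and `make_prediction` are illustrative and not taken from any particular implementation:

```python
from typing import List, NamedTuple


class BoxPrediction(NamedTuple):
    """One predicted box as described in Sect. 3.2 (illustrative names)."""
    xi: List[float]  # (c_min, r_min, c_max, r_max): top-left / bottom-right corners
    s: float         # confidence score in the open interval (0, 1)
    p: List[float]   # class probabilities, length N_C


def make_prediction(xi, s, p):
    """Assemble one predicted box, checking the ranges stated above."""
    assert len(xi) == 4, "four localization variables"
    assert 0.0 < s < 1.0, "score lies in (0, 1)"
    assert all(0.0 <= q <= 1.0 for q in p), "class probabilities lie in [0, 1]"
    return BoxPrediction(list(xi), float(s), list(p))
```

Note that the class probabilities are not required to sum to one here; as discussed below, their sum is itself used as an uncertainty metric when they are not normalized.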
3.3 Output-Based Uncertainty: MetaDetect
- the number of candidate boxes \(N^{(j)} \ge 1\) that belong to \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) (i.e., \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) belongs to itself; one metric),
- the predicted box \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) itself, i.e., the values of the tuple
$$\begin{aligned} \left( \hat{c}_{\min }^j, \hat{r}_{\min }^j, \hat{c}_{\max }^j, \hat{r}_{\max }^j, \hat{s}^j, \hat{p}^j_1, \ldots , \hat{p}^j_{N_\mathrm {C}}\right) \in \mathbb {R}^{4} \times \mathcal {S} \times \mathbb {I}^{N_\mathrm {C}}, \end{aligned}$$(7)
as well as \(\sum _{i \in \mathcal {N}_\mathrm {C}} \hat{p}_i^j \in \mathbb {R}\) whenever class probabilities are not normalized (\(6+N_\mathrm {C}\) metrics),
- size \(d=(\hat{r}_{\max }^j-\hat{r}_{\min }^j)\cdot (\hat{c}_{\max }^j-\hat{c}_{\min }^j)\) and circumference \(g=2\cdot (\hat{r}_{\max }^j-\hat{r}_{\min }^j)+2\cdot (\hat{c}_{\max }^j-\hat{c}_{\min }^j)\) (two metrics),
- \( IoU ^j_{ pb }\): the \( IoU \) of \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) and the box with the second highest score that was suppressed by \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\). This value is zero if there are no boxes corresponding to \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) suppressed by the NMS (i.e., \(N^{(j)}=1\); one metric),
- the minimum, maximum, arithmetic mean, and standard deviation for \((\hat{r}_{\min }^j,\hat{r}_{\max }^j,\hat{c}_{\min }^j,\hat{c}_{\max }^j,\hat{s}^j)\), size d and circumference g from \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) and all the filtered candidate boxes that were discarded from \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) in the NMS (\(4 \times 7\) metrics),
- the minimum, maximum, arithmetic mean, and standard deviation for the \( IoU \) of \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) and all the candidate boxes corresponding to \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) that were suppressed in the NMS (four metrics),
- relative sizes \(rd=d/g\), \(rd_{\max }=d/g_{\min }\), \(rd_{\min }=d/g_{\max }\), \(rd_{\mathrm {mean}}=d/g_{\mathrm {mean}}\), and \(rd_{\mathrm {std}}=d/g_{\mathrm {std}}\) (five metrics),
- the maximal \( IoU \) of \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) and all ground truth boxes in \(\overline{\mathbf {y}}\); this is not an input to a meta model but serves as the ground truth provided to the respective loss function.
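A few of the metrics above can be sketched for a single kept box and its candidate cluster. This is a simplified illustration, not the authors' code: it covers only \(N^{(j)}\), size, circumference, relative size, and the \(IoU\) statistics of the suppressed candidates (computing \(IoU^j_{pb}\) would additionally require the candidates' scores):

```python
import statistics


def iou(a, b):
    """IoU of two boxes given as (c_min, r_min, c_max, r_max)."""
    cw = min(a[2], b[2]) - max(a[0], b[0])
    rh = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(cw, 0.0) * max(rh, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def metadetect_subset(box, cluster):
    """A subset of the MetaDetect metrics for one kept box `box` and its
    candidate cluster `cluster` (the box itself plus all candidates that
    the NMS suppressed in its favor)."""
    c_min, r_min, c_max, r_max = box
    d = (r_max - r_min) * (c_max - c_min)           # size
    g = 2 * (r_max - r_min) + 2 * (c_max - c_min)   # circumference
    ious = [iou(box, b) for b in cluster if b != box]
    return {
        "N": len(cluster),               # number of candidate boxes N^(j)
        "d": d, "g": g, "rd": d / g,     # size, circumference, relative size
        "iou_min": min(ious, default=0.0),
        "iou_max": max(ious, default=0.0),
        "iou_mean": statistics.fmean(ious) if ious else 0.0,
        "iou_std": statistics.pstdev(ious) if ious else 0.0,
    }
```

The dictionary keys are ad-hoc; in the benchmark these values simply form one row of the feature matrix handed to the meta classification and meta regression models.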
3.4 Gradient-Based Uncertainty for Object Detection
4 Experimental Setup
4.1 Databases, Models, and Metrics
Dataset | \(\mathcal {D}^\mathrm {VOC12}_\mathrm {train}\) | \(\mathcal {D}^\mathrm {VOC12}_\mathrm {test}\) | \(\mathcal {D}^\mathrm {COCO17}_\mathrm {train}\) | \(\mathcal {D}^\mathrm {COCO17}_\mathrm {val}\) | \(\mathcal {D}^\mathrm {KITTI}_\mathrm {train}\) | \(\mathcal {D}^\mathrm {KITTI}_\mathrm {eval}\) |
---|---|---|---|---|---|---|
Size | 14,805 | 4,952 | 118,287 | 5,002 | 5,481 | 2,000 |
We use a YOLOv3 [RF18] re-implementation in PyTorch. We have trained our model from scratch on each of the three training splits under dropout with a probability of 0.5 between the last and the second-to-last convolution layer in each of the regression heads. As meta classification and meta regression models, we employ the gradient boosting models in [CG16] with standard settings.
4.2 Implementation Details
Dropout is applied in the YOLOv3 detection heads: it is active during training with a rate of 0.5 and also during MC inference, where we take standard deviations over 30 dropout samples for each of the \(4 + 1 + N_\mathrm {C}\) instance features of all output boxes. The MetaDetect metrics introduced in Sect. 3.3 are computed from a score threshold of \(\varepsilon _s = 0.0\), as it has been found in [SKR20] that lower thresholds lead to better performance in meta classification and meta regression. Gradient uncertainty metrics are computed for the YOLOv3 architecture at the same candidate score threshold as for the MetaDetect metrics. As there are three detection heads, corresponding to a \(76 \times 76\) (“S”), a \(38 \times 38\) (“M”), and a \(19 \times 19\) (“L”) cell grid, we also compute gradient uncertainty metrics individually for each detection head. We use this distinction in our notation and, for example, indicate the set of parameters from the last layer (\(T\)) of the detection head producing the \(76 \times 76\) cell grid (“S”) by \(\boldsymbol{\theta }(T, \mathrm {S})\). Moreover, as indicated in Sect. 3.4, we also exploit the split of the loss function in (5). Each of the computed \(2 \times 3 \times 3 = 18\) gradients per box results in the 6 uncertainty metrics presented in (9), giving a total of 108 gradient uncertainty metrics per bounding box. Due to the resulting computational expense, we only compute gradient metrics for output boxes with score values \(\hat{s} \ge 10^{-4}\) and regard only those boxes in all of the following experiments.
4.3 Experimental Setup and Results
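The per-gradient metric computation of Sect. 4.2 can be sketched as follows. Which six scalar metrics (9) actually defines is assumed here to be the 1-norm, 2-norm, minimum, maximum, mean, and standard deviation of the gradient entries, consistent with the names \(m_1\), \(m_2\), \(m_{\max }\), \(m_{\mathrm {mean}}\), and \(m_{\mathrm {std}}\) appearing in the result tables:

```python
import math
import statistics


def gradient_metrics(grad):
    """Six scalar metrics of one flattened gradient vector (assumed set,
    see lead-in): 1-norm, 2-norm, min, max, mean, standard deviation."""
    return {
        "m_1": sum(abs(v) for v in grad),
        "m_2": math.sqrt(sum(v * v for v in grad)),
        "m_min": min(grad),
        "m_max": max(grad),
        "m_mean": statistics.fmean(grad),
        "m_std": statistics.pstdev(grad),
    }


def all_gradient_metrics(grads):
    """108 metrics per box from the 18 gradients: 2 layers (T, T-1)
    x 3 loss parts (J_xi, J_s, J_p) x 3 heads (S, M, L)."""
    return {key + (name,): val
            for key, g in grads.items()
            for name, val in gradient_metrics(g).items()}
```

Here `grads` maps a (layer, loss part, head) triple to a flattened gradient, mirroring the notation \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {S})}(J_p)\) used below.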
VOC \( AuROC \) | Metric | VOC \( AuPR \) | Metric | COCO \( AuROC \) | Metric | COCO \( AuPR \) | Metric | KITTI \( AuROC \) | Metric | KITTI \( AuPR \) | Metric |
---|---|---|---|---|---|---|---|---|---|---|---|
0.917 | \(\mathrm {std}_\mathrm {MC}(\hat{s})\) | 0.658 | \(\hat{s}\) | 0.830 | \(\hat{s}\) | 0.623 | \(\hat{s}\) | 0.966 | \(\hat{s}\) | 0.969 | \(\hat{s}\) |
0.959 | \(m_\mathrm {mean}^{\boldsymbol{\theta }(T, \mathrm {S})}(J_s)\) | 0.740 | \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {S})}(J_p)\) | 0.875 | \(\mathrm {std}_\mathrm {MC}(\hat{s})\) | 0.664 | \(\mathrm {std}_\mathrm {MC}(\hat{s})\) | 0.978 | \(\hat{c}_{\min }\) | 0.976 | \(\hat{c}_{\min }\) |
0.966 | \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {S})}(J_p)\) | 0.769 | \(\mathrm {std}_\mathrm {MC}(\hat{s})\) | 0.886 | \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {L})}(J_p)\) | 0.685 | \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {S})}(J_p)\) | 0.981 | \(\mathrm {std}_\mathrm {MC}(\hat{s})\) | 0.980 | \(m_{1}^{\boldsymbol{\theta }(T-1, \mathrm {L})}(J_p)\) |
0.971 | \(m_{\mathrm {std}}^{\boldsymbol{\theta }(T, \mathrm {M})}(J_p)\) | 0.784 | \(m_{\mathrm {mean}}^{\boldsymbol{\theta }(T, \mathrm {M})}(J_p)\) | 0.893 | \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {S})}(J_p)\) | 0.704 | \(m_{\max }^{\boldsymbol{\theta }(T-1, \mathrm {M})}(J_p)\) | 0.983 | \(m_{\mathrm {std}}^{\boldsymbol{\theta }(T-1, \mathrm {L})}(J_p)\) | 0.981 | \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {M})}(J_p)\) |
0.972 | \(\hat{c}_{\min }\) | 0.790 | \(m_{\mathrm {mean}}^{\boldsymbol{\theta }(T-1, \mathrm {S})}(J_\xi )\) | 0.899 | \(m_{\max }^{\boldsymbol{\theta }(T-1, \mathrm {M})}(J_p)\) | 0.710 | \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {L})}(J_p)\) | 0.984 | \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {M})}(J_\xi )\) | 0.982 | \(\hat{r}_{\max }\) |
0.974 | \(\hat{s}\) | 0.795 | \(\hat{c}_{\min }\) | 0.900 | \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {M})}(J_p)\) | 0.713 | \(\mathrm {std}_\mathrm {MC}(\hat{c}_{\min })\) | 0.985 | \(\hat{r}_{\max }\) | 0.983 | \(\sum _i \hat{p}_i\) |
0.975 | \(m_{\mathrm {mean}}^{\boldsymbol{\theta }(T, \mathrm {S})}(J_\xi )\) | 0.799 | \(\mathrm {std}_\mathrm {MC}(\hat{c}_{\min })\) | 0.901 | \(m_{\max }^{\boldsymbol{\theta }(T-1, \mathrm {L})}(J_s)\) | 0.715 | \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {M})}(J_p)\) | 0.985 | \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {S})}(J_p)\) | 0.984 | \(\mathrm {std}_\mathrm {MC}(\hat{s})\) |
0.975 | \(\hat{r}_{\min }\) | 0.802 | \(m_{\mathrm {mean}}^{\boldsymbol{\theta }(T, \mathrm {S})}(J_\xi )\) | 0.902 | \(\mathrm {std}_\mathrm {MC}(\hat{c}_{\min })\) | 0.716 | \(\mathrm {std}_\mathrm {MC}(\hat{r}_{\max })\) | 0.986 | \(\hat{p}_1\) | 0.984 | \(m_{2}^{\boldsymbol{\theta }(T, \mathrm {S})}(J_\xi )\) |
0.976 | \(\hat{c}_{\max }\) | 0.803 | \(m_{\max }^{\boldsymbol{\theta }(T-1, \mathrm {M})}(J_p)\) | 0.902 | \(m_{\mathrm {mean}}^{\boldsymbol{\theta }(T, \mathrm {S})}(J_s)\) | 0.717 | \(m_{2}^{\boldsymbol{\theta }(T, \mathrm {S})}(J_p)\) | 0.986 | \(m_{\mathrm {mean}}^{\boldsymbol{\theta }(T-1, \mathrm {L})}(J_\xi )\) | 0.984 | \(m_{\mathrm {mean}}^{\boldsymbol{\theta }(T-1, \mathrm {L})}(J_\xi )\) |
0.976 | MD+G+MC | 0.803 | MD+G+MC | 0.904 | MD+G+MC | 0.720 | MD+G+MC | 0.986 | MD+G+MC | 0.984 | MD+G+MC |
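The \( AuROC \) values above measure how well a single metric, or the fused meta classifier MD+G+MC in the last row, separates correct from incorrect predictions when used as a score. A minimal pure-Python sketch of the computation via the rank-sum formulation (in practice one would use, e.g., `sklearn.metrics.roc_auc_score`):

```python
def auroc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U)
    formulation: the probability that a randomly drawn positive example
    scores higher than a randomly drawn negative one (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative examples")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect separation yields 1.0, a score carrying no information about the labels yields about 0.5, which is why values close to 1.0 in the table indicate strong meta classification performance.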