Open Access 2022 | OriginalPaper | Chapter

Uncertainty Quantification for Object Detection: Output- and Gradient-Based Approaches

Authors : Tobias Riedlinger, Marius Schubert, Karsten Kahl, Matthias Rottmann

Published in: Deep Neural Networks and Data for Automated Driving

Publisher: Springer International Publishing


Abstract

Safety-critical applications of deep neural networks require reliable confidence estimation methods with high predictive power. However, evaluating and comparing different methods for uncertainty quantification is oftentimes highly context-dependent. In this chapter, we introduce flexible evaluation protocols which are applicable to a wide range of tasks with an emphasis on object detection. In this light, we investigate uncertainty metrics based on the network output, as well as metrics based on a learning gradient, both of which significantly outperform the confidence score of the network. While output-based uncertainty is produced by post-processing steps and is computationally efficient, gradient-based uncertainty, in principle, allows for localization of uncertainty within the network architecture. We show that both sources of uncertainty are mutually non-redundant and can be combined beneficially. Furthermore, we show direct applications of uncertainty quantification by improving detection accuracy.

1 Introduction

Deep artificial neural networks (DNNs) employed for tasks such as object detection or semantic segmentation yield by construction probabilistic predictions on feature data, such as camera images. Modern deep object detectors [RHGS15, LAE+16, LGG+17, RF18] predict bounding boxes for depicted objects of a given set of learned classes. The so-called “score” (sometimes also called objectness or confidence score) indicates the probability that a bounding box provided by the DNN contains an object. Throughout this chapter, we sometimes use the term “confidence” instead of “score” to refer to quantities that represent the probability of a detection being correct. The term confidence in a broader sense is meant to reflect an estimated degree of competency of the model when evaluating an input.
Reliable and accurate probabilistic predictions are desirable for many applications, such as medical diagnosis or automated driving. It is well-known that predictions of DNNs often tend to be statistically miscalibrated [SZS+14, GSS15, GPSW17], i.e., the score computed by the DNN is not representative for the relative frequency of correct predictions. For an illustration of this, see Fig. 1, where we compare different sources of uncertainty in terms of their accuracy conditioned to confidence bins for a YOLOv3 model.
For individual DNN predictions, confidences that are not well-calibrated are misleading and constitute a reliability issue. Over-confident predictions might cause inoperability of a perception system due to producing non-existent instances (false positives / FP). Perhaps even more adversely, under-confidence might lead to false negative (FN) predictions, for instance overlooking pedestrians or other road users, which could lead to dangerous situations or even fatalities. Given that the set of available data features is fixed, the sources of uncertainty in machine learning (and deep learning) can be divided into two types [HW21], referring to their primary source [Gal17, KG17]. Whereas aleatoric uncertainty refers to the inherent and irreducible stochasticity of the distribution the data stems from, epistemic uncertainty refers to the reducible part of the uncertainty. The latter originates from the finite size of a random data sample used for training, as well as from the choice of model and learning algorithm.
Up to now, only a moderate number of established methods in the field of deep learning have been demonstrated to properly estimate the uncertainty of DNNs. In [KG17], DNNs for semantic segmentation are equipped with additional regression outputs that model heteroscedastic (in machine learning also called predictive) uncertainty, with the aim of capturing aleatoric uncertainty. For training, the uncertainty output is integrated into the usual empirical cross-entropy loss. A number of adaptations for different modern DNN architectures originate from that work. From a theoretical point of view, Bayesian DNNs [DL90, Mac92], in which the model weights are treated as random variables, yield an attractive framework for capturing epistemic uncertainty. In practice, the probability density functions corresponding to the model weights are estimated via Markov chain Monte-Carlo, which is currently infeasible already for a moderate number of model weights and hence inapplicable in the setting of object detection. As a remedy, variational inference approaches have been considered, where the model weights are sampled from predefined distributions, which are often modeled in simplified ways, e.g., assuming mutual independence. A standard method for the estimation of epistemic uncertainty based on variational inference is Monte-Carlo (MC) dropout [SHK+14, GG16]. Algorithmically, this method performs several forward passes under dropout at inference time. Deep ensemble sampling [LPB17] follows a similar idea. Here, separately trained models with the same architecture are deployed to produce a probabilistic inference.
In semantic segmentation, Rottmann et al. [RCH+20] introduced the tasks of meta classification and regression that are both applied as post-processing to the output of DNNs. Meta classification refers to the task of classifying a prediction as true positive (TP) or FP. Meta regression is a more object detection-specific task with the aim to predict the intersection-over-union (\( IoU \)) of a prediction with its ground truth directly. These tasks are typically learned with the help of ground truth, but afterwards performed in the absence of ground truth, only on the basis of uncertainty measures stemming from the output of the DNN. The framework developed in [RCH+20], called MetaSeg, was extended into multiple directions: Time-dynamic uncertainty quantification for semantic segmentation [MRG20] and instance segmentation [MRV+21]; a study on the influence of image resolution [RS19]; controlled reduction of FN and meta fusion, the latter utilizes the uncertainty metrics to increase the performance of the DNN [CRH+19]; out-of-distribution detection [BCR+20, ORF20, CRG21]; and active learning [CRGR21]. Last but not least, a similar approach solely based on the output of the DNN has been developed for object detection in [SKR20] (called MetaDetect), however, equipped with different uncertainty metrics specifically designed for the task of object detection. This approach was also used in an uncertainty-aware sensor fusion approach for object detection based on camera images and RaDAR point clouds; see [KRBG21].
In [ORG18], it was proposed to utilize a learning gradient in which the usually required ground truth labels are replaced by the prediction of the network itself, which is supposed to contain epistemic uncertainty information. Since their initial investigation for image classification tasks, gradient metrics have been considered in other applications as well. Gradient metrics were compared to other uncertainty quantification methods in [SFKM20], where it was found that gradient uncertainty is strongly correlated with the softmax entropy of classification networks. In natural language understanding, gradient metrics turned out to be beneficial; see [VSG19]. Therein, gradient metrics and deep ensemble uncertainty were aggregated and yielded well-calibrated confidences on out-of-distribution data. Recently, gradient metrics have been developed for object detection in [RRSG21]. In order to demonstrate the predictive power of uncertainty metrics, the framework of meta classification and regression provided by MetaDetect was used.
In this chapter, we review the works published in [SKR20, RRSG21] and put them into a common context. More precisely, our contributions are as follows:
  • We review the concepts of meta classification and regression, meta fusion, and confidence calibration. We explain how they serve as a general benchmark for evaluating the predictive power of any uncertainty metric developed for object detection.
  • We review output-based uncertainty metrics developed in [SKR20], as well as gradient-based ones from inside the network [RRSG21].
  • We compare baseline uncertainty measures such as the DNN’s score, well-established ones like Monte-Carlo dropout [Gal17], the output-based uncertainty metrics [SKR20], and finally the gradient-based uncertainty metrics [RRSG21] from inside the DNN. We do so in terms of comparing their standalone performances but also in terms of analyzing their mutual information and how much performance they add to the prediction of the network itself.

2 Related Work

In this section, we gather and discuss previous research related to the present work.
Epistemic and aleatoric uncertainty in object detection: In recent years, there has been increasing interest in methods to properly estimate or quantify uncertainty in DNNs. Aleatoric uncertainty, the type of uncertainty resulting from the data generating process, has been proposed to be captured by providing additional regression output to the network architecture for learning uncertainty directly from the training dataset [KG17]. Since the initial proposition, this approach has been adapted to, among others, the object detection setting [LDBK18, CCKL19, HZW+19, LHK+21, HSW20, CCLK20]. The latter approaches add further regression variables for aleatoric uncertainty which are assigned to their bounding box localization variables, thereby learning localization as Gaussian distributions. An alternative approach to quantifying localization uncertainty is to learn for each predicted box the corresponding \( IoU \) with additional regression variables [JLM+18].
Several methods have been proposed that aim to capture epistemic uncertainty, i.e., “reducible” kinds of uncertainty. Monte-Carlo (MC) dropout [SHK+14, GG16] employs forward passes under dropout and has become one of the standard tools for estimating epistemic uncertainty. As such, MC dropout has been investigated also for deep object detection [MNDS18, KD19, HSW20]. Similarly to MC dropout, deep ensemble methods [LPB17] employ separately trained networks (“experts”) to obtain variational inference, an approach which also has found adaptations in deep object detection [LGRB20].
Uncertainty calibration: A large variety of methods have been proposed to rectify the intrinsic confidence miscalibration of modern deep neural networks by introducing post-processing methods (see, e.g., [GPSW17] for a comparison). Prominent examples of such methods include histogram binning [ZE02], isotonic regression [ZE01], Bayesian binning [NCH15], Platt scaling [Pla99, NMC05], or temperature scaling [HVD14]. Alternatively, calibrated confidences have been obtained by implementing calibration as an optimization objective [KP20] via a suitable loss term. In the realm of object detection, the authors of [NZV18] found that temperature scaling improves calibration. Also, natural extensions to make calibration localization-dependent have been proposed [KKSH20].
Meta classification and meta regression: Meta classification denotes the task of discriminating TP from FP predictions based on uncertainty or confidence information, an idea that was initially explored in [HG17]. In case a real-valued metric (e.g., different versions of the \( IoU \), which is in [0, 1]) can be assigned to the quality of a prediction, meta regression denotes the task of predicting this quantity in the same spirit as meta classification. In contrast to the IoUNet framework introduced in [JLM+18], meta classification and meta regression do not require architectural changes or dedicated training of the respective DNN. In [RRSG21], it has been found that meta classifiers tend to yield well-calibrated confidences by default, as opposed to the confidence score of DNNs. Applications of meta classification and meta regression have since been introduced for several different disciplines [CRH+19, RS19, MRG20, RMC+20, RCH+20, MRV+21, KRBG21], including semantic image segmentation, video instance segmentation, and object detection. In all of these applications, the central idea is to use a simple classification model to learn the map between uncertainty and the binary labels TP/FP. A related idea exploiting meta classification was introduced as “meta fusion” in [CRH+19]. The ability to discriminate TPs from FPs allows for an increase in the neural network’s sensitivity; the resulting additional FPs are subsequently detected by a meta classifier, decreasing the total number of errors made.

3 Methods

3.1 Uncertainty Quantification Protocols

Various methods for uncertainty or confidence estimation exist. While confidence scores are directly produced by a DNN, other uncertainty-related quantities, such as the softmax margin or entropy, can oftentimes be generated from the network output. In variational inference (e.g., MC dropout, deep ensembles, or batch norm uncertainty), several predictions are made for a single network input; their variance or standard deviation is then interpreted as a measure of prediction uncertainty. Oftentimes, a common evaluation strategy for different kinds of uncertainty is missing and the comparison between methods is highly context-dependent. Moreover, different methods to estimate uncertainty may produce mutually redundant information, a redundancy which, for lack of comparability, cannot be established experimentally. We propose uncertainty aggregation methods as a unifying strategy for different tasks.
Meta classification: In classification tasks, an input is categorized as one class out of a finite number of classes. Evaluation usually takes a binary value of either true or false. Meta classification can then be understood as the task of predicting the label true or false from uncertainty metrics. To this end, a lightweight classification model (e.g., logistic regression, a shallow neural network, or a gradient boosting classifier) is fitted to the training dataset, where uncertainty metrics can be computed and the actual, true label for each sample can be established. This can be regarded as learning a binary decision rule based on uncertainty information. On an evaluation dataset, such a meta classification model can be evaluated in terms of classification metrics like the area under the receiver operating characteristic \( AuROC \) or the area under precision-recall curve \( AuPR \) [DG06]. Different sources of uncertainty information, thus, serve as co-variables for meta classification models. Their performance can be regarded as a unified evaluation of confidence information. Network-intrinsic confidence scores naturally fit into this framework and can be either directly evaluated as any meta classification model or also serve as a single co-variable for such a model. In particular, different sources can be combined by fitting a meta classification model on the combined set of uncertainty metrics. This allows for the investigation of mutual redundancy between uncertainty information. Meta classification can be applied to any setting where a prediction can be marked as true or false and has served as a baseline for image classification [HG17], but also the tasks of semantic segmentation (“MetaSeg”) [RCH+20] and object detection (“MetaDetect”) [SKR20] can be subject to meta classification. In semantic segmentation, segments can be classified as true or false based on their intersection-over-union (\( IoU \)) with the ground truth being above a certain threshold. Similarly, bounding boxes are ordinarily declared as true or false predictions based on their maximum \( IoU \) with any ground truth box (see Sect. 3.2).
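To make the protocol concrete, the following sketch fits a gradient boosting meta classifier on per-box uncertainty metrics and evaluates it with \( AuROC \) and \( AuPR \). It is an illustrative sketch only: the arrays `metrics` and `iou` are hypothetical placeholders for per-box uncertainty metrics and the maximum \( IoU \) with the ground truth, and scikit-learn's gradient boosting stands in for the gradient boosting model of [CG16] used later in the chapter.

```python
# Minimal meta classification sketch on synthetic placeholder data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
metrics = rng.random((2000, 10))                 # one row per predicted box, one column per uncertainty metric
iou = np.clip(metrics[:, 0] + 0.1 * rng.standard_normal(2000), 0.0, 1.0)  # toy stand-in for the true max IoU
labels = (iou > 0.5).astype(int)                 # TP (1) vs. FP (0) labels for meta classification

X_train, X_test, y_train, y_test = train_test_split(metrics, labels, test_size=0.5, random_state=0)

meta_clf = GradientBoostingClassifier().fit(X_train, y_train)
p_hat = meta_clf.predict_proba(X_test)[:, 1]     # meta classification confidence per box

print("AuROC:", roc_auc_score(y_test, p_hat))
print("AuPR :", average_precision_score(y_test, p_hat))
```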
Meta fusion: Obtaining well-performing meta classifiers allows for their incorporation as decision rules (replacing the standard score threshold) into the original prediction pipeline. Doing so is especially useful in settings where the DNN prediction is based on positive (foreground/detection) and negative (background) predictions, such as semantic segmentation and object detection. Implementing meta classifiers into such a framework is called meta fusion, whereby the improved detection of false predictions over the DNN confidence is complemented by an increased prediction sensitivity of the DNN (see Fig. 2). Greater sensitivity tends to reduce false negative (FN) predictions and increase true positive (TP) and false positive (FP) predictions. The meta classifier’s task is then to identify the additional FPs as such and filter them from the prediction, which leads to a net increase in prediction performance of the DNN combined with the meta classifier over the standalone DNN baseline. Such an increase can be evaluated using metrics usually employed to assess the performance of a DNN in the respective setting, e.g., mean \( IoU \) in semantic segmentation [CRH+19] or mean average precision (\( mAP \)) in object detection [RRSG21].
Calibration: Confidence estimates with high predictive power (as measured by meta classification) have useful applications, e.g., in meta fusion, where detection accuracy can be improved by exploiting uncertainty information. In addition to this advantage, there is another aspect to the probabilistic estimation of whether the corresponding prediction is correct or not. Confidence calibration gauges the accuracy of the frequentist interpretation of confidences: for example, out of 100 examples each with a confidence of about \(\tfrac{1}{4}\), we statistically expect about 25 to be true predictions.
This is often investigated by dividing the confidence range into bins of equal width and computing the accuracy of predictions conditioned to the respective bin (see Fig. 3). For calibrated confidences, the resulting distribution should then be close to the ideally calibrated diagonal. Meta classification is closely related to post-processing calibration methods in that a map is learned on a validation dataset which yields confidences with improved calibration. In particular, meta classification with the confidence score as the only co-variable is (up to the explicit optimization objective) isotonic regression as introduced in [ZE02].
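The following sketch computes binned calibration errors of this kind: the maximum (\( MCE \)) and the average (\( ACE \)) absolute deviation between bin accuracy and bin confidence over non-empty bins. The arrays `conf` and `correct` are hypothetical per-box confidences and binary TP labels, so this illustrates the binning logic rather than the chapter's exact implementation.

```python
# Sketch of binned calibration errors (MCE / ACE) on synthetic placeholder data.
import numpy as np

def calibration_errors(conf, correct, n_bins=10):
    # assign each confidence to a bin of width 1/n_bins
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    gaps = []
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue                                    # empty bins do not contribute
        gaps.append(abs(correct[mask].mean() - conf[mask].mean()))
    return max(gaps), float(np.mean(gaps))              # (MCE, ACE)

rng = np.random.default_rng(0)
conf = rng.random(5000)
correct = (rng.random(5000) < conf).astype(float)       # roughly calibrated toy labels
mce, ace = calibration_errors(conf, correct)
print(f"MCE = {mce:.3f}, ACE = {ace:.3f}")
```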
Meta regression: In a regression task, a continuous variable (e.g., in \(\mathbb {R}\)) is estimated given an input. By the same logic as in meta classification, whenever the quality of the prediction of a DNN can be expressed as a real number, we may fit a simple regression model (e.g., linear regression, a shallow neural network, or a gradient boosting regression model) on a training dataset where both uncertainty metrics and the actual prediction quality can be computed. This is called a meta regression model, which maps uncertainty information to an (e.g., \(\mathbb {R}\)-valued) estimate of the prediction quality of the DNN (see Fig. 4). The relationship between actual and predicted quality can, again, be evaluated on a dedicated dataset in terms of the coefficient of determination \(R^2\), a well-established quality metric for regression models. Meta regression shares many features with meta classification as an evaluation protocol, perhaps with the exception that network confidence scores fit less directly into the meta regression logic. However, a meta regression model based on such a score still follows the same design. We note that fitting non-linear regression models (e.g., a neural network or a gradient boosting model as opposed to a linear regression) on one variable tends not to improve meta regression performance significantly. The \( IoU \) or modifications thereof in semantic segmentation [RCH+20] or object detection [SKR20] are suitable quality measures to be used in meta regression.
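As a minimal, hedged sketch of meta regression (not the chapter's implementation), the snippet below fits a scikit-learn gradient boosting regressor on synthetic placeholder metrics to predict the \( IoU \) directly and reports \(R^2\); `metrics` and `iou` are hypothetical stand-ins for per-box uncertainty metrics and true \( IoU \) values.

```python
# Meta regression sketch: predict the IoU of each box from its uncertainty
# metrics and evaluate with R^2 (synthetic placeholder data).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
metrics = rng.random((2000, 10))                 # per-box uncertainty metrics (placeholder)
iou = np.clip(metrics[:, 0] + 0.1 * rng.standard_normal(2000), 0.0, 1.0)  # toy "true" IoU

meta_reg = GradientBoostingRegressor().fit(metrics[:1000], iou[:1000])
print("R^2:", r2_score(iou[1000:], meta_reg.predict(metrics[1000:])))
```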

3.2 Deep Object Detection Frameworks

We develop output- and gradient-based uncertainty metrics for the task of 2D bounding box (“object”) detection on camera images \(\mathbf {x} \in \mathbb {I}^{H \times W \times C}\), with \(\mathbb {I} = [0, 1]\). A DNN tailored to this task usually produces a fixed number \(N_{\mathrm {out}} \in \mathbb {N}\) of output boxes
$$\begin{aligned} \mathfrak {O}(\boldsymbol{x}; \boldsymbol{\theta }) = \left( \mathfrak {O}^1(\boldsymbol{x}; \boldsymbol{\theta }), \ldots , \mathfrak {O}^{N_{\mathrm {out}}}(\boldsymbol{x}; \boldsymbol{\theta })\right) \in (\mathbb {R}^4 \times \mathcal {S} \times \mathbb {I}^{N_\mathrm {C}})^{N_{\mathrm {out}}}, \end{aligned}$$
(1)
where \(\boldsymbol{\theta } \in \mathbb {R}^{N_{\mathrm {DNN}}}\) denotes the weights of the respective DNN architecture (\(N_{\mathrm {DNN}} \in \mathbb {N}\) being their total number) and \(N_\mathrm {C} \in \mathbb {N}\) is the fixed number of classes (the set of classes is denoted by \(\mathcal {N}_\mathrm {C} = \{1, \ldots , N_\mathrm {C}\}\)) to be learned. Each output box \(\mathfrak {O}^j(\mathbf {x}; \boldsymbol{\theta }) = (\hat{\boldsymbol{\xi }}^j, \hat{s}^j, \hat{\mathbf {p}}^j) \in \mathbb {R}^4 \times \mathcal {S} \times \mathbb {I}^{N_\mathrm {C}}\) for \(j \in \mathcal {N}_{\mathrm {out}} = \{1, \ldots , N_{\mathrm {out}}\}\) is encoded by
  • Four localization variables, e.g., \(\hat{\boldsymbol{\xi }}^j = (\hat{c}_{\min }^j, \hat{r}_{\min }^j, \hat{c}_{\max }^j, \hat{r}_{\max }^j) \in \mathbb {R}^4\) (top-left and bottom-right corner coordinates),
  • Confidence score \(\hat{s}^j \in \mathcal {S} = (0, 1)\) indicating the probability of an object existing at \(\hat{\boldsymbol{\xi }}^j\), and
  • Class probability distribution \(\hat{\mathbf {p}}^j = (\hat{p}^j_1, \ldots , \hat{p}^j_{N_\mathrm {C}}) \in \mathbb {I}^{N_\mathrm {C}}\).
The predicted class of \(\mathfrak {O}^j(\mathbf {x}; \boldsymbol{\theta })\) is determined to be \(\hat{\kappa }^j = \mathrm {arg\, max}_{k \in \mathcal {N}_\mathrm {C}} \, \hat{p}_k^j\). The \(N_{\mathrm {out}}\) boxes subsequently undergo filtering mechanisms resulting in \(|\mathcal {N}_{\mathbf {x}}| = \hat{N}_{\mathbf {x}}\) detected instances \(\mathcal {N}_{\mathbf {x}} \subseteq \mathcal {N}_{\mathrm {out}}\)
$$\begin{aligned} \hat{\mathbf {y}} = \left( \mathfrak {B}^1(\mathbf {x}; \boldsymbol{\theta }), \ldots , \mathfrak {B}^{\hat{N}_{\mathbf {x}}}(\mathbf {x}; \boldsymbol{\theta })\right) \in \mathbb {R}^{\hat{N}_{\mathbf {x}} \times (4 + 1 + N_{\mathrm {C}})}. \end{aligned}$$
(2)
Commonly, the much smaller number \(\hat{N}_{\mathbf {x}} \ll N_{\mathrm {out}}\) of predicted boxes is the result of two mechanisms. First, score thresholding, i.e., discarding any \(\mathfrak {O}^j(\mathbf {x}; \boldsymbol{\theta })\) with \(\hat{s}^j < \varepsilon _s\) for some fixed threshold \(\varepsilon _s > 0\), is performed as an initial distinction between “foreground” and “background” boxes. Afterward, the non-maximum suppression (NMS) algorithm reduces boxes with the same class and significant mutual overlap to only one box. This way, for boxes that likely indicate the same object in \(\mathbf {x}\), there is only one representative in \(\hat{\mathbf {y}}\). By “overlap”, we usually mean the intersection-over-union (\( IoU \)) between two bounding boxes \(\hat{\boldsymbol{\xi }}^j\) and \(\hat{\boldsymbol{\xi }}^k\) which is defined as
$$\begin{aligned} IoU \left( \hat{\boldsymbol{\xi }}^j, \hat{\boldsymbol{\xi }}^k\right) = \frac{\left| \hat{\boldsymbol{\xi }}^j \,\cap \, \hat{\boldsymbol{\xi }}^k\right| }{\left| \hat{\boldsymbol{\xi }}^j \,\cup \, \hat{\boldsymbol{\xi }}^k\right| }, \end{aligned}$$
(3)
i.e., the ratio of their area of intersection and their joint area. The \( IoU \) between two boxes is always in \([0, 1]\), where 0 means no overlap and 1 means the boxes have identical localization. One also rates the localization quality of a prediction in terms of the maximum \( IoU \) it has with any ground truth box on \(\mathbf {x}\) which has the same class.
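As a small illustration, the following function computes (3) for two axis-aligned boxes given in the corner format \((c_{\min }, r_{\min }, c_{\max }, r_{\max })\) used above; it is a sketch, not taken from the chapter's code.

```python
# IoU of two axis-aligned boxes, a direct transcription of (3).
def iou(box_a, box_b):
    ca0, ra0, ca1, ra1 = box_a
    cb0, rb0, cb1, rb1 = box_b
    # intersection rectangle (may be empty)
    iw = max(0.0, min(ca1, cb1) - max(ca0, cb0))
    ih = max(0.0, min(ra1, rb1) - max(ra0, rb0))
    inter = iw * ih
    union = (ca1 - ca0) * (ra1 - ra0) + (cb1 - cb0) * (rb1 - rb0) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # 1/7 ≈ 0.143
```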
The NMS algorithm is based on what we call “candidate boxes”. An output instance \(\mathfrak {O}^k(\mathbf {x}; \boldsymbol{\theta })\) is a candidate box for another output instance \(\mathfrak {O}^j(\mathbf {x}; \boldsymbol{\theta })\) if it
1. has a sufficiently high score (\(\hat{s}^k \ge \varepsilon _s\) above a chosen, fixed threshold \(\varepsilon _s \ge 0\)),
2. has the same class as \(\mathfrak {O}^j(\mathbf {x}; \boldsymbol{\theta })\), i.e., \(\hat{\kappa }^k = \hat{\kappa }^j\), and
3. has sufficient \( IoU \) with \(\mathfrak {O}^j(\mathbf {x}; \boldsymbol{\theta })\), i.e., \( IoU (\hat{\boldsymbol{\xi }}^k, \hat{\boldsymbol{\xi }}^j) \ge \varepsilon _{ IoU }\) for some fixed threshold \(\varepsilon _{ IoU } \ge 0\) (oftentimes \(\varepsilon _{ IoU } = \tfrac{1}{2}\)).
We then denote by
$$\begin{aligned} \mathrm {cand}\left[ \mathfrak {O}^j(\mathbf {x}; \boldsymbol{\theta })\right] := \left\{ \mathfrak {O}^k(\mathbf {x}; \boldsymbol{\theta }): \mathfrak {O}^k(\mathbf {x}; \boldsymbol{\theta }) \text { is a candidate box for } \mathfrak {O}^j(\mathbf {x}; \boldsymbol{\theta })\right\} \end{aligned}$$
(4)
the set of candidate boxes for \(\mathfrak {O}^j(\mathbf {x}; \boldsymbol{\theta })\). In the NMS algorithm, all output boxes \(\mathfrak {O}(\mathbf {x}; \boldsymbol{\theta })\) are sorted according to their score in descending order. Iteratively, the box \(\mathfrak {O}^j(\mathbf {x}; \boldsymbol{\theta })\) with the highest score is selected as a prediction and \(\mathrm {cand}[\mathfrak {O}^j(\mathbf {x}; \boldsymbol{\theta })]\) is removed from the ranking of output boxes. This procedure is repeated until there are no boxes left. Note that NMS usually follows score thresholding so this stage may be reached quickly, depending on \(\varepsilon _s\).
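For illustration, the following sketch implements score thresholding followed by the greedy NMS loop described above on a hypothetical list of `(xi, score, class_id)` triples; it reuses the `iou` helper sketched earlier and is not the chapter's implementation. The thresholds `eps_s` and `eps_iou` are placeholder defaults.

```python
# Greedy NMS over output boxes, following the candidate-box definition above.
def non_maximum_suppression(boxes, eps_s=0.3, eps_iou=0.5):
    # score thresholding: keep only "foreground" boxes
    remaining = [b for b in boxes if b[1] >= eps_s]
    # sort by score in descending order
    remaining.sort(key=lambda b: b[1], reverse=True)
    predictions = []
    while remaining:
        best = remaining.pop(0)              # highest-scoring box becomes a prediction
        predictions.append(best)
        # remove its candidate boxes: same class and sufficient IoU with the prediction
        remaining = [b for b in remaining
                     if b[2] != best[2] or iou(b[0], best[0]) < eps_iou]
    return predictions
```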
DNN object detection frameworks are usually trained in a supervised manner on images with corresponding labels or ground truth. The latter usually consists of a number \(N_{\mathbf {x}}\) of ground truth instances \(\overline{\mathbf {y}} = (\overline{\mathbf {y}}^1, \ldots , \overline{\mathbf {y}}^{N_{\mathbf {x}}}) \in (\mathbb {R}^4 \times \mathcal {N}_\mathrm {C})^{N_{\mathbf {x}}}\) consisting of data similar to the output boxes, which we denote by the bar (\(\overline{\cdot }\)). In particular, each ground truth instance \(\overline{\mathbf {y}}^j\) has a specified localization \(\overline{\boldsymbol{\xi }}^j = (\overline{c}^j_{\min }, \overline{r}^j_{\min }, \overline{c}^j_{\max }, \overline{r}^j_{\max }) \in \mathbb {R}^4\) and category \(\overline{\kappa }^j \in \mathcal {N}_\mathrm {C}\). From the network output \(\mathfrak {O}(\mathbf {x}; \boldsymbol{\theta })\) on an image \(\mathbf {x}\) and the ground truth \(\overline{\mathbf {y}}\), a loss function \(J(\mathfrak {O}(\mathbf {x}; \boldsymbol{\theta }), \overline{\mathbf {y}})\) is computed and iteratively minimized over \(\boldsymbol{\theta }\) by stochastic gradient descent or one of its variants. In most object detection frameworks, \(J\) splits up additively into
$$\begin{aligned} J = J_{\boldsymbol{\xi }} + J_s + J_{\mathbf {p}}, \end{aligned}$$
(5)
with \(J_{\boldsymbol{\xi }}\) punishing localization mistakes, \(J_s\) punishing incorrect score assignments to boxes (incorrect boxes with high score and correct boxes with low score), and \(J_{\mathbf {p}}\) punishing an incorrect class probability distribution, respectively. In standard gradient descent optimization, the weights \(\boldsymbol{\theta }\) are then updated by the following rule:
$$\begin{aligned} \boldsymbol{\theta } \leftarrow \boldsymbol{\theta } - \eta \, \nabla _{\boldsymbol{\theta }} J \left( \mathfrak {O}(\mathbf {x}; \boldsymbol{\theta }), \overline{\mathbf {y}}\right) , \end{aligned}$$
(6)
where \(\eta > 0\) is the either fixed or variable learning rate parameter. The learning gradient \(\mathbf {g}(\mathbf {x}, \boldsymbol{\theta }, \overline{\mathbf {y}}) := \nabla _{\boldsymbol{\theta }} J (\mathfrak {O}(\mathbf {x}; \boldsymbol{\theta }), \overline{\mathbf {y}})\) will play a central role for the gradient-based uncertainty metrics which will be introduced in Sect. 3.4.

3.3 Output-Based Uncertainty: MetaDetect

In this section, we construct uncertainty metrics for every \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\in \hat{\mathbf {y}}(\boldsymbol{x}, \boldsymbol{\theta })\). We do so in two stages, first by introducing the general metrics that can be obtained from the object detection pipeline. Second, we extend this by additional metrics that can be computed when using MC dropout. We consider a predicted bounding box \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\in \hat{\mathbf {y}}(\boldsymbol{x}, \boldsymbol{\theta })\) and its corresponding filtered candidate boxes \(\mathfrak {O}^k(\mathbf {x}; \boldsymbol{\theta })\in \mathrm {cand}[\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })]\) that were discarded by the NMS.
The number of corresponding candidate boxes \(\mathfrak {O}^k(\mathbf {x}; \boldsymbol{\theta })\in \mathrm {cand}[\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })]\) filtered by the NMS intuitively indicates how likely it is that we observe a true positive. The more candidate boxes \(\mathfrak {O}^k(\mathbf {x}; \boldsymbol{\theta })\) belong to \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\), the more likely it seems that \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) is a true positive. We denote by \(N^{(j)}\) the number of candidate boxes \(\mathfrak {O}^k(\mathbf {x}; \boldsymbol{\theta })\) belonging to \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) but suppressed by the NMS, incremented by 1 so that \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) itself is also counted.
For a given image \(\boldsymbol{x}\), we have the set of predicted bounding boxes \(\hat{\mathbf {y}}(\boldsymbol{x}, \boldsymbol{\theta })\) and the ground truth \(\overline{\mathbf {y}}\). As we want to calculate values that represent the quality of the prediction of the neural network, we first have to define uncertainty metrics for the predicted bounding boxes in \(\hat{\mathbf {y}}(\boldsymbol{x}, \boldsymbol{\theta })\). For each \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\in \hat{\mathbf {y}}(\boldsymbol{x}, \boldsymbol{\theta })\), we define the following quantities:
  • the number of candidate boxes \(N^{(j)} \ge 1\) that belong to \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) (i.e., \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) belongs to itself; one metric),
  • the predicted box \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) itself, i.e., the values of the tuple
    $$\begin{aligned} \left( \hat{c}_{\min }^j, \hat{r}_{\min }^j, \hat{c}_{\max }^j, \hat{r}_{\max }^j, \hat{s}^j, \hat{p}^j_1, \ldots , \hat{p}^j_{N_\mathrm {C}}\right) \in \mathbb {R}^{4} \times \mathcal {S} \times \mathbb {I}^{N_\mathrm {C}}, \end{aligned}$$
    (7)
    as well as \(\sum _{i \in \mathcal {N}_\mathrm {C}} \hat{p}_i^j \in \mathbb {R}\) whenever class probabilities are not normalized (\(6+N_\mathrm {C}\) metrics),
  • size \(d=(\hat{r}_{\max }^j-\hat{r}_{\min }^j)\cdot (\hat{c}_{\max }^j-\hat{c}_{\min }^j)\) and circumference \(g=2\cdot (\hat{r}_{\max }^j-\hat{r}_{\min }^j)+2\cdot (\hat{c}_{\max }^j-\hat{c}_{\min }^j)\) (two metrics),
  • \( IoU ^j_{ pb }\): the \( IoU \) of \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) and the box with the second highest score that was suppressed by \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\). This value is zero if there are no boxes corresponding to \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) suppressed by the NMS (i.e., \(N^{(j)}=1\); one metric),
  • the minimum, maximum, arithmetic mean, and standard deviation for \((\hat{r}_{\min }^j,\hat{r}_{\max }^j,\hat{c}_{\min }^j,\hat{c}_{\max }^j,\hat{s}^j)\), size d and circumference g from \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) and all the filtered candidate boxes that were discarded from \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) in the NMS (\(4 \times 7\) metrics),
  • the minimum, maximum, arithmetic mean, and standard deviation for the \( IoU \) of \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) and all the candidate boxes corresponding to \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) that were suppressed in the NMS (four metrics),
  • relative sizes \(rd=d/g\), \(rd_{\max }=d/g_{\min }\), \(rd_{\min }=d/g_{\max }\), \(rd_{\mathrm {mean}}=d/g_{\mathrm {mean}}\), and \(rd_{\mathrm {std}}=d/g_{\mathrm {std}}\) (five metrics),
  • the maximal \( IoU \) of \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) and all ground truth boxes in \(\overline{\mathbf {y}}\); this is not an input to a meta model but serves as the target for fitting the respective meta classification or meta regression model.
Altogether, this results in \(47+N_\mathrm {C}\) uncertainty metrics which can be aggregated further in a meta classifier or a meta regression model.
We now elaborate on how to calculate uncertainty metrics for every \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\in \hat{\mathbf {y}}(\boldsymbol{x}, \boldsymbol{\theta })\) when using MC dropout. To this end, we consider the bounding box \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\in \hat{\mathbf {y}}(\boldsymbol{x}, \boldsymbol{\theta })\) that was predicted without dropout, then observe, under dropout, \(K\) outputs of the same anchor box that produced \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\) and denote them by \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })_1, \ldots , \mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })_K\). For these \(K+1\) boxes \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta }), \mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })_1, \ldots , \mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })_K\), we calculate the standard deviation of the localization variables, the objectness score, and the class probabilities. This is done for every \(\mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\in \hat{\mathbf {y}}(\boldsymbol{x}, \boldsymbol{\theta })\) and results in \(4 + 1 + N_\mathrm {C}\) additional dropout uncertainty metrics, which we denote by \(\mathrm {std}_\mathrm {MC}(\cdot )\) for each of the respective \(4 + 1 + N_\mathrm {C}\) box features. Executing this procedure for all available test images, we end up with a structured dataset: each row represents exactly one predicted bounding box and the columns are given by the registered metrics. After defining a training/test split of this dataset, we learn meta classification (\( IoU >0.5\) vs. \( IoU \le 0.5\)) and meta regression (quantitative \( IoU \) prediction) on the training part of this data. All the presented metrics, except for the true \( IoU \), can be computed without knowledge of the ground truth. We now want to explore to which extent they are suited for the tasks of meta classification and meta regression.
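As an illustration of how a few of these \(47+N_\mathrm {C}\) metrics can be assembled per predicted box, the sketch below computes the candidate count, size, circumference, score statistics, and \( IoU ^j_{ pb }\). The dictionaries `pred` and `candidates` are hypothetical containers for a predicted box and its suppressed candidate boxes, and `iou` is the helper sketched earlier; this is not the MetaDetect code itself.

```python
# Sketch of a few MetaDetect-style metrics for one predicted box.
import numpy as np

def metadetect_metrics(pred, candidates):
    feats = {}
    feats["N"] = len(candidates) + 1                       # candidate count, incl. the prediction itself
    c0, r0, c1, r1 = pred["xi"]
    feats["d"] = (r1 - r0) * (c1 - c0)                     # size (area)
    feats["g"] = 2 * (r1 - r0) + 2 * (c1 - c0)             # circumference
    scores = np.array([pred["s"]] + [c["s"] for c in candidates])
    feats["s_min"], feats["s_max"] = scores.min(), scores.max()
    feats["s_mean"], feats["s_std"] = scores.mean(), scores.std()
    if candidates:                                         # IoU_pb: IoU with the highest-scoring suppressed candidate
        strongest = max(candidates, key=lambda c: c["s"])
        feats["IoU_pb"] = iou(pred["xi"], strongest["xi"])
    else:
        feats["IoU_pb"] = 0.0
    return feats
```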

3.4 Gradient-Based Uncertainty for Object Detection

In the setting provided in Sect. 3.2, a DNN learns from a data point \((\mathbf {x}, \overline{\mathbf {y}}) \in \mathbb {I}^{H \times W \times C} \times (\mathbb {R}^4 \times \mathcal {N}_{\mathrm {C}})^{N_{\mathbf {x}}}\) by computing the gradient \(\mathbf {g}(\mathbf {x}, \boldsymbol{\theta }, \overline{\mathbf {y}})\). The latter is subsequently used to iteratively adjust the current weights \(\boldsymbol{\theta }\), for example, by the formula given in (6). As such, the quantity \(\mathbf {g}(\mathbf {x}, \boldsymbol{\theta }, \overline{\mathbf {y}})\) can be interpreted as learning stress on \(\boldsymbol{\theta }\) induced by the training data \((\mathbf {x}, \overline{\mathbf {y}})\). We explain next how a related quantity gives rise to confidence information.
Some intuition about gradient uncertainty: Generally, the loss function \(J\) measures the “closeness” of output boxes to the ground truth \(\overline{\mathbf {y}}\) and \(\mathbf {g}(\mathbf {x}, \boldsymbol{\theta }, \overline{\mathbf {y}})\) expresses the induced change in \(\boldsymbol{\theta }\) for any deviation from the ground truth. If we assume that \(\hat{\mathbf {y}}(\mathbf {x}; \boldsymbol{\theta })\) is close to \(\overline{\mathbf {y}}\), only little change in \(\boldsymbol{\theta }\) will be required in an update step. We, therefore, expect \(\mathbf {g}(\mathbf {x}, \boldsymbol{\theta }, \overline{\mathbf {y}})\) to be of comparably small magnitude. A DNN which is already “well-trained” in the ordinary sense is expected to express high confidence when its prediction is correct for all practical purposes. Conversely, if the network output deviates from the ground truth significantly, learning on \((\mathbf {x}, \overline{\mathbf {y}})\) induces a larger adjustment in \(\boldsymbol{\theta }\), leading to a gradient of larger magnitude. In this situation, the confidence in this prediction should be small.
The gradient \(\mathbf {g}(\mathbf {x}, \boldsymbol{\theta }, \overline{\mathbf {y}})\) cannot serve as a measure of uncertainty or confidence at inference time, since it requires the ground truth \(\overline{\mathbf {y}}\), which is then not accessible. An approach to circumvent this shortcoming was presented in [ORG18]: the ground truth \(\overline{\mathbf {y}}\) is replaced by the prediction \(\hat{\mathbf {y}}\) of the DNN on \(\mathbf {x}\), formatted such that it has the same structure as the ground truth. Particularly, in the initial application to image classification, the softmax output of the DNN was collapsed by an \(\mathrm {arg\,max}\) to the predicted class before insertion into \(J\). With respect to bounding box localization, no adjustment is required. Additionally, when computing
$$\begin{aligned} \mathbf {g}\left( \mathbf {x}, \boldsymbol{\theta }, \hat{\mathbf {y}}\right) = \nabla _{\boldsymbol{\theta }} J\left( \mathfrak {O}(\mathbf {x}; \boldsymbol{\theta }), \hat{\mathbf {y}}\right) , \end{aligned}$$
(8)
we disregard the dependency of \(\hat{\mathbf {y}}\) on \(\boldsymbol{\theta }\), thereby sticking to the design motivation from above. The quantity in (8) represents the weight adjustment induced in the DNN when learning that its prediction on \(\mathbf {x}\) is correct. Uncertainty or confidence information is distilled from \(\mathbf {g}(\mathbf {x}, \boldsymbol{\theta }, \hat{\mathbf {y}})\) by mapping it to a scalar. To this end, it is natural to employ norms, e.g., \(|\!| \cdot |\!|_1\) or \(|\!| \cdot |\!|_2\), but we also use other maps such as the minimal and maximal entry as well as the arithmetic mean and standard deviation over the gradient’s entries (see (9)). Gradient uncertainty metrics allow for some additional flexibility. Regarding (8), we see that if \(J\) splits into additive terms (such as in the object detection setting, see (5)), gradient metrics allow the extraction of gradient uncertainty from each individual contribution. Additionally, we can restrict the variables for which we compute partial derivatives. One might, for example, be interested in computing the gradient \(\mathbf {g}(\mathbf {x}, \boldsymbol{\theta }_{\ell }, \hat{\mathbf {y}})\) for the weights \(\boldsymbol{\theta }_\ell \) of only one network layer \(\ell \). In principle, this also allows for computing uncertainty metrics for individual convolutional filters or even individual weights, thus also offering a way of localizing uncertainty within the DNN architecture.
$$\begin{aligned} m_1^{\boldsymbol{\theta }}(J)&= |\!| \mathbf {g}(\mathbf {x}, \boldsymbol{\theta }, \hat{\mathbf {y}}) |\!|_1,&m_2^{\boldsymbol{\theta }}(J)&= |\!| \mathbf {g}(\mathbf {x}, \boldsymbol{\theta }, \hat{\mathbf {y}}) |\!|_2,&m_{\max }^{\boldsymbol{\theta }}(J)&= \max (\mathbf {g}(\mathbf {x}, \boldsymbol{\theta }, \hat{\mathbf {y}})), \\ m_{\min }^{\boldsymbol{\theta }}(J)&= \min (\mathbf {g}(\mathbf {x}, \boldsymbol{\theta }, \hat{\mathbf {y}})),&m_{\mathrm {std}}^{\boldsymbol{\theta }}(J)&= \mathrm {std}(\mathbf {g}(\mathbf {x}, \boldsymbol{\theta }, \hat{\mathbf {y}})),&m_{\mathrm {mean}}^{\boldsymbol{\theta }}(J)&= \mathrm {mean}(\mathbf {g}(\mathbf {x}, \boldsymbol{\theta }, \hat{\mathbf {y}})). \end{aligned}$$
(9)
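The six scalar maps in (9) can be sketched in a few lines of PyTorch; `grad` is a hypothetical flattened gradient of the self-referential loss with respect to some parameter subset \(\boldsymbol{\theta }\), so this is an illustration rather than the chapter's implementation.

```python
# Scalar reductions of a gradient vector, following (9).
import torch

def gradient_metrics(grad: torch.Tensor) -> dict:
    g = grad.flatten()
    return {
        "m_1":    g.abs().sum().item(),     # 1-norm
        "m_2":    g.norm(p=2).item(),       # 2-norm
        "m_max":  g.max().item(),           # maximal entry
        "m_min":  g.min().item(),           # minimal entry
        "m_std":  g.std().item(),           # standard deviation of the entries
        "m_mean": g.mean().item(),          # arithmetic mean of the entries
    }

print(gradient_metrics(torch.randn(1000)))
```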
Application to instance-based settings: The approach outlined above requires some adaptation in order to produce meaningful results for settings in which the network output \(\mathfrak {O}(\mathbf {x}; \boldsymbol{\theta })\) and the ground truth \(\overline{\mathbf {y}}\) consist of several distinct instances, e.g., as in object detection. In such a situation, each component \(\mathfrak {O}^j(\mathbf {x}; \boldsymbol{\theta })\) needs to be assigned individual confidence metrics. If we want to estimate the confidence for \(\hat{\mathbf {y}}^j := \mathfrak {B}^j(\mathbf {x}; \boldsymbol{\theta })\), we may regard \(\hat{\mathbf {y}}^j\) as the ground truth replacement entering the second slot of \(J\). Then, the first argument of \(J\) needs to be adjusted according to \(\hat{\mathbf {y}}^j\). This is necessary because the loss \(J(\mathfrak {O}(\mathbf {x}; \boldsymbol{\theta }), \hat{\mathbf {y}})\) expresses the mistakes made by all the output boxes, i.e., the prediction of instances which actually appear in the ground truth \(\overline{\mathbf {y}}\) but which have nothing to do with \(\hat{\mathbf {y}}^j\) would be punished. The corresponding gradient would, therefore, misleadingly adjust \(\boldsymbol{\theta }\) toward forgetting to see such ground truth instances. Thus, it is a natural idea to identify \(\mathrm {cand}[\hat{\mathbf {y}}^j]\) in the network output and only enter those boxes into \(J\) which likely indicate one and the same instance on \(\mathbf {x}\). We define the corresponding gradient
$$\begin{aligned} \mathbf {g}^{\mathrm {cand}}\left( \mathbf {x}, \boldsymbol{\theta }, \hat{\mathbf {y}}^j\right) := \nabla _{\boldsymbol{\theta }} J\left( \mathrm {cand}[\hat{\mathbf {y}}^j](\mathbf {x}, \boldsymbol{\theta }), \hat{\mathbf {y}}^j\right) , \end{aligned}$$
(10)
which serves as the basis of gradient uncertainty in instance-based settings. The flexibility mentioned in the previous paragraph carries over to this definition and will be exploited in our experiments.
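A hedged sketch of how (10) could be computed per predicted box in PyTorch is given below; `loss_fn`, `candidate_outputs`, `y_hat_j`, and `layer` are hypothetical stand-ins for the loss \(J\) (or one of its summands), the outputs in \(\mathrm {cand}[\hat{\mathbf {y}}^j]\), the prediction used as ground truth replacement, and the layer whose weights \(\boldsymbol{\theta }_\ell \) the gradient is restricted to.

```python
# Candidate-restricted, layer-restricted gradient for one predicted box, cf. (10).
import torch

def candidate_gradient(loss_fn, candidate_outputs, y_hat_j, layer):
    # treat the prediction as ground truth: detach it so that no gradient flows through it,
    # i.e., the dependency of y_hat on theta is disregarded
    loss = loss_fn(candidate_outputs, y_hat_j.detach())
    # partial derivatives w.r.t. the weights of a single layer only
    (grad_w,) = torch.autograd.grad(loss, layer.weight, retain_graph=False)
    return grad_w
```

The returned tensor can then be reduced with the `gradient_metrics` sketch above to obtain the per-box gradient uncertainty metrics.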

4 Experimental Setup

In this section, we explain our choice of datasets, models, and metrics as well as experimental setup and implementation details.

4.1 Databases, Models, and Metrics

We investigate three different object detection datasets: Pascal VOC [EVGW+15], MS COCO [LMB+14], and the KITTI vision benchmark [GLSU15]. Pascal VOC \(\mathcal {D}^\mathrm {VOC12}\) and MS COCO \(\mathcal {D}^\mathrm {COCO17}\) are vision datasets containing images of a wide range of mostly everyday scenarios with 20 and 80 annotated classes, respectively. They both possess training splits and dedicated evaluation splits for testing and constitute established benchmarks in object detection. In order to test our methods in driving scenarios, we also investigate the KITTI dataset \(\mathcal {D}^\mathrm {KITTI}\), which shows different German urban environments and comes with annotated training images. From the annotated training set, we take 2,000 randomly chosen images for evaluation. An overview of the sizes of the respective training and evaluation datasets can be found in Table 1.
Table 1  Dataset split sizes for training and testing

| Dataset | \(\mathcal {D}^\mathrm {VOC12}_\mathrm {train}\) | \(\mathcal {D}^\mathrm {VOC12}_\mathrm {test}\) | \(\mathcal {D}^\mathrm {COCO17}_\mathrm {train}\) | \(\mathcal {D}^\mathrm {COCO17}_\mathrm {val}\) | \(\mathcal {D}^\mathrm {KITTI}_\mathrm {train}\) | \(\mathcal {D}^\mathrm {KITTI}_\mathrm {eval}\) |
|---|---|---|---|---|---|---|
| Size | 14,805 | 4,952 | 118,287 | 5,002 | 5,481 | 2,000 |

The experiments we present here have been carried out based on a YOLOv3 [RF18] re-implementation in PyTorch. We have trained our model from scratch on each of the three training splits with a dropout probability of 0.5 between the second-to-last and the last convolution layer in each of the regression heads. As meta classification and meta regression models, we employ the gradient boosting models of [CG16] with standard settings.
For the evaluation of our meta classification models, we use cross-validated area-under-curve metrics (\( AuROC \) and \( AuPR \) [DG06]) instead of accuracy, since the former measure the classification quality independently of any threshold, whereas accuracy only reflects the decisions made at a classification threshold of \(0.5\). For cross-validation, we use 10 individual image-wise splits of uncertainty metrics for the training of meta classifiers and the respective complementary split for evaluation. This particular procedure is employed whenever cross-validation is used in the following experiments. Especially on the VOC and KITTI datasets, \( AuROC \) values are mostly above \(0.9\), so we regard \( AuPR \) as a second classification metric. Object detection quality in the meta fusion experiments is, on the one hand, evaluated in terms of cross-validated mean average precision (\( mAP \)) [EVGW+15]. However, as the authors of [RF18] observed, \( mAP \) as computed in the VOC challenge is insensitive to FPs; therefore, on the other hand, we compute the cross-validated mean \(F_1\) score (\( mF _1\)) as well. For the precision-recall curve of each class, we compute an individual \(F_1\) score, which is sensitive to FPs, and average over all classes as a complementary metric to \( mAP \). Confidence calibration is usually evaluated in terms of the expected calibration error or the maximum calibration error (\( MCE \)) [NCH15], computed over bins of width 0.1. As the expected calibration error is sensitive to the box count per bin, we instead regard, in addition to the \( MCE \), the average calibration error (\( ACE \)) [NZV18]. Meta regression quality is evaluated by the usual coefficient of determination \(R^2\).

4.2 Implementation Details

Our object detection models receive a dropout layer between the second-to-last and the last layers in each of the three YOLOv3 detection heads. Dropout is active during training with a rate of 0.5 and also during MC inference, where we take standard deviations over 30 dropout samples for each of the \(4 + 1 + N_\mathrm {C}\) instance features of all output boxes. The MetaDetect metrics introduced in Sect. 3.3 are computed from a score threshold of \(\varepsilon _s = 0.0\), as it has been found in [SKR20] that lower thresholds lead to better performance in meta classification and meta regression.
Gradient uncertainty metrics were computed for the weights in the second-to-last (“\(T - 1\)”) and the last (“T”) convolutional layers in the YOLOv3 architecture at the same candidate score threshold as for the MetaDetect metrics. As there are three detection heads, corresponding to a \(76 \times 76\) (“S”), a \(38 \times 38\) (“M”), and a \(19 \times 19\) (“L”) cell grid, we also compute gradient uncertainty metrics individually for each detection head. We use this distinction in our notation and, for example, indicate the set of parameters from the last layer (\(T\)) of the detection head producing the \(76 \times 76\) cell grid (\(\mathrm {S}\)) by \(\boldsymbol{\theta }(T, \mathrm {S})\). Moreover, as indicated in Sect. 3.4, we also exploit the split of the loss function in (5). Each of the computed \(2 \times 3 \times 3 = 18\) gradients per box results in the 6 uncertainty metrics presented in (9) giving a total of 108 gradient uncertainty metrics per bounding box. Due to the resulting computational expense, we only compute gradient metrics for output boxes with score values \(\hat{s} \ge 10^{-4}\) and regard only those boxes for all of the following experiments.

4.3 Experimental Setup and Results

Starting from pre-trained models, we compute the DNN output, as well as MC dropout, MetaDetect, and gradient uncertainty metrics on each evaluation dataset (see Table 1). For each output box, we also compute the maximum \( IoU \) with the respective ground truth such that the underlying relation can be leveraged. Before fitting any model, we perform NMS on the output boxes which constitutes the relevant case for our meta fusion investigation. In the latter, we are primarily interested in finding true boxes which are usually discarded due to their low assigned score. We compare the performance of different sets of uncertainty metrics in different disciplines. In the following, “Score” denotes the network’s confidence score, “MC” stands for MC dropout, and “MD” the MetaDetect uncertainty metrics as discussed in Sect. 3.3. Moreover, “G” denotes the full set of gradient uncertainty metrics from Sect. 3.4 and the labels “MD+MC”, “G+MC”, “MD+G”, and “MD+G+MC” stand for the respective unions of MC dropout, MetaDetect, and gradient uncertainty metrics.
Meta classification: We create 10 splits of the dataset of uncertainty metrics and computed \( IoU \) by randomly dividing it 10 times in half. We assign the TP label to those examples which have \( IoU > 0.5\) and the FP label otherwise. On each split, we fit a gradient boosting classifier to the respective set of uncertainty metrics (see Fig. 5) on one half of the data. The resulting model is used to predict a classification probability on the uncertainty metrics of the second half of the data and vice versa. This results in meta classification predictions on all data points. We evaluate these in terms of the previously introduced area-under-curve metrics using the TP/FP labels generated from the ground truth, where we obtain averages and standard deviations from the 10 splits shown in Fig. 5. Performance measured in either of the two area-under-curve metrics suggests that the models situated in the top-right separate TP from FP best. In all datasets, the meta classifiers MD, G, and combinations thereof outperform the network confidence by a large margin (upwards of 6 \( AuROC \) percent points (ppts) on \(\mathcal {D}_\mathrm {test}^\mathrm {VOC12}\), 2.5 on \(\mathcal {D}_\mathrm {val}^\mathrm {COCO17}\), and 1.5 on \(\mathcal {D}_\mathrm {eval}^\mathrm {KITTI}\)), with MC dropout being among the weakest meta classifiers. Note that for both \(\mathcal {D}_\mathrm {test}^\mathrm {VOC12}\) (top) and \(\mathcal {D}_\mathrm {val}^\mathrm {COCO17}\) (center), the standard deviation bars are occluded by the markers due to the size of the range. On \(\mathcal {D}_\mathrm {eval}^\mathrm {KITTI}\), we can see some of the error bars and the models forming a strict hierarchy across the two metrics with overlapping MD+MC and G+MC. Overall, the three sources of uncertainty MC, MD, and G show a significant degree of non-redundancy across the three datasets with mutual boosts of up to 7 \( AuPR \) ppts on \(\mathcal {D}_\mathrm {test}^\mathrm {VOC12}\), 4.5 on \(\mathcal {D}_\mathrm {val}^\mathrm {COCO17}\), and 1.1 on \(\mathcal {D}_\mathrm {eval}^\mathrm {KITTI}\).
Meta fusion: We use the 10 datasets of uncertainty metrics and resulting confidences as in meta classification in combination with the bounding box information as new post-NMS output of the network. Thresholding the respective confidence at decision thresholds \(\varepsilon _\mathrm {dec} \in (0, 1)\) with a fixed step width of 0.01 reproduces, in the case of the score, the baseline object detection pipeline. The network prediction with assigned confidence can then be evaluated in terms of the aforementioned class-averaged performance measures, for which we obtain averages and standard deviations from the 10 splits, except for the deterministic score baseline. In our comparison, we focus on the score baseline, the MC dropout baseline, output- and gradient-based uncertainty metrics (MD and G), and the full model based on all available uncertainty metrics (G+MD+MC). We show the resulting curves of \( mAP \) over \( mF _1\) in Fig. 6 on the right, where we see that the maximum \( mF _1\) value is attained by the confidence score. In addition, we observe a sharp bend in the score curve resulting from a large count of samples with a score between 0 and 0.01, where the interpolation to the threshold \(\varepsilon _s = 0\) reaches 0.912 \( mAP \) at a low 0.61 \( mF _1\) (outside the plotting range). In terms of \( mAP \), however, the score is surpassed by the meta fusion models G and in particular the meta classification models involving MD (maximum \( mAP \) close to 0.92 at around 0.81 \( mF _1\)). The raw MetaDetect confidences seem to slightly improve over G+MD+MC in terms of \( mAP \) by about 0.3 ppts., which may be due to overfitting of the gradient boosting model, but could also result from the altered confidence ranking which is initially performed when computing mean average precision. Meta fusion based on MC dropout uncertainty shows the worst performance of all four methods considered in this test.
Calibration: From the generated meta classification probabilities and the score, calibration errors can be computed, indicating their calibration as binary classifiers, and compared; see Fig. 7 with standard deviations indicated by error bars. In addition, we show the Platt-scaled confidence score as a reference for a post-processing calibration method. Naturally, \( ACE \) is smaller than \( MCE \). The meta classification models show consistently far smaller calibration errors (maximum \( MCE \) of 0.11 on \(\mathcal {D}^\mathrm {KITTI}_\mathrm {eval}\)) than the confidence score (minimum \( MCE \) of 0.1 on \(\mathcal {D}^\mathrm {COCO17}_\mathrm {val}\)), with comparatively small miscalibration on the COCO validation dataset (center). We also observe that meta classifiers are better calibrated than the Platt-scaled confidence score, with the smallest margin on \(\mathcal {D}^\mathrm {KITTI}_\mathrm {eval}\). The \( ACE \), with a maximum of around 0.05, confirms that meta classification models show generally good calibration, as indicated in Sect. 3 and by the example of Fig. 3.
Meta regression: By an analogous approach to meta classification, we can fit gradient boosting regression models in a cross-validated fashion to sets of uncertainty metrics and the true maximum \( IoU \) with the ground truth. The resulting predicted \( IoU \) values can be evaluated in terms of \(R^2\) which is illustrated in Fig. 6 on the left. Note once again that error bars for the standard deviation are covered by the markers. We observe an overall similar trend of the metrics to the meta classification performance in that combined models tend to perform better (boosts of up to 6 \(R^2\) ppts.) indicating mutual non-redundancy among the different sources of uncertainty. The models based on MD and G individually show significant improvements over the confidence score (over 5 \(R^2\) ppts. on all datasets) and a slight improvement over MC (up to 3 \(R^2\) ppts.) for all investigated datasets.
Table 2  Cumulative parameter ranking in terms of their importance for \( AuROC \) and \( AuPR \) obtained from a greedy search for the first 9 chosen parameters. Metrics indicated are averages obtained from cross-validation. We show the performance of the models with all metrics as co-variables for reference on each of the datasets \(\mathcal {D}^\mathrm {VOC12}_\mathrm {test}\), \(\mathcal {D}^\mathrm {COCO17}_\mathrm {val}\), and \(\mathcal {D}^\mathrm {KITTI}_\mathrm {eval}\)

| # | VOC \( AuROC \) | VOC \( AuPR \) | COCO \( AuROC \) | COCO \( AuPR \) | KITTI \( AuROC \) | KITTI \( AuPR \) |
|---|---|---|---|---|---|---|
| 1 | 0.917 \(\mathrm {std}_\mathrm {MC}(\hat{s})\) | 0.658 \(\hat{s}\) | 0.830 \(\hat{s}\) | 0.623 \(\hat{s}\) | 0.966 \(\hat{s}\) | 0.969 \(\hat{s}\) |
| 2 | 0.959 \(m_\mathrm {mean}^{\boldsymbol{\theta }(T, \mathrm {S})}(J_s)\) | 0.740 \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {S})}(J_p)\) | 0.875 \(\mathrm {std}_\mathrm {MC}(\hat{s})\) | 0.664 \(\mathrm {std}_\mathrm {MC}(\hat{s})\) | 0.978 \(\hat{c}_{\min }\) | 0.976 \(\hat{c}_{\min }\) |
| 3 | 0.966 \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {S})}(J_p)\) | 0.769 \(\mathrm {std}_\mathrm {MC}(\hat{s})\) | 0.886 \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {L})}(J_p)\) | 0.685 \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {S})}(J_p)\) | 0.981 \(\mathrm {std}_\mathrm {MC}(\hat{s})\) | 0.980 \(m_{1}^{\boldsymbol{\theta }(T-1, \mathrm {L})}(J_p)\) |
| 4 | 0.971 \(m_{\mathrm {std}}^{\boldsymbol{\theta }(T, \mathrm {M})}(J_p)\) | 0.784 \(m_{\mathrm {mean}}^{\boldsymbol{\theta }(T, \mathrm {M})}(J_p)\) | 0.893 \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {S})}(J_p)\) | 0.704 \(m_{\max }^{\boldsymbol{\theta }(T-1, \mathrm {M})}(J_p)\) | 0.983 \(m_{\mathrm {std}}^{\boldsymbol{\theta }(T-1, \mathrm {L})}(J_p)\) | 0.981 \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {M})}(J_p)\) |
| 5 | 0.972 \(\hat{c}_{\min }\) | 0.790 \(m_{\mathrm {mean}}^{\boldsymbol{\theta }(T-1, \mathrm {S})}(J_\xi )\) | 0.899 \(m_{\max }^{\boldsymbol{\theta }(T-1, \mathrm {M})}(J_p)\) | 0.710 \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {L})}(J_p)\) | 0.984 \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {M})}(J_\xi )\) | 0.982 \(\hat{r}_{\max }\) |
| 6 | 0.974 \(\hat{s}\) | 0.795 \(\hat{c}_{\min }\) | 0.900 \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {M})}(J_p)\) | 0.713 \(\mathrm {std}_\mathrm {MC}(\hat{c}_{\min })\) | 0.985 \(\hat{r}_{\max }\) | 0.983 \(\sum _i \hat{p}_i\) |
| 7 | 0.975 \(m_{\mathrm {mean}}^{\boldsymbol{\theta }(T, \mathrm {S})}(J_\xi )\) | 0.799 \(\mathrm {std}_\mathrm {MC}(\hat{c}_{\min })\) | 0.901 \(m_{\max }^{\boldsymbol{\theta }(T-1, \mathrm {L})}(J_s)\) | 0.715 \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {M})}(J_p)\) | 0.985 \(m_{\max }^{\boldsymbol{\theta }(T, \mathrm {S})}(J_p)\) | 0.984 \(\mathrm {std}_\mathrm {MC}(\hat{s})\) |
| 8 | 0.975 \(\hat{r}_{\min }\) | 0.802 \(m_{\mathrm {mean}}^{\boldsymbol{\theta }(T, \mathrm {S})}(J_\xi )\) | 0.902 \(\mathrm {std}_\mathrm {MC}(\hat{c}_{\min })\) | 0.716 \(\mathrm {std}_\mathrm {MC}(\hat{r}_{\max })\) | 0.986 \(\hat{p}_1\) | 0.984 \(m_{2}^{\boldsymbol{\theta }(T, \mathrm {S})}(J_\xi )\) |
| 9 | 0.976 \(\hat{c}_{\max }\) | 0.803 \(m_{\max }^{\boldsymbol{\theta }(T-1, \mathrm {M})}(J_p)\) | 0.902 \(m_{\mathrm {mean}}^{\boldsymbol{\theta }(T, \mathrm {S})}(J_s)\) | 0.717 \(m_{2}^{\boldsymbol{\theta }(T, \mathrm {S})}(J_p)\) | 0.986 \(m_{\mathrm {mean}}^{\boldsymbol{\theta }(T-1, \mathrm {L})}(J_\xi )\) | 0.984 \(m_{\mathrm {mean}}^{\boldsymbol{\theta }(T-1, \mathrm {L})}(J_\xi )\) |
| All metrics (MD+G+MC) | 0.976 | 0.803 | 0.904 | 0.720 | 0.986 | 0.984 |

Parameter importance: We investigate which parameters are most important for meta classification in order to reach the best possible performance with as few metrics as possible. We pursue a greedy approach and first evaluate single-metric models in terms of \( AuROC \) and \( AuPR \) in the cross-validated manner used before. We then fix the best-performing single metric, investigate all two-metric combinations containing it, and so on, recording the resulting metric ranking together with the corresponding area-under-curve scores in Table 2. In addition to the nested models, we show the model using all metrics, which is also depicted in Fig. 5. Note that the numbers represent cross-validated averages; we choose not to show standard deviations, as we are mainly interested in which metrics combine to yield well-separating meta classifiers. On all datasets, we find that a total of 9 metrics suffices to come close to or reach the performance of the best model in question. The network confidence appears as the most informative single metric in 5 out of 6 cases. Another metric that leads to good improvements early on is the related MC dropout standard deviation of the confidence score. Further noteworthy metrics are the left border coordinate \(\hat{c}_{\min }\) and gradient metrics for the class probability contribution \(J_p\). Overall, gradient metrics contribute a large amount of information beyond \(\hat{s}\) and \(\mathrm {std}_\mathrm {MC}(\hat{s})\), which is reflected in the meta classification performance.
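The greedy search can be sketched as a simple forward selection over the metric columns: at each step, add the metric that maximally improves the cross-validated area under curve of the meta classifier. The snippet below is an illustrative sketch only; the choice of classifier (gradient boosting), the \( AuROC \) selection criterion, and the number of folds are assumptions, and an \( AuPR \)-based ranking would be obtained by swapping the scoring function.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def greedy_metric_ranking(X, y, names, n_select=9, cv=10):
    """Greedy forward selection of uncertainty metrics for meta classification.

    X     -- array of shape (n_boxes, n_metrics) with uncertainty metrics
    y     -- binary TP/FP labels of the predicted boxes
    names -- list of metric names, one per column of X
    Returns a list of (metric name, cumulative cross-validated AuROC).
    """
    selected, remaining, history = [], list(range(X.shape[1])), []
    for _ in range(n_select):
        best_auc, best_j = -np.inf, None
        # re-evaluating every candidate with full cross-validation is
        # expensive but matches the cumulative ranking of Table 2
        for j in remaining:
            cols = selected + [j]
            auc = cross_val_score(
                GradientBoostingClassifier(random_state=0),
                X[:, cols], y, cv=cv, scoring="roc_auc").mean()
            if auc > best_auc:
                best_auc, best_j = auc, j
        selected.append(best_j)
        remaining.remove(best_j)
        history.append((names[best_j], best_auc))
    return history
```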
Conclusions and Outlook
In this chapter, we have outlined two largely orthogonal approaches to quantifying epistemic uncertainty in deep object detection networks: the output-based MetaDetect approach and gradient-based uncertainty metrics. Both compare favorably to the network confidence score and to MC dropout in terms of meta classification and meta regression, with the additional benefit that meta classifiers give rise to well-calibrated confidences irrespective of the set of metrics they are based on. Moreover, different sources of uncertainty information lead to additional boosts when combined, indicating mutual non-redundancy. We have seen that the meta fusion approach enables us to trade uncertainty information for network performance in terms of mean average precision, with the best meta fusion models involving the MetaDetect metrics. In terms of raw meta classification capability, well-performing models can be obtained from only a small fraction of all metrics, with gradient-based uncertainty metrics contributing highly informative components. The design of well-performing meta classifiers and meta regression models opens up possibilities for further applications. On the one hand, integrating meta classification models into an active learning cycle could potentially lead to a drastic decrease in data annotation costs. On the other hand, meta classification models may prove beneficial for assessing the labeling quality of datasets and for detecting labeling mistakes.

Acknowledgements

The research leading to these results is funded by the German Federal Ministry for Economic Affairs and Energy within the project “Methoden und Maßnahmen zur Absicherung von KI-basierten Wahrnehmungsfunktionen für das automatisierte Fahren (KI-Absicherung)”. The authors would like to thank the consortium for the successful cooperation. Furthermore, we gratefully acknowledge financial support from the state Ministry of Economy, Innovation and Energy of Northrhine Westphalia (MWIDE) and the European Fund for Regional Development via the FIS.NRW project BIT-KI, grant no. EFRE-0400216. The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.​gauss-centre.​eu) for funding this project by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUWELS at Jülich Supercomputing Centre (JSC).
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://​creativecommons.​org/​licenses/​by/​4.​0/​), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.