

Open Access 2022 | OriginalPaper | Chapter

Detecting and Learning the Unknown in Semantic Segmentation

Authors : Robin Chan, Svenja Uhlemeyer, Matthias Rottmann, Hanno Gottschalk

Published in: Deep Neural Networks and Data for Automated Driving

Publisher: Springer International Publishing


Abstract

Semantic segmentation is a crucial component for perception in automated driving. Deep neural networks (DNNs) are commonly used for this task, and they are usually trained on a closed set of object classes appearing in a closed operational domain. However, this contrasts with the open world in which DNNs for automated driving are deployed. Therefore, DNNs necessarily face data that they have never encountered previously, also known as anomalies, which are extremely safety-critical to cope with properly. In this chapter, we first give an overview of anomalies from an information-theoretic perspective. Next, we review research in detecting unknown objects in semantic segmentation. We present a method outperforming recent approaches by training for high entropy responses on anomalous objects, which is in line with our theoretical findings. Finally, we propose a method to assess the occurrence frequency of anomalies in order to select anomaly types to include in a model’s set of semantic categories. We demonstrate that those anomalies can then be learned in an unsupervised fashion, which is particularly suitable in online applications.

1 Introduction

Recent developments in deep learning have enabled scientists and practitioners to advance in a broad field of applications that were intractable before. For this purpose, deep neural networks (DNNs) are mostly employed, which are usually trained in a supervised fashion under a closed-world assumption. However, when those algorithms are deployed to real-world applications, e.g., artificial intelligence (AI) systems used for perception in automated driving, they often operate in an open-world setting where they have to face the diversity of the real world. Consequently, DNNs are likely exposed to data which is “unknown” to them and therefore possibly beyond their capabilities to process. For this reason, methods that indicate when a DNN is operating outside of its learned domain, so that human intervention can be sought, are of utmost importance in safety-critical applications.

A generic term for such a task is anomaly detection, which is generally defined as recognizing when something departs from what is regarded as normal or common. More precisely, identifying anomalous examples during inference, i.e., new examples that are “extreme” in some sense as they lie in low density regimes or even outside of the training data distribution, is commonly referred to as out-of-distribution (OoD) or novelty detection in deep learning terminology. The latter is in close connection to the task of identifying anomalous examples in training data, which is contrarily known as outlier detection; a term originating from classical statistics to determine whether observational data is polluted. Those outlined notions are often used interchangeably in deep learning literature. Throughout this chapter, we will stick to the general term anomaly and only specify when distinguishing is relevant.

For the purpose of anomaly detection, plenty of methods, ranging from classical statistical ones (see Sect. 2) to deep-learning-specific ones (see Sect. 4) have been developed in the past. Nowadays for the most challenging computer vision tasks tackled by deep learning, where both the model weights and output are of high dimension (in the millions), specific approaches to anomaly detection are mandatory.

Classical methods such as density estimation fail due to the curse of dimensionality. Early approaches identify outliers based on the distance to their neighbors [KNT00, RRS00], i.e., they are looking for sparse neighborhoods. Other methods consider relative densities to handle clusters of different densities, e.g., by comparing one instance either to its k-nearest neighbors [BKNS00] or using an $\varepsilon $-neighborhood as reference set [PKGF03]. However, the concept of neighborhoods becomes meaningless in high dimensions [AHK01]. More advanced approaches for high-dimensional data compute outlier degrees based on angles instead of distances [KSZ08] or even identify lower-dimensional subspaces [AY01, KKSZ09].

In deep-learning-driven computer vision applications, novelties are typically regarded as more relevant than outliers. In semantic segmentation, i.e., pixel-level image classification, novelty detection may even refer to a number of sub-tasks. On the one hand, we might be concerned with the detection of semantically anomalous objects. This is also known as anomaly segmentation in the case of semantic segmentation. On the other hand, we also might be concerned with the detection of changed environmental conditions that are novel. The latter may be effects of a domain shift and include change in weather, time of day, seasonality, location and time. In this chapter, we focus only on semantically novel objects as anomalies.

In general, an important capability of AI systems is to identify the unknown. However, when striving for improved self-reflection capabilities, anomaly detection is not sufficient. Another important capability for real-world deployment of AI systems is to realize that some specific concept appears over and over again and potentially constitutes a new (or novel) object class. Incremental learning refers to the task of learning new classes. However, especially in semantic segmentation, this is mostly done in a strictly supervised or semi-supervised fashion where data for the new class is labeled with ground truth [MZ19, CMB+20]. This is accompanied by an enormous data collection and annotation effort. In contrast to supervised incremental learning, humans may very well recognize a novelty of a given class that appears over and over again, such that in the end a single feedback might be sufficient to assign a name to a novel class. For the task of image classification, [HZ21] provides an unsupervised extension of the semantic space, while for segmentation there exist only approaches for supervised extension of the semantic space via incremental learning.

In this chapter, we first introduce anomaly detection from an information-based perspective in Sect. 2. We provide theoretical evidence that the entropy is a suitable quantity for anomaly detection, particularly in semantic segmentation. In Sect. 3, we review recent developments in the fields of anomaly detection and unsupervised learning of new classes. We give an overview of existing methods, both in the context of image classification and semantic segmentation. In this setting, we present an approach to train semantic segmentation DNNs for high entropy on anomaly data in Sect. 4. We compare our proposed approach against other established and recent state-of-the-art anomaly segmentation methods and empirically show the effectiveness of entropy maximization in identifying unknown objects. Lastly, we propose an unsupervised learning technique for novel object classes in Sect. 5. Further, we provide an outlook on how the latter approach can be combined with entropy maximization to handle the unknown at run time in automated driving.

2 Anomaly Detection Using Information and Entropy

Anomaly detection is a common routine in any data analysis task. Before training a statistical model on data, the data should be investigated whether the underlying distribution generating the data is polluted by anomalies. In this context, anomalies can generally be understood as samples that do not fit into a distribution. Such anomalous samples can, e.g., be generated in the data recording process either by extreme observations, by errors in recording and transmission, or by the fusion of datasets that use different systems of units. Most common for the detection of anomalies in statistics is the inspection of maximum and minimum values for each feature, or simple univariate visualization via box-whisker plots or histograms.

More sophisticated techniques are applied in multivariate anomaly detection. Here, anomalous samples do not necessarily have to contain extreme values for single features, but rather an untypical combination of them. One of the application areas for multivariate anomaly detection is, e.g., fraud detection.

In both outlined cases, an anomaly $\mathbf {z}\in \mathbb {R}^d$ can be qualified as an observation that occurs at a location of extremely low density of the underlying distribution $ {\mathrm p_{}}(\mathbf {z})$ or, equivalently, has an exceptionally high value of the information

$$\begin{aligned} I(\mathbf {z})=-\log {\mathrm p_{}}(\mathbf {z}) ~. \end{aligned}$$

(1)

Here, two problems occur: First, it is generally not specified what is considered as exceptionally high. Second, ${\mathrm p_{}}(\mathbf {z})$ and thereby $I(\mathbf {z})$ are generally unknown. Regarding the latter issue, however, the estimate $\hat{I}(\mathbf {z})=-\log \hat{\mathrm p_{}}(\mathbf {z})$ can be used which in turn relies on estimating $\hat{\mathrm p_{}}(\mathbf {z})$ from data associated to the probability density function ${\mathrm p_{}}(\mathbf {z})$. Estimation approaches for $\hat{\mathrm p_{}}(\mathbf {z})$ can be distinguished between parametric and non-parametric ones.

The Mahalanobis distance [Mah36] is the best known parametric method for anomaly detection which is based on information of the multivariate normal distribution N. In fact, if $\mathbf {z}\sim N(\boldsymbol{\mu },\mathbf {\Sigma })$ with mean $\boldsymbol{\mu }\in \mathbb {R}^d$ and positive definite covariance matrix $\mathbf {\Sigma } \in \mathbb {R}^{d \times d}$, then

$$\begin{aligned} I(\mathbf {z})&= -\log \left( \frac{1}{(2\pi )^{d/2} (\det \mathbf {\Sigma })^{1/2}} \exp \left( - \frac{1}{2} (\mathbf {z}-\boldsymbol{\mu })^\mathsf {T}\mathbf {\Sigma }^{-1}(\mathbf {z}-\boldsymbol{\mu })\right) \right) \end{aligned}$$

(2)

$$\begin{aligned}&= \frac{d}{2}\log (2\pi ) + \frac{1}{2} \log (\det \mathbf {\Sigma }) + \frac{1}{2}(\mathbf {z}-\boldsymbol{\mu })^\mathsf {T}\mathbf {\Sigma }^{-1}(\mathbf {z}-\boldsymbol{\mu }) =\frac{1}{2}\mathsf { d}_\mathbf {\Sigma }(\mathbf {z},\boldsymbol{\mu })^2+c ~, \end{aligned}$$

(3)

where

$$\begin{aligned} \mathsf {d}_\mathbf {\Sigma }(\mathbf {z},\boldsymbol{\mu }) := \sqrt{ (\mathbf {z}-\boldsymbol{\mu })^\mathsf {T}\mathbf {\Sigma }^{-1}(\mathbf {z}-\boldsymbol{\mu }) } \end{aligned}$$

(4)

denotes the Mahalanobis distance. The estimation $\hat{I}(\mathbf {z})$ is obtained by replacing $\boldsymbol{\mu }$ and $\mathbf {\Sigma }$ by the arithmetic mean $\hat{\boldsymbol{\mu }}$ and the empirical covariance matrix $\hat{\mathbf {\Sigma }}$, respectively, and likewise $\mathsf {d}_\mathbf {\Sigma }(\mathbf {z}, \boldsymbol{\mu }) $ by the empirical Mahalanobis distance $\mathsf {d}_{\hat{\mathbf {\Sigma }}}(\mathbf {z},\hat{\boldsymbol{\mu }})$.
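The plug-in estimate described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy and at least two features; the function name `gaussian_information` and the synthetic data in the usage note are hypothetical and not part of the chapter.

```python
import numpy as np

def gaussian_information(z, data):
    """Estimate I(z) = 0.5 * d_Sigma(z, mu)^2 + c under a Gaussian fitted to `data`.

    data: array of shape (n_samples, d) with d >= 2; z: array of shape (d,).
    Returns the estimated information and the empirical Mahalanobis distance.
    """
    data = np.asarray(data, dtype=float)
    z = np.asarray(z, dtype=float)
    d = data.shape[1]
    mu_hat = data.mean(axis=0)                 # arithmetic mean
    sigma_hat = np.cov(data, rowvar=False)     # empirical covariance matrix (d x d)
    diff = z - mu_hat
    # Squared Mahalanobis distance; solving a linear system avoids an explicit inverse.
    d2 = float(diff @ np.linalg.solve(sigma_hat, diff))
    _, logdet = np.linalg.slogdet(sigma_hat)
    c = 0.5 * d * np.log(2.0 * np.pi) + 0.5 * logdet   # constant c from Eq. (3)
    return 0.5 * d2 + c, float(np.sqrt(d2))
```

For example, with `data = np.random.default_rng(0).normal(size=(500, 2))`, a point far from the sample mean such as `[8.0, 8.0]` receives a much larger estimated information than the sample mean itself.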

In contrast, non-parametric anomaly detection relies on non-parametric estimates of ${\mathrm p_{}}(\mathbf {z})$. Here, a large variety of methods exists, ranging from histograms to kernel estimators and many others [Kle09]. We note, however, that the non-parametric estimation of densities and information generally suffers from the curse of dimensionality. To alleviate the latter issue in anomaly detection, estimation of information is often combined with techniques of dimensionality reduction, such as principal component analysis [HTF07] or autoencoders [SY14].

When using non-linear dimensionality reduction with autoencoders, densities obtained in the latent space depend on the encoder and not only on the data itself. This points towards a general problem in anomaly detection. If ${\mathrm p_{}}(\mathbf {z})$ is the density of a random quantity $\mathbf {z}$ and $\mathbf {z}'=\boldsymbol{\phi }(\mathbf {z})$ is an equivalent encoding of the data $\mathbf {z}$ using a bijective and differentiable mapping $\boldsymbol{\phi }:\mathbb {R}^d\mapsto \mathbb {R}^d$, the change of variables formula [Rud87, AGLR21]

$$\begin{aligned} {\mathrm p_{}}(\mathbf {z}') = {\mathrm p_{}}(\mathbf {z}) \cdot |\det \left( \nabla _{\mathbf {z}'} \mathbf {z} \right) | = {\mathrm p_{}}(\mathbf {z}) \cdot |\det \left( \nabla _{\mathbf {z}} \boldsymbol{\phi }^{-1} (\mathbf {z}) \right) | \end{aligned}$$

(5)

implies that the information of $\mathbf {z}'$ is

$$\begin{aligned} I(\mathbf {z}')=-\log \left( {\mathrm p_{}}(\mathbf {z})\right) -\log \left( \left| \det \left( \nabla _\mathbf {z} \boldsymbol{\phi }^{-1}(\mathbf {z})\right) \right| \right) \; , \end{aligned}$$

(6)

where $\nabla _\mathbf {z} \boldsymbol{\phi }^{-1}(\mathbf {z})$ denotes the Jacobian matrix of the inverse function $\boldsymbol{\phi }^{-1}$. Thus, whenever a high value of $I(\mathbf {z})$ indicates an anomaly, there always exists another equivalent representation of the data $\mathbf {z}'$, where the information $I(\mathbf {z}')$ is low. In other words, if $\mathbf {z}$ is remote from other instances $\mathbf {z}_j$ of a dataset and therefore considered an anomaly, there will be a transformation $\mathbf {z}'=\boldsymbol{\phi }(\mathbf {z})$ that brings $\mathbf {z}'$ right into the center of the data $\mathbf {z}_j'=\boldsymbol{\phi }(\mathbf {z}_j)$. In fact, via the Rosenblatt transformation [Ros52] any representation $\mathbf {z}$ of the data can be expressed via a representation $\mathbf {z}'=\boldsymbol{\phi }(\mathbf {z})$ where $I(\mathbf {z}')$ is constant over all data points. This stresses that an anomaly always refers to both the probability and the encoding of the data $\mathbf {z}$. This is true for both the original data and its approximated lower-dimensional representation.
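A one-dimensional example may make this dependence on the encoding concrete. The sketch below assumes standard normal data and takes $\boldsymbol{\phi }$ to be the normal distribution function, which is the one-dimensional case of the Rosenblatt transformation; all function names are hypothetical. Applying Eq. (6) with a numerically differentiated $\boldsymbol{\phi }$, the information of the transformed point is approximately zero everywhere, although $I(\mathbf {z})$ grows quadratically in the original representation.

```python
import math

def normal_cdf(z):
    """phi: maps standard normal data to the unit interval."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def info_original(z):
    """I(z) = -log p(z) for the standard normal density."""
    return 0.5 * z * z + 0.5 * math.log(2.0 * math.pi)

def info_transformed(z, h=1e-5):
    """I(z') for z' = phi(z), computed via Eq. (6)."""
    # Derivative of phi at z by central differences.
    dphi = (normal_cdf(z + h) - normal_cdf(z - h)) / (2.0 * h)
    # In one dimension, the Jacobian of phi^{-1} is 1 / dphi.
    log_abs_det_inverse = -math.log(dphi)
    return info_original(z) - log_abs_det_inverse

for z in (0.0, 2.0, 4.0):
    print(z, info_original(z), info_transformed(z))
```

The point $z=4$ looks anomalous in the original encoding but is indistinguishable from $z=0$ after the transformation, up to the discretization error of the finite difference.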

As a side remark, autoencoders designed from neural networks have been very successfully applied in anomaly detection. Encoder and decoder networks possess the universal approximation property [Cyb89]. Furthermore, common training losses like the reconstruction error are invariant under a change of the representation on latent spaces. Therefore, additional insights seem to be required to explain the empirical success of anomaly detection with autoencoders which is, however, not the scope of this chapter.

Another way of looking at the issue of anomaly detection in the context of different representations of same data is an explicit choice of a reference measure. This reference measure represents to which extent, or how likely, data is contaminated by potential anomalies. Suppose we can associate the probability density ${\mathrm p_{}}^\text {anom}(\mathbf {z})$ to the reference measure, then we can base anomaly detection on the quotient of densities, i.e., the odds $\frac{{\mathrm p_{}}(\mathbf {z})}{{\mathrm p_{}}^\text {anom}(\mathbf {z})}$, and apply a threshold whenever this ratio is low or, equivalently, when the relative information

$$\begin{aligned} I^\text {rel}(\mathbf {z}) := -\log \left( \frac{{\mathrm p_{}}(\mathbf {z})}{{\mathrm p_{}}^\text {anom}(\mathbf {z})} \right) = I(\mathbf {z})-I^\text {anom}(\mathbf {z}) \end{aligned}$$

(7)

is high. We note that the relative information is invariant under changes of the representation $\mathbf {z}'=\boldsymbol{\phi }(\mathbf {z})$ as the $-\log |\det (\nabla _\mathbf {z}\boldsymbol{\phi }^{-1}(\mathbf {z}))|$ term from (6) occurs once with positive sign in $I(\mathbf {z}')$ and once with negative sign in $-I^\text {anom}(\mathbf {z}')$ and therefore cancels. Thus, the choice of a reference measure and the choice of a representation for the data are largely equivalent.

In practical situations, ${\mathrm p_{}}^\text {anom}(\mathbf {z})$ is often represented by some data $\{\mathbf {z}_i^\text {anom}\}_{i\in \mathcal {T}'}$ that are either simulated or drawn from some data source of known anomalies. A binary classifier $\hat{\mathrm p_{}}(\text {anom}|\mathbf {z})$ can then be trained on basis of proper data $\{\mathbf {z}_i\}_{i\in \mathcal {T}}$ and anomalous data $\{\mathbf {z}_i^\text {anom}\}_{i\in \mathcal {T}'}$. The assumed prior probability ${\mathrm p_{}}(\text {anom})$ for anomalies, i.e., the degree of contamination, acts as a threshold for the estimated odds. Equivalently, the estimate of the relative information

$$\begin{aligned} \hat{I}^\text {rel}(\mathbf {z})&=-\log \left( \frac{\hat{\mathrm p_{}}(\mathbf {z})}{\hat{\mathrm p_{}}^\text {anom}(\mathbf {z})}\right) {\mathop {=}\limits ^{\text {Bayes' Theorem}}} -\log \left( \frac{\hat{\mathrm p_{}}(\text {non-anom}|\mathbf {z}){\mathrm p_{}}(\mathbf {z})}{{\mathrm p_{}}(\text {non-anom})}\cdot \frac{{\mathrm p_{}}(\text {anom})}{\hat{\mathrm p_{}}(\text {anom}|\mathbf {z}){\mathrm p_{}}(\mathbf {z})}\right) \end{aligned}$$

(8)

$$\begin{aligned}&= -\log \left( \frac{1-\hat{\mathrm p_{}}(\text {anom}|\mathbf {z})}{\hat{\mathrm p_{}}(\text {anom}|\mathbf {z})} \cdot \frac{{\mathrm p_{}}(\text {anom})}{1-{\mathrm p_{}}(\text {anom})}\right) \end{aligned}$$

(9)

$$\begin{aligned}&=-\log \left( \frac{1-\hat{\mathrm p_{}}(\text {anom}|\mathbf {z})}{\hat{\mathrm p_{}}(\text {anom}|\mathbf {z})}\right) - \log \left( \frac{{\mathrm p_{}}(\text {anom})}{1-{\mathrm p_{}}(\text {anom})}\right) \end{aligned}$$

(10)

$$\begin{aligned}&=-\log \left( \frac{1-\hat{\mathrm p_{}}(\text {anom}|\mathbf {z})}{\hat{\mathrm p_{}}(\text {anom}|\mathbf {z})}\right) + c \end{aligned}$$

(11)

with the prior $\log $-odds $c=-\log \left( \frac{{\mathrm p_{}}(\text {anom})}{1-{\mathrm p_{}}(\text {anom})}\right) $ being a parameter controlling the threshold for the binary classifier $\hat{\mathrm p_{}}(\text {anom}|\mathbf {z})$.
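As a small numerical illustration of (9) to (11), the following sketch maps the output of a binary anomaly classifier and an assumed contamination level to the estimated relative information. The function name and the example values are hypothetical.

```python
import math

def relative_information(p_anom_given_z, prior_anom):
    """Estimated relative information from Eq. (11).

    p_anom_given_z: classifier output p(anom | z), strictly between 0 and 1.
    prior_anom: assumed degree of contamination p(anom), strictly between 0 and 1.
    """
    posterior_term = -math.log((1.0 - p_anom_given_z) / p_anom_given_z)
    c = -math.log(prior_anom / (1.0 - prior_anom))   # constant from the prior
    return posterior_term + c

# The same classifier output is more informative when anomalies are assumed rare.
print(relative_information(0.9, prior_anom=0.1))
print(relative_information(0.9, prior_anom=0.5))
```

When the classifier output equals the prior, the two terms cancel and the relative information is zero, so the prior indeed acts as the decision threshold.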

When specifying what is an exceptionally high value for the information $I(\mathbf {z})$ or relative information $I^\text {rel}(\mathbf {z})$, the distinction between the detection of outliers in the training data and the detection of novelties during inference has to be taken into account. In outlier detection, observations, which have high (relative) information but which are in agreement with the extreme value of the (relative) information

$$\begin{aligned} I^\text {max}=\max _{i\in \mathcal {T}}I(\mathbf {z}_i)~~\text {or}~~I^\text {rel max}=\max _{i\in \mathcal {T}}I^\text {rel}(\mathbf {z}_i), \end{aligned}$$

(12)

are usually intentionally not eliminated. An outlier $\mathbf {z}$ for the level of significance $0<\alpha <1$ can then be detected using the condition

$$\begin{aligned} {\mathrm P}_{\{\mathbf {z}_i\}_{i\in \mathcal {T}}}(I^\text {max}>I(\mathbf {z}))\le \alpha ~~\text {or}~~{\mathrm P}_{\{\mathbf {z}_i\}_{i\in \mathcal {T}}}(I^\text {rel max}>I^\text {rel}(\mathbf {z}))\le \alpha . \end{aligned}$$

(13)

Note again that the distribution of $I^\text {rel}(\mathbf {z}_j)$ has to be estimated to derive the associated distribution for the extreme values, see, e.g., [DHF07], and also $I^\text {rel}(\mathbf {z})$ requires the estimation $\hat{\mathrm p_{}}(\mathbf {z})$ or $\hat{\mathrm p_{}}^\text {anom}(\mathbf {z})$. Therefore, a quantification of the epistemic uncertainty is essential for a proper outlier detection. Given the already mentioned problems of density estimation in high dimension, epistemic uncertainties may play a major role, unless a massive amount of data is available.

For the case of novelty detection taking place at inference, a comparison of the information of a single instance $I^\text {rel}(\mathbf {z})$ with the usual distribution of information ${\mathrm P}_{\mathbf {z}_i}$ seems to be in order, which leads to the novelty criterion for level of significance $0<\alpha <1$

$$\begin{aligned} {\mathrm P}_{\mathbf {z}_i}(I(\mathbf {z}_i)>I(\mathbf {z}))\le \alpha ~~\text {or}~~{\mathrm P}_{\mathbf {z}_i}(I^\text {rel}(\mathbf {z}_i)>I^\text {rel}(\mathbf {z}))\le \alpha . \end{aligned}$$

(14)

As a variant to this criterion, $I^\text {rel}(\mathbf {z}_i)$ could also be replaced by the extreme value statistics over the number of inferences alleviating the problem of generating false novelties by multiple testing. What has been stated on the necessity to quantify the epistemic uncertainty for the case of outlier detection equally applies for novelty detection.
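The novelty criterion (14) can be approximated empirically by replacing the probability with a relative frequency over reference data. The sketch below assumes NumPy and that information values for a set of regular instances have already been estimated; the function name `is_novel` is hypothetical, and the epistemic uncertainty of the estimates is ignored.

```python
import numpy as np

def is_novel(info_new, info_reference, alpha=0.05):
    """Empirical version of criterion (14).

    info_new: (relative) information of the instance under test.
    info_reference: information values of regular instances z_i.
    Returns the decision and the estimated tail probability.
    """
    info_reference = np.asarray(info_reference, dtype=float)
    # Fraction of reference instances that are more informative than the new one.
    tail = float(np.mean(info_reference > info_new))
    return bool(tail <= alpha), tail
```

For instance, with reference values `np.arange(100.0)`, a new value of `99.5` is flagged as novel at level 0.05, whereas a value of `50.0` is not.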

While anomaly detection is generally seen as a sub-field of unsupervised learning, some specific effects occur in the case of novelty detection in supervised learning. During the phase of inference, the data $\mathbf {z}=(y,\mathbf {x})$ contain an unobserved component $y\in \mathcal {S}$, which, e.g., represents the instance’s label in a classification problem for the classes contained in $ \mathcal {S}$. Using the decomposition ${\mathrm p_{}}(\mathbf {z})={\mathrm p_{}}(y,\mathbf {x})={\mathrm p_{}}(y|\mathbf {x}){\mathrm p_{}}(\mathbf {x})$, one obtains the (relative) information from

$$\begin{aligned} I(\mathbf {z})=I(y|\mathbf {x})+I(\mathbf {x}),~~\text {or}~~I^\text {rel}(\mathbf {z})=I(y|\mathbf {x})+I^\text {rel}(\mathbf {x}) -I^\text {anom}(y|\mathbf {x}), \end{aligned}$$

(15)

where $I(y|\mathbf {x})=-\log ( {\mathrm p_{}}(y|\mathbf {x}))$ and $I^\text {anom}(y|\mathbf {x})=-\log ( {\mathrm p_{}}^\text {anom}(y|\mathbf {x}))$ are the conditional information terms on the right-hand side. Often, for the data of the reference measure ${\mathrm p_{}}^\text {anom}(\mathbf {z})$, the labels are not contained in $\mathcal {S}$. In this case, one uses a non-informative conditional distribution ${\mathrm p_{}}^\text {anom}(y|\mathbf {x})=\frac{1}{|\mathcal {S}|}$. If this is done, the last term of (15) becomes a constant that can be integrated into a threshold parameter.

The (relative) information cannot be computed without knowing y. Therefore, the conditional expectation is used as an unbiased estimate, yielding the expected information

$$\begin{aligned} EI(\mathbf {x})=\mathbb {E}_{y\sim {\mathrm p_{}}(y|\mathbf {x})}(I^\text {rel}(\mathbf {z}))= E(\mathbf {x})+I^\text {rel}(\mathbf {x})+b^\text {rel}, \end{aligned}$$

(16)

where $E(\mathbf {x})=\sum _{y\in \mathcal {S}} {\mathrm p_{}}(y|\mathbf {x}) I(y|\mathbf {x})$ is the expected information, or entropy, of the conditional distribution ${\mathrm p_{}}(y|\mathbf {x})$ and $b^\text {rel}$ is zero for the information and equal to $-\log (|\mathcal {S}|)$ for the relative information with non-informative conditional distribution ${\mathrm p_{}}^\text {anom}(y|\mathbf {x})$. Note that $E(\mathbf {x})$ is bounded by $\log (|\mathcal {S}|)$. Therefore, under normal circumstances, the term $I^\text {rel}(\mathbf {x})$ will outweigh $E(\mathbf {x})$ by far. However, in problems like semantic segmentation, each component of $\mathbf {x}$ is assigned a label from $\mathcal {S}$. This implies solving $|\mathcal {I}|$ classification problems, where $\mathcal {I}$ denotes the pixel space of $\mathbf {x}$, thus the maximum value for $E(\mathbf {x})$ is $|\mathcal {I}|\log (|\mathcal {S}|)$.

Therefore, the first term in (16) contains significant contributions, especially in situations where $|\mathcal {I}|$ is large. The second term, $I^\text {rel}(\mathbf {x})$, loses importance under the hypothesis that the probability of the inputs $\mathbf {x}$ does not vary greatly. Although this hypothesis could be supported by fair sampling strategies, it requires further critical evaluation. But at least to a significant part, the expected information as an anomaly measure with regard to instance $\mathbf {x}$ is given by a dispersion measure, namely the entropy of the conditional probability. As the entropy can be well estimated using a supervised machine learning approach to estimate $\hat{{\mathrm p_{}}}(y|\mathbf {x})$ from the data $\{\mathbf {z}_j\}_{j\in \mathcal {T}}$, this part of the information is well accessible in contrast to $I^\text {rel}(\mathbf {x})$, which requires density estimation in high dimension.
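The tractable entropy term can be computed directly from the softmax output of a segmentation network. The following sketch assumes NumPy and a probability tensor of shape (height, width, number of classes); the function name and the tensor layout are illustrative choices and not prescribed by the chapter.

```python
import numpy as np

def pixelwise_entropy(probs, eps=1e-12):
    """Entropy of p(y | x) for every pixel.

    probs: array of shape (H, W, C) whose last axis sums to one.
    Returns an (H, W) map with values between 0 and log(C).
    """
    probs = np.asarray(probs, dtype=float)
    # Clipping avoids log(0); zero-probability classes contribute nothing.
    log_probs = np.log(np.clip(probs, eps, 1.0))
    return -np.sum(probs * log_probs, axis=-1)

# A uniform prediction over 19 classes reaches the per-pixel maximum log(19),
# so the sum over all pixels equals |I| * log(|S|).
uniform = np.full((2, 3, 19), 1.0 / 19.0)
print(pixelwise_entropy(uniform).sum(), 6 * np.log(19))
```

Thresholding this entropy map pixel by pixel yields a simple anomaly segmentation score, with confidently predicted pixels close to zero.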

Lastly in this section, let us give a remark on the role of anomaly data $\{\mathbf {z}_j^\text {anom}\}_{j\in \mathcal {T}'} = \{\mathbf {x}_j^\text {anom}\}_{j\in \mathcal {T}'}$. If such data is available, it is desirable to train the machine learning model $\hat{{\mathrm p_{}}}(y|\mathbf {x})$ to produce high values for $E(\mathbf {x}_j^\text {anom})$ so that the tractable part of the expected information $EI(\mathbf {x})$ shows good separation properties. This requirement can be incorporated into the loss function, as proposed in [HAB19, HMKS19] for classification. In fact, as the entropy $E(\mathbf {x})$ is maximized by the uniform (non-informative) label distribution ${\mathrm p_{}}(y|\mathbf {x}_j^\text {anom})=\frac{1}{|\mathcal {S}|}$, the aforementioned loss will favor this prediction on anomalous inputs $\{\mathbf {x}_j^\text {anom}\}_{j\in \mathcal {T}'}$. In this chapter, in Sect. 4, we will extend this approach to the computer vision task of semantic segmentation, after having reviewed related works based on deep learning in the following section.
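One possible way to write such a loss is sketched below, assuming NumPy, flattened logits of shape (number of pixels, number of classes), and a hypothetical mixing weight `lam`. Regular pixels receive the usual cross-entropy, while anomalous pixels receive the cross-entropy with respect to the uniform label distribution, which is minimized exactly when the prediction is uniform. This is an illustration of the general idea only and not necessarily the exact objective used in Sect. 4.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def entropy_max_loss(logits_in, labels_in, logits_anom, lam=0.5):
    """Combined loss on regular and anomalous pixels.

    logits_in: (N, C) logits of regular pixels with labels_in of shape (N,).
    logits_anom: (M, C) logits of pixels from anomaly data.
    lam: assumed weight of the anomaly term, between 0 and 1.
    """
    logp_in = log_softmax(np.asarray(logits_in, dtype=float))
    logp_anom = log_softmax(np.asarray(logits_anom, dtype=float))
    labels = np.asarray(labels_in, dtype=int)
    # Standard cross-entropy on pixels with known labels.
    ce_in = -logp_in[np.arange(len(labels)), labels].mean()
    # Cross-entropy against the uniform distribution 1/|S| on anomalous pixels.
    ce_uniform = -logp_anom.mean(axis=-1).mean()
    return (1.0 - lam) * ce_in + lam * ce_uniform
```

With all-zero anomaly logits the second term equals $\log (|\mathcal {S}|)$, its minimum; any confident prediction on anomalous pixels increases the loss.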

3 Related Works

After the introduction to anomaly detection from a theoretical point of view, we now turn to anomaly detection in deep learning. In this section, we review research in the direction of detecting and learning unknown objects in semantic segmentation.

3.1 Anomaly Detection in Semantic Segmentation

An emerging body of work explores the detection of anomalous inputs on image data, where the task is more commonly referred to as anomaly or out-of-distribution (OoD) detection. Anomaly detection was first tackled in the context of image classification by introducing post-processing techniques applied to softmax probabilities to adjust the confidence values produced by a classification model [HG17, LLLS18, LLS18, HAB19, MH20]. These methods have proven to successfully lower confidence scores for anomalous inputs at image-level, which is why they were also adapted to anomaly detection in semantic segmentation [ACS19, BSN+19], i.e., to anomaly segmentation by treating each single pixel in an image as a potential anomaly. Although those methods represent good baselines, they usually do not generalize well to segmentation, e.g., due to the high prediction uncertainties at object boundaries. The latter problem can, however, be mitigated by using segment-wise prediction quality estimates [RCH+20], an approach which has also been shown to indicate anomalous regions within an image [ORF20].

Recent works have proposed more dedicated solutions to anomaly segmentation. Among the resulting methods, many originate from uncertainty quantification. The intuition is that anomalous regions in an image correlate with high uncertainty. In this regard, early approaches estimate uncertainty using Bayesian deep learning, treating model parameters as distributions instead of point estimates [Mac92, Nea96]. Due to the computational complexity, approximations are mostly preferred in practice, which comprise, e.g., Monte-Carlo dropout [GG16], stochastic batch normalization [AAM+19], or an ensemble of neural networks [LPB17, GDS20]; with some of them also being extended to semantic segmentation in [BKC17, KG17, MG19]. Even when using approximations, Bayesian models still tend to be computationally expensive. Thus, they are not well suited to real time semantic segmentation which is required for safe automated driving.

This is why tackling anomaly segmentation with non-Bayesian methods is more favorable from a practitioner’s point of view. Some approaches therefore include tuning a previously trained model to the task of anomaly detection, by either modifying its architecture or exploiting additional data. In [DT18], anomaly scores are learned by adding a separate branch to the neural network. In [HMD19, MH20] the network architecture is not changed but auxiliary outlier data, which is disjoint from the actual training data, is introduced into the training process to learn anomalies. The latter idea motivated several works in anomaly segmentation [BSN+19, BKOŠ19, JRF20, CRG21]. Nonetheless, such models have to cope with multiple tasks, hence possibly leading to a performance loss with respect to the original semantic segmentation task [VGV+21]. Moreover, when including outlier datasets in the training process, it cannot be guaranteed that the chosen outlier data is a good proxy for all possible anomalies.

Another recent line of works performs anomaly segmentation via generative models that reconstruct original input images. These methods assume that reconstructed images will better preserve the appearance of known image regions than that of unknown ones. Anomalous regions are then identified by means of pixel-wise discrepancies between the original and reconstructed image. Thus, such an approach is specifically designed for anomaly segmentation and has been extensively studied in [CM15, MVD17, LNFS19, XZL+20, LHFS21, BBSC21]. The main benefit of these approaches is that they do not require any OoD training data, allowing them to generalize to all possible anomalous objects. However, all these methods are limited by the integrated discrepancy module, i.e., the module that identifies relevant differences between the original and reconstructed image. In complex scenes, such as street scene images for automated driving, this might be a challenging task due to the open world setting.

Regarding the dataset landscape, only few anomaly segmentation datasets exist. The LostAndFound dataset [PRG+16] is a prominent example which contains anomalous objects in various streets in Germany while sharing the same setup as Cityscapes [COR+16]. LostAndFound, however, considers children and bicycles as anomalies, even though they are part of the Cityscapes training set. This was filtered and refined in Fishyscapes [BSN+19]. Another anomaly segmentation dataset accompanies the CAOS benchmark [HBM+20], which considers three object classes from BDD100k [YCW+20] as anomalies. Both Fishyscapes and CAOS try to mitigate low diversity by complementing their real images with synthetic data.

Efforts to provide anomalies in real images have been made in [LNFS19] by sourcing and annotating street scene images from the web and in [LHFS21, SKGK20] by capturing and annotating images with small objects placed on the road. Just recently, the datasets published alongside the SegmentMeIfYouCan benchmark [CLU+21] build upon those works, particularly contributing to broad diversity of anomalous street scenes as well as objects.

3.2 Incremental Learning in Semantic Segmentation

Building upon the detection of anomalies, training data can be enriched in order to learn novel classes. To avoid training from scratch, several approaches tackle the task of incremental or even continuous learning, which can be understood as adapting to continuously evolving environments. Besides learning novel classes, incremental learning also encompasses adapting to alternative tasks or other domains. A comprehensive framework to compare these different learning scenarios is provided in [vdVT19].

When learning novel classes, the primary issue incremental learning approaches face is a loss of the original performance on previously learned classes, which is commonly known as catastrophic forgetting [MC89]. To overcome this problem, a model needs to be both “stable” and “plastic”, i.e., the model needs to retain its original knowledge while being able to adapt to new environments. The difficulty of meeting these requirements at the same time is called the stability-plasticity dilemma [AR05]. In this regard, proposed solution strategies can be separated into three categories, which are either based on architecture, regularization, or rehearsal. Most of these methods have been applied to image classification first.

Architecture strategies employ separate models for each sequential incremental learning task, combined with a selector to determine which model will be used for inference [PUUH01, CSMDB19, ARC20]. However, these approaches suffer from data imbalances, so that standard classification algorithms tend to favor the majority class. Approaches to mitigate skewed data distributions are usually based on over- or undersampling. Another line of works, such as [RRD+16, RPR20], employs “growing” models, i.e., enlarging the model capacity by increasing the number of model parameters for more complex tasks. In [ACT17], the authors propose an automated approach to select the proper task-specific model at test time. More efficient approaches were introduced in [GK16, YYLH18], which restrict the adaptation of parameters to those parts of the model that are relevant for the new task. The Self-Net [MCE20] is made up of an autoencoder that learns low-dimensional representations of the models belonging to previously learned tasks. Retaining existing knowledge by approximating the old weights instead of saving them directly thereby comes with an implicit storage compression. The incremental adaptive deep model developed in [YZZ+19] enables capacity scalability and sustainability by exploiting the fast convergence of shallow models at the initial stage and afterwards utilizing the power of deep representations gradually. Other procedures perform continuous learning, e.g., using a random-forest [HCP+19], an incrementally growing DNN, retaining a basic backbone [SAR20], or nerve pruning and synapse consolidation [PTJ+21].

Regularization strategies can be further divided into weight regularization, i.e., measuring the importance of weights, and distillation, i.e., transferring a model’s knowledge into another. The former identifies parameters with great impact on the original tasks and suppresses their updates. Elastic weight consolidation (EWC) [KPR+17] is one representative method, evaluating weight importance based on the Fisher information matrix, while the synaptic intelligence (SI) method [ZPG17] calculates the cumulative change of Euclidean distance after retraining the model. Both regularization methods were further enhanced, e.g., by combining them [CDAT18, AM19], or by including unlabeled data [ABE+18]. Another idea to maintain model stability was adapted in [ZCCY21, FAML19], updating gradients based on orthogonal constraints. Bayesian neural networks are applied in [LKJ+17] to approximate a Gaussian distribution of the parameters from a single to a combined task.

Distillation is a regularization method in which the knowledge of an old model is transferred into a new model to partly overcome catastrophic forgetting. Knowledge distillation, proposed in [HVD14], was originally invented to transfer knowledge from a complex into a simple model. The earliest approach that applies knowledge distillation to incremental learning is called learning without forgetting (LwF) [LH18]. A combination of knowledge distillation and EWC was proposed in [SCL+18]. Further approaches based on a distillation loss are, e.g., [JJJK18, YHW+19, KBJC19, LLSL19].

Rehearsal or pseudo-rehearsal-based methods, which were already proposed in [Rob95], mitigate catastrophic forgetting by allowing the model to review old knowledge whenever it learns new tasks. While rehearsal-based methods retain a subset of the old training data, pseudo-rehearsal strategies construct a generator during retraining, which learns to produce pseudo-data as similar to the old training data as possible. Hence, they provide the advantages of rehearsal even if the previously learned information is unavailable. Methods reusing old data are, e.g., incremental classifier and representation learning (iCaRL) [RKSL17], which simultaneously learns classifiers and feature representation, or the method presented in [CMJM+18], which proposes a representative memory. The bias correction (BiC) method [WCW+19] keeps old data in a similar manner, but handles the data imbalance differently. Most pseudo-rehearsal approaches include generative adversarial networks (GANs) [OOS17, WCW+18, MSC+19, OPK+19] or a variational autoencoder (VAE) [SLKK17]. The method presented in [HPL+18] combines distillation and retrospective (DR), whereby baseline approaches such as LwF are outperformed by a large margin.

Only a few works exist, such as [KBDFs20, MZ21, TTA19], that adapt incremental learning techniques to semantic segmentation. They adjust knowledge distillation, using no or only a small portion of old data, respectively. One challenge of continuous learning for semantic segmentation is that images may contain unseen as well as known classes. Hence, annotations that are restricted to some task assign a large number of pixels to a background class, exhibiting a semantic distribution shift. The authors of [CMB+20] provide a framework that mitigates biased predictions towards this background class. While existing approaches require supervision, we employ incremental learning in a semi-supervised fashion, as we do not have access to any ground truth including novelties.

4 Anomaly Segmentation

The task of anomaly detection in the context of semantic segmentation, i.e., identifying anomalies at pixel level, is commonly known as anomaly segmentation. For this task, several approaches have been proposed that are based on uncertainty quantification, generative models, or training strategies specifically tailored to anomaly detection. In this section, we will first review some of those well-established methods and, subsequently, report a performance comparison with respect to their capability of identifying anomalies. In particular, we will demonstrate empirically that entropy maximization yields great performance on this segmentation task, which is in accordance with the statement of the entropy’s importance from the information-based perspective as presented in Sect. 2.

4.1 Methods

Let $\mathbf {x} \in \mathbb {I}^{H \times W \times 3}, \mathbb {I}= [0,1]$, denote (normalized) color images of resolution $H \times W$. Feeding those images to a semantic segmentation network $\mathbf {F}: \mathbb {I}^{H \times W \times 3} \rightarrow \mathbb {R}^{H \times W \times S} $, the model produces pixel-wise class scores $\mathbf {y} = (y_{i,s})_{i\in \mathcal {I}, s\in \mathcal {S}} = \mathbf {F}(\mathbf {x}) \in \mathbb {R}^{H \times W \times S}$, with the set of pixel locations denoted by $\mathcal {I} = \{ 1, \ldots , H\} \times \{ 1, \ldots , W \}$ and the set of trained (hence known) classes denoted by $\mathcal {S} = \{ 1, \ldots , S \}$. The corresponding predicted segmentation mask is given by $\mathbf {m} = (m_i)_{i \in \mathcal {I}} \in \{ 1,\ldots , S \}^{H \times W}$, where $ m_i = \arg \max _{s\in \mathcal {S}} y_{i,s} ~ \forall ~ i\in \mathcal {I}$, i.e., the maximum a-posteriori probability principle is applied. Regarding the task of anomaly segmentation, the ultimate goal is then to obtain a score map $\mathbf {a} = (a_i)_{i\in \mathcal {I}} \in \mathbb {R}^{H \times W}$ that indicates the presence of an anomaly at each pixel location $i \in \mathcal {I}$ within image $\mathbf {x}$, i.e., the higher the score, the more likely there is an anomaly.

Each of the methods employed in this section provides such score maps. Their underlying segmentation networks (DeepLabV3+, [CZP+18]) are all trained on Cityscapes [COR+16], i.e., objects not included in the set of Cityscapes object classes are considered as anomalies since they have not been seen during training and thus are unknown. The anomaly detection methods, however, differ in the way how the scores are obtained, which is why we briefly introduce the different techniques in the following.

Maximum softmax probability: The most commonly used baseline for anomaly detection at image level is thresholding at the maximum softmax probability (MSP) [HG17]. This method assumes that anomalies are assigned a low confidence or, equivalently, a high uncertainty. Using MSP in anomaly segmentation, the score map is computed via

$$\begin{aligned} a_i = 1 - \max _{s\in \mathcal {S}} \mathbf {softmax}(\mathbf {y}_{i}) \quad \forall ~ i\in \mathcal {I} ~. \end{aligned}$$

(17)
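As a minimal sketch of (17), the following NumPy snippet computes the MSP score map from an array of class scores. The function names `softmax` and `msp_score_map` as well as the toy logits are illustrative assumptions and not part of the original implementation.

```python
import numpy as np

def softmax(y, axis=-1):
    """Numerically stable softmax along the class axis."""
    z = y - y.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def msp_score_map(y):
    """Anomaly score a_i = 1 - max_s softmax(y_i) for class scores y of shape (H, W, S)."""
    return 1.0 - softmax(y).max(axis=-1)

# Toy example (hypothetical values): a 1 x 2 image with S = 3 classes.
y = np.array([[[10.0, 0.0, 0.0],    # confident pixel  -> low anomaly score
               [1.0, 1.0, 1.0]]])   # uncertain pixel  -> high anomaly score
a = msp_score_map(y)
```

For uniformly distributed softmax probabilities the score attains its maximum value $1 - 1/S$, which is why the second pixel in the toy example receives a score of $2/3$.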

ODIN: A simple extension to improve MSP is applying temperature scaling as well as adding perturbations, which is known as out-of-distribution detector for neural networks (ODIN) [LLS18]. In more detail, let $t \in \mathbb {R}_{>0}$ be a hyperparameter for temperature scaling and let $\varepsilon \in \mathbb {R}_{\ge 0}$ be a hyperparameter for the perturbation magnitude. Then, the input $\mathbf {x}$ is modified as

$$\begin{aligned} \tilde{\mathbf {x}} = (\tilde{x}_i)_{i \in \mathcal {I}} \quad \text {with} \quad \tilde{x}_i = x_i - \varepsilon \ \mathrm {sign} \left( -\frac{\partial }{\partial x_i} \log \max _{s\in \mathcal {S}} \mathbf {softmax}\left( \frac{\mathbf {y}_{i}}{t} \right) \right) \quad \forall ~ i\in \mathcal {I} ~, \end{aligned}$$

(18)

yielding the prediction $\tilde{\mathbf {y}} = \mathbf {F}(\tilde{\mathbf {x}})$ for which thresholding is applied at the MSP, i.e., the anomaly score map is given by

$$\begin{aligned} a_i = 1 - \max _{s\in \mathcal {S}} \mathbf {softmax}(\tilde{\mathbf {y}}_i) \quad \forall ~ i\in \mathcal {I} ~. \end{aligned}$$

(19)

Mahalanobis distance: This anomaly detection approach estimates how well latent features fit to those observed in the training data. Let $(L-1)$ denote the penultimate layer of a network $\mathbf {F}$ with L layers. In [LLLS18] the authors have shown that training a softmax classifier fits a class-conditional Gaussian distribution for the output features $\mathbf {f}_{L-1}$. Hence, under that assumption

$$\begin{aligned} {\mathrm P}\left( \mathbf {y}^{(L-1)}_{i} ~\Big |~ \overline{y}_{i,s} = 1 \right) = N \left( \mathbf {y}^{(L-1)}_{i} ~\Big |~ \boldsymbol{\mu }_s, \mathbf {\Sigma }_s \right) \quad \forall ~ i\in \mathcal {I} ~, \end{aligned}$$

(20)

where $\mathbf {y}^{(L-1)} = \mathbf {f}_{L-1}(\mathbf {x}) \in \mathbb {R}^{H \times W \times C_{L-1}}$ denotes the feature map of the penultimate layer given input $\mathbf {x}$, and $\overline{\mathbf {y}}$ the corresponding one-hot encoded final target. The minimal Mahalanobis distance $d_{\mathbf {\Sigma }_s}(\mathbf {x}, \boldsymbol{\mu }_s)$ is then an obvious choice for an anomaly score map

$$\begin{aligned} a_i = \min _{s\in \mathcal {S}} d_{\mathbf {\Sigma }_s}(\mathbf {x}, \boldsymbol{\mu }_s) = \min _{s\in \mathcal {S}} ( \mathbf {y}^{(L-1)}_{i} - \boldsymbol{\mu }_s )^\mathsf {T}\mathbf {\Sigma }_s^{-1} ( \mathbf {y}^{(L-1)}_{i} - \boldsymbol{\mu }_s ) \quad \forall ~ i\in \mathcal {I} ~, \end{aligned}$$

(21)

cf. (2). Note that the class means $\boldsymbol{\mu }_s \in \mathbb {R}^{C_{L-1}}$ and class covariances $\mathbf {\Sigma }_s \in \mathbb {R}^{C_{L-1} \times C_{L-1}}$ are generally unknown, but can be estimated by means of the training dataset.
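The estimation of class statistics and the score map in (21) can be sketched as follows. This is a sketch under stated assumptions: the helper names `fit_class_gaussians` and `mahalanobis_score_map`, the small ridge term `eps` added to the covariances for invertibility, the zero-based class indices, and the synthetic two-class features are all illustrative and not taken from the original work.

```python
import numpy as np

def fit_class_gaussians(features, labels, num_classes, eps=1e-6):
    """Estimate class means and inverse covariances from features (N, C) and labels (N,).

    Classes are indexed 0, ..., num_classes - 1 here (assumption for array indexing).
    """
    means, inv_covs = [], []
    for s in range(num_classes):
        f_s = features[labels == s]
        cov = np.cov(f_s, rowvar=False) + eps * np.eye(features.shape[1])
        means.append(f_s.mean(axis=0))
        inv_covs.append(np.linalg.inv(cov))
    return np.stack(means), np.stack(inv_covs)

def mahalanobis_score_map(feat_map, means, inv_covs):
    """a_i = min_s (f_i - mu_s)^T Sigma_s^{-1} (f_i - mu_s) for feat_map of shape (H, W, C)."""
    diff = feat_map[:, :, None, :] - means[None, None, :, :]       # (H, W, S, C)
    dist = np.einsum("hwsc,scd,hwsd->hws", diff, inv_covs, diff)   # (H, W, S)
    return dist.min(axis=-1)

# Synthetic training features (hypothetical): two classes in a 2-dimensional feature space.
rng = np.random.default_rng(0)
train_feats = np.concatenate([rng.normal(0.0, 1.0, (200, 2)),
                              rng.normal(5.0, 1.0, (200, 2))])
train_labels = np.repeat([0, 1], 200)
means, inv_covs = fit_class_gaussians(train_feats, train_labels, num_classes=2)

# 1 x 2 feature map: one pixel close to class 0, one pixel far from both classes.
feat_map = np.array([[[0.0, 0.0], [20.0, -20.0]]])
a = mahalanobis_score_map(feat_map, means, inv_covs)
```

In practice the features would be taken pixel-wise from the penultimate layer and the statistics would be estimated on the training dataset, as described above.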

Monte Carlo dropout: In semantic segmentation, Monte Carlo dropout represents the most prominent technique to approximate Bayesian neural networks. According to [MG19], (epistemic) uncertainty is measured as the mutual information, which can serve as anomaly score map, i.e.,

$$\begin{aligned} a_i = - \sum _{s \in \mathcal {S}} \left( \frac{1}{R} \sum _{r \in \mathcal {R}} p_{i,s}^{(r)} \right) \log \left( \frac{1}{R} \sum _{r \in \mathcal {R}} p_{i,s}^{(r)} \right) + \frac{1}{R} \sum _{s \in \mathcal {S}} \sum _{r \in \mathcal {R}} p_{i,s}^{(r)} \log p_{i,s}^{(r)} \quad \forall ~ i\in \mathcal {I}, \end{aligned}$$

(22)

with $\mathbf {p}_{i}^{(r)} = (p_{i,s}^{(r)})_{s\in \mathcal {S}} = \mathbf {softmax}(\mathbf {y}_{i}^{(r)})$ in the sampling round $r \in \mathcal {R} = \{ 1,\ldots , R\}$. Typically, $8 \le R \le 12$.
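To illustrate the score, the following sketch computes the mutual information from $R$ sampled softmax outputs, using the usual definition as the entropy of the averaged prediction minus the average per-sample entropy. The function name `mutual_information`, the clipping constant `eps`, and the toy probabilities are assumptions made for illustration only.

```python
import numpy as np

def mutual_information(probs, eps=1e-12):
    """Pixel-wise mutual information from softmax samples of shape (R, H, W, S)."""
    mean_p = probs.mean(axis=0)                                        # (H, W, S)
    entropy_of_mean = -(mean_p * np.log(mean_p + eps)).sum(axis=-1)    # (H, W)
    mean_of_entropy = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)
    return entropy_of_mean - mean_of_entropy

# Toy example (hypothetical): R = 2 dropout samples, 1 x 2 image, S = 2 classes.
probs = np.array([
    [[[0.5, 0.5], [1.0, 0.0]]],   # sample r = 1
    [[[0.5, 0.5], [0.0, 1.0]]],   # sample r = 2
])
a = mutual_information(probs)
```

In the first pixel both samples agree on an uncertain prediction, so the mutual information vanishes. In the second pixel the samples are individually confident but contradict each other, so the mutual information reaches $\log 2$.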

Void classifier: Neural networks can be trained to output confidences for the presence of anomalies [DT18]. One approach in this context is adding an extra class to the set $\mathcal {S}$ of previously trained classes of a semantic segmentation network, which then also requires annotated anomaly data to learn from. To this end, the void class in Cityscapes is a popular choice as proxy for all possible anomaly data [BSN+19], in particular if the segmentation model was originally trained on Cityscapes. Thus, the softmax output of the additional class $s=S+1$ represents the anomaly score map, i.e.,

$$\begin{aligned} a_i = \mathrm {softmax}_{s=S+1}(\mathbf {y}'_{i}) \quad \forall ~ i\in \mathcal {I} ~, \end{aligned}$$

(23)

where $\mathbf {y}' = (\mathbf {y}'_i)_{i\in \mathcal {I}} = (y'_{i,s})_{i\in \mathcal {I}, s\in \{1,\ldots , S+1\}} = \mathbf {F}'(\mathbf {x}), \mathbf {F}': \mathbb {I}^{H \times W \times 3} \rightarrow \mathbb {R}^{H \times W \times (S+1)}$.

Learned embedding density: Let $\mathbf {f}_{\ell }(\mathbf {x}) \in \mathbb {R}^{H_\ell \times W_\ell \times C_\ell }$ denote the feature map, or equivalently feature embedding, at layer $\ell \in \mathcal {L}=\{1,\ldots ,L\}$ of a semantic segmentation network. By employing normalizing flows, the true distribution of features $\mathbf {p}(\mathbf {f}_{\ell }(\mathbf {x})) \in \mathbb {I}^{H_\ell \times W_\ell }$, where $\mathbf {x} \in \mathcal {X}^\mathrm {train}$ is drawn from the training dataset, can be approximated via maximum likelihood training, i.e., normalizing flows learn to produce the approximation $\hat{\mathbf {p}}(\mathbf {f}_{\ell }(\mathbf {x})) \approx \mathbf {p}(\mathbf {f}_{\ell }(\mathbf {x}))$ [BSN+19]. At test time, the negative log-likelihood measures how well features of a test sample fit the feature distribution observed in the training data, yielding the anomaly score map

$$\begin{aligned} \mathbf {a} = \mathbf {up}^\mathrm {lin} \left( -\boldsymbol{\log } ~ \hat{\mathbf {p}}(\mathbf {f}_{\ell }(\mathbf {x})) \right) \quad \text {(}\boldsymbol{\log }~\text {applies }\log ~\text {element-wise)} \end{aligned}$$

(24)

with $\mathbf {up}^\mathrm {lin}: \mathbb {R}^{H_\ell \times W_\ell } \rightarrow \mathbb {R}^{H\times W}$ denoting (bi-)linear upsampling.

Image resynthesis: After obtaining the predicted segmentation mask $\mathbf {m} \in \{ 1,\ldots ,S\}^{H \times W}$, $\mathbf {m} = (m_i)_{i\in \mathcal {I}}$, this output can be further processed by a generative model $\mathbf {G}:\{ 1,\ldots ,S\}^{H \times W} \rightarrow \mathbb {I}^{H \times W \times 3} $ aiming to reconstruct the original input image, i.e., $\mathbf {x}'=\mathbf {G}(\mathbf {m}) \approx \mathbf {x}$. This process is also called image resynthesis, and the intuition is that the reconstruction quality for anomalous objects is worse than for objects on which the generative model was trained. To determine pixel-wise anomalies, a discrepancy network [LNFS19] $\mathbf {D}:\{ 1,\ldots ,S\}^{H \times W} \times \mathbb {I}^{H \times W \times 3} \times \mathbb {I}^{H \times W \times 3} \rightarrow \mathbb {R}^{H \times W}$ can then be employed, which classifies whether a pixel is anomalous or not, based on information provided by $\mathbf {m}, \mathbf {x}'$, and $\mathbf {x}$. Here, $\mathbf {D}$ is trained on intentionally triggered classification mistakes that are produced by flipping classes on predicted segmentation masks. The anomaly score map is given by the output of the discrepancy network, i.e.,

$$\begin{aligned} \mathbf {a} = \mathbf {D}(\mathbf {m}, \mathbf {x}', \mathbf {x}) = \mathbf {D}(\mathbf {m}, \mathbf {G}(\mathbf {m}), \mathbf {x}) ~. \end{aligned}$$

(25)

SynBoost: The image resynthesis approach is limited by the employed discrepancy module $\mathbf {D}$. In [BBSC21], the authors proposed to extend the discrepancy network by incorporating further inputs based on uncertainty, such as the pixel-wise softmax entropy

$$\begin{aligned} H_i(\mathbf {x}) = -\sum _{s \in \mathcal {S}} \mathrm {softmax}_s(\mathbf {y}_{i}) \log (\mathrm {softmax}_s(\mathbf {y}_{i})) \quad \forall ~ i\in \mathcal {I}, \end{aligned}$$

(26)

and the pixel-wise softmax probability margin

$$\begin{aligned} M_i(\mathbf {x}) = 1 - \max _{s\in \mathcal {S}} \left( \mathbf {softmax}(\mathbf {y}_{i}) \right) + \max _{s\in \mathcal {S} \setminus \{ m_i\}} \left( \mathbf {softmax}(\mathbf {y}_{i}) \right) \quad \forall ~ i\in \mathcal {I} ~. \end{aligned}$$

(27)

Furthermore, $\mathbf {D}$ is trained on anomaly data provided by the Cityscapes void class. Thus, the anomaly score map is given by

$$\begin{aligned} \mathbf {a} = \mathbf {D}(\mathbf {m}, \mathbf {x}', \mathbf {x}, \mathbf {H}(\mathbf {x}), \mathbf {M}(\mathbf {x})) = \mathbf {D}(\mathbf {m}, \mathbf {G}(\mathbf {m}), \mathbf {x}, \mathbf {H}(\mathbf {x}), \mathbf {M}(\mathbf {x})) ~, \end{aligned}$$

(28)

with $\mathbf {H}(\mathbf {x}) = (H_i(\mathbf {x}))_{i\in \mathcal {I}}$ and $\mathbf {M}(\mathbf {x}) = (M_i(\mathbf {x}))_{i\in \mathcal {I}}$.
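The two uncertainty inputs (26) and (27) can be sketched in NumPy as follows. The function names and the toy logits are illustrative assumptions; the discrepancy network $\mathbf {D}$ itself is not reproduced here.

```python
import numpy as np

def softmax(y, axis=-1):
    """Numerically stable softmax along the class axis."""
    z = y - y.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_entropy(y, eps=1e-12):
    """Pixel-wise softmax entropy H_i for class scores y of shape (H, W, S), cf. (26)."""
    p = softmax(y)
    return -(p * np.log(p + eps)).sum(axis=-1)

def probability_margin(y):
    """Pixel-wise margin M_i = 1 - (largest probability) + (second largest probability), cf. (27)."""
    p_sorted = np.sort(softmax(y), axis=-1)
    return 1.0 - p_sorted[..., -1] + p_sorted[..., -2]

# Toy example (hypothetical values): 1 x 2 image with S = 3 classes.
y = np.array([[[10.0, 0.0, 0.0],    # confident pixel
               [1.0, 1.0, 1.0]]])   # maximally uncertain pixel
H = softmax_entropy(y)
M = probability_margin(y)
```

Both quantities are small for confident predictions and attain their maxima, $\log S$ and $1$, respectively, for uniformly distributed softmax probabilities.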

Entropy maximization: A desirable property of semantic segmentation networks is that they attach high prediction uncertainty to novel objects. To this end, the softmax entropy, see (26), is one intuitive uncertainty measure. The segmentation network can be trained for high entropy on anomalous inputs via the multi-criteria training objective [CRG21]

$$\begin{aligned} J^\mathrm {total} = (1-\lambda ) \mathbb {E}_{(\overline{\mathbf {y}}, \mathbf {x}) \sim \mathcal {D}} [J^\mathrm {CE}(\mathbf {F}(\mathbf {x}), \overline{\mathbf {y}})] + \lambda \mathbb {E}_{\mathbf {x} \sim \mathcal {D^\mathrm {anom}}} [J^\mathrm {anom}(\mathbf {F}(\mathbf {x}))]~, \end{aligned}$$

(29)

where $\mathcal {D}$ denotes non-anomaly training data (labels available) and $\mathcal {D^\mathrm {anom}}$ denotes anomaly training data (no labels available). In this approach, the COCO dataset [LMB+14] represents a set of so-called known unknowns, which is used as proxy for $\mathcal {D^\mathrm {anom}}$ with the aim to represent all possible anomaly data. Moreover, $\lambda \in \mathbb {I}$ is a hyperparameter controlling the impact of each single loss function on the overall loss $J^\mathrm {total}$. For non-anomaly data, the loss function is chosen to be the commonly-used cross-entropy $J^\mathrm {CE}$, while for anomaly data, i.e., for known unknowns, we have

$$\begin{aligned} J^\mathrm {anom}(\mathbf {F}(\mathbf {x})) = - \frac{1}{H \cdot W} \sum _{i \in \mathcal {I}} \frac{1}{S} \sum _{s \in \mathcal {S}} \log \mathrm {softmax}_s(\mathbf {y}_{i})~, \quad \mathbf {x} \sim \mathcal {D^\mathrm {anom}} ~. \end{aligned}$$

(30)

Therefore, minimizing $J^\mathrm {anom}$ is equivalent to maximizing the softmax entropy since both reach their optimum when the softmax probabilities are uniformly distributed, i.e., $\mathrm {softmax}_s(\mathbf {y}_{i}) = \frac{1}{S} ~ \forall ~ s\in \mathcal {S}, i\in \mathcal {I}$. After training, the anomaly score map is then given by the (normalized) softmax entropy

$$\begin{aligned} a_i = \frac{1}{\log S} H_i(\mathbf {x}) = - \frac{1}{\log S}\sum _{s \in \mathcal {S}} \mathrm {softmax}_s(\mathbf {y}_{i}) \log (\mathrm {softmax}_s(\mathbf {y}_{i})) \quad \forall ~ i\in \mathcal {I} ~. \end{aligned}$$

(31)

From an information-based point of view, the entropy contributes significantly to the expected information. This particularly applies to instance predictions in semantic segmentation, which motivates the entropy maximization approach for the detection of unknown objects, cf. Sect. 2.
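As a rough illustration of (30) and (31), the following sketch evaluates the anomaly loss and the normalized entropy score on toy logits. The function names and the example values are assumptions; the actual training additionally combines $J^\mathrm {anom}$ with the cross-entropy loss as in (29).

```python
import numpy as np

def log_softmax(y):
    """Numerically stable log-softmax along the last (class) axis."""
    z = y - y.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def anomaly_loss(y):
    """J^anom: negative log-softmax averaged over all pixels and classes, y of shape (H, W, S)."""
    return -log_softmax(y).mean()

def normalized_entropy(y):
    """Anomaly score a_i = H_i / log(S), which lies in [0, 1]."""
    logp = log_softmax(y)
    return -(np.exp(logp) * logp).sum(axis=-1) / np.log(y.shape[-1])

# Toy logits (hypothetical): 1 x 1 images with S = 3 classes.
uniform_logits = np.zeros((1, 1, 3))
peaked_logits = np.array([[[10.0, 0.0, 0.0]]])

loss_uniform = anomaly_loss(uniform_logits)   # equals log(S), the minimum of J^anom
loss_peaked = anomaly_loss(peaked_logits)     # larger, since the prediction is far from uniform
score_uniform = normalized_entropy(uniform_logits)
score_peaked = normalized_entropy(peaked_logits)
```

The uniform prediction simultaneously minimizes the loss and maximizes the entropy score, which reflects the equivalence stated above.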

4.2 Evaluation and Comparison of Anomaly Segmentation Methods

Discriminating between anomaly and non-anomaly is essentially a binary classification problem. In order to evaluate the pixel-wise anomaly detection capability, we use the receiver operating characteristic (ROC) curve as well as the precision-recall (PR) curve. While for the ROC curve the true positive rate is plotted against the false positive rate at varying thresholds, in the PR curve precision is plotted against recall at varying thresholds. Note that we consider anomalies as the positive class, i.e., correctly identified anomaly pixels are considered as true positives. In both curves, the degree of separability is then measured by the area under the curve (AUC), where better separability corresponds to a higher AUC.

The main difference between these two performance metrics is how they cope with class imbalance. While the ROC curve incorporates the number of true negatives (for the computation of the false positive rate), in PR curves true negatives are ignored and, consequently, more emphasis is put on finding the positive class. With the anomaly score maps as defined in Sect. 4.1, in our case, finding the positive class corresponds to identifying anomalies.
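The metrics can be sketched for a flat list of pixel scores as follows. This is a simplified illustration and not the benchmark's evaluation code: the function name `roc_pr_metrics` is hypothetical, tied scores are not treated specially, the area under the PR curve is approximated by the average precision, and both classes are assumed to be present.

```python
import numpy as np

def roc_pr_metrics(scores, labels):
    """Return (AuROC, AuPRC, FPR at 95% TPR) for scores (N,) and labels (N,), 1 = anomaly."""
    order = np.argsort(-scores, kind="stable")     # descending anomaly score
    sorted_labels = labels[order].astype(float)
    tp = np.cumsum(sorted_labels)                  # true positives per threshold
    fp = np.cumsum(1.0 - sorted_labels)            # false positives per threshold
    tpr = tp / tp[-1]                              # recall
    fpr = fp / fp[-1]                              # uses the number of true negatives
    precision = tp / (tp + fp)                     # ignores true negatives
    tpr0 = np.concatenate([[0.0], tpr])
    fpr0 = np.concatenate([[0.0], fpr])
    auroc = np.sum((fpr0[1:] - fpr0[:-1]) * (tpr0[1:] + tpr0[:-1]) / 2.0)  # trapezoidal rule
    auprc = np.sum((tpr0[1:] - tpr0[:-1]) * precision)                     # average precision
    fpr95 = fpr[np.searchsorted(tpr, 0.95)]        # first threshold reaching 95% TPR
    return auroc, auprc, fpr95

# Toy example (hypothetical values): six pixels, three of them anomalous.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
labels = np.array([1, 1, 0, 1, 0, 0])
auroc, auprc, fpr95 = roc_pr_metrics(scores, labels)
```

For this toy example the three metrics evaluate to $8/9$, $11/12$, and $1/3$, respectively.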

As evaluation datasets, we use LostAndFoundNoKnown [BSN+19] and RoadObstacle21 [CLU+21], which are both part of the public SegmentMeIfYouCan anomaly segmentation benchmark.¹ LostAndFoundNoKnown consists of 1043 road scene images where obstacles are placed on the road. This dataset is a subset of the prominent LostAndFound dataset [PRG+16] but considers only obstacles from object classes which are disjoint to those in the Cityscapes labels [COR+16]. More precisely, images with humans and bicycles are removed such that the remaining obstacles in the dataset also represent anomalies to models trained on Cityscapes. Similar scenes can be found in RoadObstacle21. That dataset was published alongside the SegmentMeIfYouCan benchmark and contains 327 road obstacle scene images with diverse road surfaces as well as diverse types of anomalous objects. Both datasets restrict the region of interest to the road where anomalies appear. This task is extremely safety-critical as it is mandatory in automated driving to make sure that the drivable area is free of any hazard. All anomaly segmentation methods introduced in the preceding Sect. 4.1 are suited to be evaluated on these datasets. We provide a visual comparison of anomaly scores produced by the tested methods in Fig. 1. We report numerical results in Fig. 2 and in the corresponding Table 1.

In general, we observe that anomaly detection methods originally designed for image classification, including MSP, ODIN, and Mahalanobis, do not generalize well to anomaly segmentation. As the Mahalanobis distance is based on statistics of the Cityscapes dataset, the anomaly detection is likely to suffer from performance loss under domain shift. The same holds for Monte Carlo dropout and learned embedding density, particularly resulting in poor performance on RoadObstacle21, which contains various road surfaces. Therefore, those methods potentially act as domain shift classifiers rather than as detectors of unknown objects.

The detection methods based on autoencoders, namely image resynthesis and SynBoost, prove to be better suited for the task of anomaly segmentation, clearly being superior to all the approaches that already have been discussed. Autoencoders are limited by their discrepancy module, and we observe that anomaly detection performance significantly benefits from incorporating uncertainty measures, as done by SynBoost. Only entropy maximization reaches similar anomaly segmentation performance, even outperforming SynBoost on RoadObstacle21. This again can be explained by the diversity of road surfaces, which detrimentally affects the discrepancy module.

As a final remark, we draw attention to the use of anomaly data. The void classifier follows the same intuition as entropy maximization by including known unknowns, but does not reach nearly as good anomaly segmentation performance. We conclude that the COCO dataset is better suited as proxy for anomalous objects than the Cityscapes unlabeled objects. Moreover, the results of entropy maximization empirically demonstrate the impact of the entropy in anomaly segmentation, which is in accordance with the statement of the entropy’s importance from the information-based perspective described in Sect. 2.

Table 1

Pixel-wise anomaly detection performance on the datasets LostAndFoundNoKnown and RoadObstacle21, respectively. The main evaluation metric represents the area under precision-recall curve (AuPRC). Moreover, the area under receiver operating characteristic (AuROC) and the false positive rate at a true positive rate of 95% (FPR$_{95\text {TPR}}$) are reported for further insights

Method	LostAndFoundNoKnown			RoadObstacle21
Method	AuPRC $\uparrow $	AuROC $\uparrow $	FPR$_{95\text {TPR}}$ $\downarrow $	AuPRC $\uparrow $	AuROC $\uparrow $	FPR$_{95\text {TPR}}$ $\downarrow $
Maximum Softmax	30.1	93.0	33.2	10.0	95.5	17.9
ODIN	52.9	95.1	30.0	11.9	96.0	16.4
Mahalanobis	55.0	97.5	12.9	19.5	95.1	21.7
Monte Carlo Dropout	36.8	92.2	35.5	4.9	83.5	50.3
Void classifier	4.8	79.5	47.0	10.4	89.7	41.5
Embedding density	61.7	98.0	10.4	0.8	81.0	46.4
Image resynthesis	42.7	96.4	17.4	37.5	98.6	4.7
SynBoost	81.7	98.3	4.6	71.3	99.4	3.2
Entropy maximization	77.9	98.0	9.7	76.0	99.7	1.3

4.3 Combining Entropy Maximization and Meta Classification

Meta classification is the task of discriminating between a true positive prediction and a false positive prediction. For semantic segmentation, this idea was originally proposed in [RCH+20]. By means of hand-crafted metrics, which are based on dispersion measures, geometry features, or location information, all derived from softmax probabilities, meta classifiers have been shown to reliably identify incorrect predictions at segment level. More precisely, connected components of pixels sharing the same class label are considered as segments in this context, and a false positive segment then corresponds to a segment-wise intersection-over-union ($\mathrm {IoU}$) of 0.

The meta classification approach can straightforwardly be adapted to post-process anomaly segmentation masks. This seems particularly reasonable in combination with entropy maximization. Since entropy maximization generally increases the sensitivity towards predicting anomalies, it is possible that the entropy is also increased at pixels belonging to non-anomalous objects. In the latter case, this would yield false positive anomaly instance predictions, which, however, can be identified and discarded afterwards by meta classification. The concept of trading false-positive detection for anomaly detection performance is motivated by [CRH+20]. Moreover, meta classifiers are expected to considerably benefit from entropy maximization, since in the original work [RCH+20] the entropy as metric has already been observed to be well correlated to the segment-wise $\mathrm {IoU}$.

Table 2

Detection errors at object level for LostAndFound anomalies at different anomaly score / entropy thresholds $\tau $ (to generate anomaly segmentation masks). The quantities false positives (FP) and false negatives (FN) are reported at segment level, with anomalies as positive class. The F$_1$ score summarizes these quantities into an overall measure. By $\delta $ we denote the performance loss on the original task, which is the semantic segmentation of Cityscapes. In this context, we consider a performance loss of 1% as acceptable, particularly in view of a significantly improved anomaly detection performance

Anomaly score/entropy threshold	Entropy maximization $+$ thresholding				Entropy maximization $+$ thresholding $+$ meta classifier
$a_i \ge \tau , i\in \mathcal {I}$	FP $\downarrow $	FN $\downarrow $	F$_1$ $\uparrow $	$\delta $ in % $\downarrow $	FP $\downarrow $	FN $\downarrow $	F$_1$ $\uparrow $	$\delta $ in % $\downarrow $
$\tau =0.30$	8,068	191	0.26	0.30	290	308	0.82	0.06
$\tau =0.40$	4,035	289	0.39	0.11	251	359	0.81	0.03
$\tau =0.50$	1,215	415	0.60	0.04	145	447	0.80	0.02
$\tau =0.60$	327	613	0.69	0.02	49	619	0.76	0.02
$\tau =0.70$	135	879	0.61	0.01	21	881	0.63	0.01

In our experiments on LostAndFound [PRG+16], we employ a logistic regression as meta classifier that is applied as a post-processing step on top of softmax probabilities. We observe that the meta classifier is capable of reliably removing false positive anomaly instance predictions, which in turn significantly improves detection performance of anomalous objects. The meta classification performance is reported in Table 2, and a visual example is given in Fig. 3. We note that meta classification is applied to segmentation masks as input. Therefore, the combination of entropy maximization and meta classification does not yield pixel-wise anomaly scores to compare against the methods presented in Sect. 4.1.

The idea of meta classification can even be used to directly identify potential anomalous objects in the semantic segmentation mask, see [ORG18], which will be subject to discussion in the following section about unsupervised learning of unknown objects.

5 Discovering and Learning Novel Classes

If certain types of anomalies appear frequently, it might be reasonable to include them as additional learnable classes of the segmentation model. In this section, we propose an unsupervised method to further process anomaly predictions, all with the goal to produce labels corresponding to novel classes. Afterwards, we will introduce an incremental learning approach to train a model on novel classes by means of the retrieved unsupervised labels.

5.1 Unsupervised Identification and Segmentation of a Novel Class

Consider the dataset $ \mathcal {D^\mathrm {test}} \subseteq \mathcal {X} $ of unlabeled images $ \mathbf {x} = (x_i)_{i\in \mathcal {I}} \in \mathbb {I}^{H \times W \times 3}$, along with a semantic segmentation network $\mathbf {F}: \mathbb {I}^{H \times W \times 3} \rightarrow \mathbb {R}^{H \times W \times S} $ trained on the set of classes $\mathcal {S} = \{ 1, \ldots , S \}$. Moreover, let $\mathbf {a} = (a_i)_{i\in \mathcal {I}} \in \mathbb {R}^{H \times W}$ denote a score map, as introduced in Sect. 4, which assigns the degree of anomaly to each pixel $i \in \mathcal {I}$ in $\mathbf {x}$. Our unsupervised anomaly segmentation technique is a three-step procedure:

1. Image embedding: Image retrieval methods are commonly applied to construct a database of images that are visually related to a given image. On that account, such methods must quantify visual similarities, i.e., measure the discrepancy or “distance” between images. A simple idea is averaging over the pixel-wise differences. However, this approach is extremely sensitive to data transformations such as rotation, variation in light, or different resolutions. More advanced approaches make use of visual descriptors that extract the elementary characteristics of the visual contents, e.g., color, shape, or texture. These methods are invariant to data transformations, i.e., they perform well in identifying images representing the same item. If we want to detect different instances of the same category, deep learning methods represent the state of the art. In this regard, convolutional neural networks (CNNs) achieve very high accuracy in image classification tasks. These networks extract image features that are stable with respect to transformations as well as to the represented object itself, i.e., objects of the same category result in similar feature vectors. We now adapt this idea to identify anomalies that belong to the same class.

Let $\mathcal {K}_{\mathbf {a}|\mathbf {x}}$ denote the set of connected components within $(a_i^{(\tau )})_{i \in \mathcal {I}}, a_i^{(\tau )} := \boldsymbol{1}_{ \{ a_i \ge \tau \} } ~ \forall ~ i\in \mathcal {I}$ for a given threshold $\tau \in \mathbb {R}$, after processing image $\mathbf {x}$. Furthermore, let $\mathcal {K} := \bigcup _{\mathbf {x} \in \mathcal {D^\mathrm {test}}} \mathcal {K}_{\mathbf {a}|\mathbf {x}}$ denote the set of all predicted anomaly components in $\mathcal {D^\mathrm {test}}$. For each component $k\in \mathcal {K}_{\mathbf {a}|\mathbf {x}}$, we tailor the input $\mathbf {x}$ to the image crop $\mathbf {x}^{(k)} = (x_i)_{i\in \mathcal {I}'}, \mathcal {I}' \subseteq \mathcal {I}$ by means of the bounding box around $k\in \mathcal {K}_{\mathbf {a}|\mathbf {x}}$. By feeding the crop $\mathbf {x}^{(k)}$ to an image classification network $\mathbf {G}$, we map $\mathbf {x}^{(k)}$ onto its feature vector $\mathbf {g}^{(k)} := \mathbf {G}_{L-1}(\mathbf {x}^{(k)}) \in \mathbb {R}^n$, $n\in \mathbb {N}$ for all $k\in \mathcal {K}$. Here, $\mathbf {G}_{L-1}$ denotes the output of the penultimate layer of $\mathbf {G}$.
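The extraction of anomaly components and their bounding-box crops can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, components are taken to be 4-connected (the connectivity is not specified above), and the subsequent feature extraction by the classification network $\mathbf {G}$ is omitted.

```python
import numpy as np
from collections import deque

def connected_components(mask):
    """Label the 4-connected components of a boolean mask; returns (labels, count)."""
    height, width = mask.shape
    labels = np.zeros((height, width), dtype=int)
    count = 0
    for i in range(height):
        for j in range(width):
            if mask[i, j] and labels[i, j] == 0:
                count += 1
                labels[i, j] = count
                queue = deque([(i, j)])
                while queue:                      # breadth-first flood fill
                    r, c = queue.popleft()
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        rr, cc = r + dr, c + dc
                        if (0 <= rr < height and 0 <= cc < width
                                and mask[rr, cc] and labels[rr, cc] == 0):
                            labels[rr, cc] = count
                            queue.append((rr, cc))
    return labels, count

def anomaly_crops(image, score_map, tau):
    """Threshold the score map at tau and crop the image to each component's bounding box."""
    labels, count = connected_components(score_map >= tau)
    crops = []
    for k in range(1, count + 1):
        rows, cols = np.nonzero(labels == k)
        crops.append(image[rows.min():rows.max() + 1, cols.min():cols.max() + 1])
    return crops

# Toy example (hypothetical values): a 5 x 6 image with two separate anomaly blobs.
image = np.zeros((5, 6, 3))
score_map = np.zeros((5, 6))
score_map[0:2, 0:2] = 0.9
score_map[3:5, 4:6] = 0.8
crops = anomaly_crops(image, score_map, tau=0.5)
```

Each crop would then be resized as required and passed through the penultimate layer of the classification network to obtain its feature vector.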

2. Dimensionality reduction: Feature vectors extracted by CNNs are usually very high-dimensional. This evokes several problems regarding the clustering of such data. The first issue is known as the curse of dimensionality, i.e., the amount of required data explodes with increasing dimensionality. Furthermore, distance metrics become less precise. Dimensionality reduction approaches project the feature vectors onto a low-dimensional representation, either by feature elimination, selection, or extraction. The latter creates new independent features as a combination of the original vectors and can be further distinguished between linear and non-linear techniques. A linear feature extraction approach, named principal component analysis (PCA) [Pea01], aims at decorrelating the components of the vectors by a change of basis, such that they are mostly aligned along the first axes. Thereby, not much information is lost if we drop the last components. A more recent non-linear method is t-distributed stochastic neighbor embedding (t-SNE) [vdMH08], which uses conditional probabilities representing pairwise similarities. Let us consider two feature vectors $\mathbf {g}^{(k)}$, $\mathbf {g}^{(k')}$ with $k,k'\in \mathcal {K}$ and let $\mathrm {p}_{k|k'} \in \mathbb {I}$ denote their similarity under a Gaussian distribution. Employing a Student t-distribution with one degree of freedom in the low-dimensional space then provides a second probability $\mathrm {q}_{k|k'} \in \mathbb {I}$. Hence, t-SNE aims at minimizing the following sum (or Kullback-Leibler divergence) [vdMH08]

$$\begin{aligned} \sum _{k\in \mathcal {K}}\sum _{k'\in \mathcal {K}}\mathrm {p}_{k|k'} \log \left( \frac{\mathrm {p}_{k|k'}}{\mathrm {q}_{k|k'}} \right) \end{aligned}$$

(32)

using gradient descent. We first perform dimensionality reduction via PCA, which is then followed by t-SNE. In our experiments, we observed that this combination of methods improves the effectiveness of mapping anomaly predictions onto a two-dimensional embedding space. Here, the embedding ideally creates neighborhoods of visually related anomalies.
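To make the objective in (32) concrete, the following sketch evaluates it for a given pair of high- and low-dimensional point sets. It is a simplified illustration under stated assumptions: a single fixed Gaussian bandwidth `sigma` and row-wise normalization are used for both similarities, whereas t-SNE as described in [vdMH08] selects a bandwidth per point via a perplexity parameter and symmetrizes the probabilities. The function names are hypothetical, and no gradient descent is performed.

```python
import math


def gaussian_similarities(points, sigma=1.0):
    """Row-normalized Gaussian similarities in the high-dimensional space."""
    n = len(points)
    p = []
    for k in range(n):
        weights = [
            0.0 if j == k
            else math.exp(-math.dist(points[k], points[j]) ** 2 / (2 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def student_t_similarities(points):
    """Row-normalized Student-t similarities (one degree of freedom)."""
    n = len(points)
    q = []
    for k in range(n):
        weights = [
            0.0 if j == k
            else 1.0 / (1.0 + math.dist(points[k], points[j]) ** 2)
            for j in range(n)
        ]
        total = sum(weights)
        q.append([w / total for w in weights])
    return q


def kl_objective(p, q):
    """Sum over all pairs of p * log(p / q); zero-probability terms are skipped."""
    return sum(
        p_kj * math.log(p_kj / q_kj)
        for p_row, q_row in zip(p, q)
        for p_kj, q_kj in zip(p_row, q_row)
        if p_kj > 0.0
    )


# Two nearby feature vectors and one distant feature vector.
features = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
good_embedding = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]  # keeps the neighbors close
bad_embedding = [(0.0, 0.0), (5.0, 5.0), (0.0, 1.0)]   # separates the neighbors

p = gaussian_similarities(features)
loss_good = kl_objective(p, student_t_similarities(good_embedding))
loss_bad = kl_objective(p, student_t_similarities(bad_embedding))
```

An embedding that preserves the neighborhood structure yields a smaller objective value than one that separates neighboring points, which is the behavior the minimization exploits.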

3. Novelty segmentation: If anomalies of the same category are detected more frequently, they are expected to form a bigger cluster in the embedding space. Those clusters can be identified by employing algorithms such as density-based spatial clustering of applications with noise (DBSCAN) [EKSX96]. This algorithm fits the unsupervised setting since it does not require any prior information about the potential anomaly data, such as the number of clusters. Moreover, DBSCAN divides data points into core points, border points, and noise, depending on the size of the neighborhood $\varepsilon \in [0,\infty )$ and the minimal number of a core point’s neighbors $\delta \in \mathbb {N}$.

More precisely, let $\tilde{\mathbf {g}}^{(k)} \in \mathbb {R}^2$ denote the two-dimensional representation of $\mathbf {x}^{(k)}$. Then, $\tilde{\mathbf {g}}^{(k)}$ is considered as a core point, if the corresponding point-wise density $\rho (\tilde{\mathbf {g}}^{(k)}) := |\{ \tilde{\mathbf {g}}^{(k')} ~:~\Vert \tilde{\mathbf {g}}^{(k)} - \tilde{\mathbf {g}}^{(k')}\Vert < \varepsilon , k'\in \mathcal {K}\}| \ge \delta $, i.e., the $\varepsilon $-neighborhood of $\tilde{\mathbf {g}}^{(k)}$ contains at least $\delta $ points including itself. We denote the neighborhood of a core point $\tilde{\mathbf {g}}^{(\mathring{k})}$, which corresponds to a component $\mathring{k} \in \mathcal {K}$, as $ B_{\mathring{k}} := \{ \tilde{\mathbf {g}}^{(k')} : \Vert \tilde{\mathbf {g}}^{(\mathring{k})} - \tilde{\mathbf {g}}^{(k')}\Vert < \varepsilon , k'\in \mathcal {K} \}. $ If $\tilde{\mathbf {g}}^{(k)}$ is not a core point but belongs to a core point’s neighborhood, we call it a border point. Otherwise, i.e., if $\tilde{\mathbf {g}}^{(k)}$ is neither a core point nor within a core point’s neighborhood, we call it noise.
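The three point types defined above can be illustrated with a short sketch. It follows the definitions in the text (strict inequality, and a point counts towards its own neighborhood); the function name `classify_points` and the toy data are assumptions made for illustration only.

```python
import math


def classify_points(points, eps, delta):
    """Label each embedded point as 'core', 'border', or 'noise'."""
    n = len(points)
    # eps-neighborhoods with strict inequality; every point contains itself.
    neighborhoods = [
        [j for j in range(n) if math.dist(points[i], points[j]) < eps]
        for i in range(n)
    ]
    is_core = [len(neighbors) >= delta for neighbors in neighborhoods]
    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighborhoods[i]):
            # The distance is symmetric, so i lies in the neighborhood of core j.
            labels.append("border")
        else:
            labels.append("noise")
    return labels


embedded = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.9, 0.0), (10.0, 10.0)]
point_types = classify_points(embedded, eps=1.0, delta=3)
```

In this toy example, the two middle points have three points in their neighborhoods and become core points, their outer neighbors become border points, and the isolated point is noise.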

Finally, a cluster $\mathcal {C}_j \subset \mathcal {K}, j \in \mathcal {J} := \{ 1, \ldots , J \}$ of components is formed by merging overlapping neighborhoods $B_{\mathring{k}}$, yielding $J\in \mathbb {N}$ clusters in total. In other words, clusters are formed from connected core points and their neighborhoods’ border points. Given $\rho (\tilde{\mathbf {g}}^{(k)})$, we can determine the cluster density of $\mathcal {C}_j$, e.g.,

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-031-01233-4_10/MediaObjects/514228_1_En_10_Equ37_HTML.png

The cluster $\mathcal {C}^*\subset \mathcal {K}$, which is the cluster of highest density given a sufficient cluster size, is then selected to be further processed. To this end, let us consider the predicted segmentation mask $\mathbf {F}(\mathbf {x})=\mathbf {m} = (m_i)_{i\in \mathcal {I}}$, where $m_i = \arg \max _{s\in \mathcal {S}} y_{i,s}, i \in \mathcal {I}$. The pseudo labels $\tilde{\mathbf {y}} = (\tilde{y}_i)_{i\in \mathcal {I}}$ for the originally unlabeled $\mathbf {x}$ are then obtained by setting $\tilde{y}_i = S + 1$ if pixel location i belongs to a component $k \in \mathcal {C}^*$, and $\tilde{y}_i = m_i$ otherwise.
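The pseudo-label rule above amounts to overwriting the predicted mask on the pixels of the selected components. A minimal sketch, assuming masks are nested lists of class indices and components are lists of `(row, col)` pixel locations (both representations and the function name are illustrative choices):

```python
def make_pseudo_labels(predicted_mask, cluster_components, num_known_classes):
    """Set pixels of components in the selected cluster to class S + 1."""
    novel_class = num_known_classes + 1
    labels = [row[:] for row in predicted_mask]  # copy, keep m_i elsewhere
    for pixels in cluster_components:
        for y, x in pixels:
            labels[y][x] = novel_class
    return labels


mask = [[1, 1, 2],
        [3, 3, 2]]
selected = [[(0, 0), (0, 1)]]  # one component belonging to the chosen cluster
pseudo = make_pseudo_labels(mask, selected, num_known_classes=17)
```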

5.2 Class-Incremental Learning

Let $\tilde{\mathcal {Y}}$ denote the set of pseudo labels, then the training data for some novel class $S+1$ can be represented by $\mathcal {D}^{S+1} \subseteq \mathcal {D}^\mathrm {novel} \times \tilde{\mathcal {Y}} $, where $\mathcal {D}^\mathrm {novel}$ denotes the set of previously-unseen images containing novel classes. By extending the semantic segmentation network $\mathbf {F}$ to $\mathbf {F}^+: \mathbb {I}^{H \times W \times 3} \rightarrow \mathbb {R}^{H \times W \times (S+1)}$ and retraining $\mathbf {F}^+$ on $\mathcal {D}^{S+1}$, we perform incremental learning to add a novel and previously unknown class to the semantic space of $\mathbf {F}$.

Regularization: Knowledge distillation is a subcategory of regularization strategies that aim to mitigate catastrophic forgetting, i.e., these strategies try to limit the performance loss on the previously-learned classes $\mathcal {S} = \{1, \ldots , S\}$ while learning the additional class $S+1$. In [MZ19], the authors adapted incremental learning techniques to the task of semantic segmentation. Among others, they introduced the overall objective

$$\begin{aligned} J^\mathrm {total}(\mathbf {x},\tilde{\mathbf {y}}) = (1-\lambda ) J^\mathrm {CE}(\mathbf {F}^+(\mathbf {x}), \tilde{\mathbf {y}}) + \lambda J^\mathrm {D}(\mathbf {F}^+(\mathbf {x}), \mathbf {F}(\mathbf {x})), ~~ \lambda \in \mathbb {I}\; , \end{aligned}$$

(33)

where $(\mathbf {x},\tilde{\mathbf {y}}) \in \mathcal {D}^{S+1}$. Here, $J^\mathrm {CE}$ denotes the common cross-entropy loss over the enlarged set of class indices $\mathcal {S}^+ := \{1,\ldots ,S+1\}$ and $J^\mathrm {D}$ the distillation loss. The latter loss is defined as

$$\begin{aligned} J^\mathrm {D}(\mathbf {F}^+(\mathbf {x}), \mathbf {F}(\mathbf {x})) := -\frac{1}{H\cdot W} \sum _{i\in \mathcal {I}}\sum _{s\in \mathcal {S}} \mathrm {softmax}_s(\mathbf {y}_{i})\log (\mathrm {softmax}_s(\mathbf {y}^+_{i})) \end{aligned}$$

(34)

with $\mathbf {y} = \mathbf {F}(\mathbf {x})$ and $\mathbf {y}^+ = \mathbf {F}^+(\mathbf {x})$. Knowledge distillation can be further improved by freezing the weights of the encoder part of $\mathbf {F}^+$ during the training procedure [MZ19].
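The losses in (33) and (34) can be written out for a handful of pixels as follows. This is a sketch under assumptions rather than the implementation of [MZ19]: pixels are flattened into a list of length $H\cdot W$, the softmax of $\mathbf {y}^+_i$ is taken over all $S+1$ logits, the cross-entropy is averaged over pixels, and class labels are 1-based. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(old_logits, new_logits):
    """Eq. (34): old_logits[i] has S entries, new_logits[i] has S + 1 entries."""
    num_old = len(old_logits[0])
    loss = 0.0
    for y_i, y_plus_i in zip(old_logits, new_logits):
        teacher = softmax(y_i)
        student = softmax(y_plus_i)  # assumption: softmax over all S + 1 classes
        loss -= sum(teacher[s] * math.log(student[s]) for s in range(num_old))
    return loss / len(old_logits)


def cross_entropy(new_logits, pseudo_labels):
    """Pixel-averaged cross-entropy; labels are in {1, ..., S + 1}."""
    loss = 0.0
    for y_plus_i, label in zip(new_logits, pseudo_labels):
        loss -= math.log(softmax(y_plus_i)[label - 1])
    return loss / len(new_logits)


def total_loss(old_logits, new_logits, pseudo_labels, lam=0.5):
    """Eq. (33): convex combination of cross-entropy and distillation loss."""
    ce = cross_entropy(new_logits, pseudo_labels)
    kd = distillation_loss(old_logits, new_logits)
    return (1.0 - lam) * ce + lam * kd


# Two pixels, S = 2 old classes, one novel class.
old = [[2.0, 0.0], [0.0, 1.0]]
agree = [[2.0, 0.0, -5.0], [0.0, 1.0, -5.0]]     # extended model keeps old scores
disagree = [[0.0, 2.0, -5.0], [1.0, 0.0, -5.0]]  # extended model flips old scores
labels = [1, 2]
```

The distillation term is small when the extended network reproduces the old network's class distribution on the old classes and grows when the two disagree, which is how it counteracts forgetting.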

Rehearsal: If the original training data $\mathcal {D}^\mathrm {train} \subseteq \mathcal {X} \times \mathcal {Y}$ of network $\mathbf {F}$ is available, it is usually re-integrated into the training set of the extended network $\mathbf {F}^+$ in incremental learning, i.e., the training samples are drawn from $\mathcal {D}^\mathrm {train} \cup \mathcal {D}^{S+1}$. To save computational costs of training and to balance the amount of old and new training data, established methods, e.g., [Rob95], only use a subset of $\mathcal {D}^\mathrm {train}$. This subset is typically obtained by randomly sampling a set from $\mathcal {D}^\mathrm {train}$ whose size matches $|\mathcal {D}^{S+1}|$.

In combination with knowledge distillation, rehearsal strategies can be employed to mitigate a loss of performance on classes that are related to the novel class. This issue may arise, e.g., through visual similarity, such as between the classes bus and train, or due to class affiliation, as in the case of bicycle and rider. Relevant classes can be identified by their frequency of being predicted on the relabeled pixels, i.e.,

$$\begin{aligned} \nu _s^\mathrm {tot} := \sum _{(\mathbf {x},\tilde{\mathbf {y}})\in \mathcal {D}^{S+1}}| \{i\in \mathcal {I}~|~m_i = s~\wedge ~\tilde{y}_i= S+1\}| ~~ \forall s\in \mathcal {S} \; , \end{aligned}$$

(35)

and hence

$$\begin{aligned} \nu _s^\mathrm {rel} := \frac{\nu _s^\mathrm {tot}}{\sum _{s'\in \mathcal {S}}\nu _{s'}^\mathrm {tot}}~~ \forall s\in \mathcal {S} \; . \end{aligned}$$

(36)

The subset of $\mathcal {D}^\mathrm {train}$ is then randomly sampled under the constraint that there are at least $\nu _s^\mathrm {rel}|\mathcal {D}^{S+1}|$ images containing the class s for all $s\in \mathcal {S}$.
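The frequencies in (35) and (36) and the constrained sampling can be sketched as below. The counting follows the two equations directly; the sampling strategy (first satisfy each per-class minimum, then fill up randomly) is only one possible way to meet the constraint and is an assumption, as are the function names and the representation of the training set as a mapping from image identifiers to the classes each image contains. The greedy strategy may return more images than requested if the minima cannot be met otherwise.

```python
import math
import random
from collections import Counter


def relative_frequencies(samples, novel_class):
    """Eqs. (35)-(36): samples are (predicted_mask, pseudo_labels) pairs, flattened."""
    counts = Counter()
    for mask, labels in samples:
        for m_i, y_i in zip(mask, labels):
            if y_i == novel_class:
                counts[m_i] += 1
    total = sum(counts.values())
    # Classes that never occur on relabeled pixels have frequency 0 and are omitted.
    return {s: c / total for s, c in counts.items()}


def sample_rehearsal_subset(train_images, nu_rel, subset_size, seed=0):
    """Greedy sampling with at least nu_rel[s] * subset_size images per class s."""
    rng = random.Random(seed)
    chosen = set()
    for s, nu in nu_rel.items():
        required = math.ceil(nu * subset_size)
        have = sum(1 for i in chosen if s in train_images[i])
        candidates = [i for i in sorted(train_images)
                      if s in train_images[i] and i not in chosen]
        rng.shuffle(candidates)
        chosen.update(candidates[:max(0, required - have)])
    remaining = [i for i in sorted(train_images) if i not in chosen]
    rng.shuffle(remaining)
    chosen.update(remaining[:max(0, subset_size - len(chosen))])
    return chosen


# Two pseudo-labeled images with novel class 18 (flattened masks).
novel_samples = [([11, 11, 16, 3], [18, 18, 18, 3]),
                 ([16, 16, 15, 5], [18, 18, 5, 5])]
nu_rel = relative_frequencies(novel_samples, novel_class=18)

train_images = {"a": {11, 3}, "b": {16}, "c": {16, 11},
                "d": {3}, "e": {16, 5}, "f": {5}}
subset = sample_rehearsal_subset(train_images, nu_rel, subset_size=5)
```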

5.3 Experiments and Evaluation

In the following experiments, we will employ a DeepLabV3+ [CZP+18] model with an underlying WiderResNet38 [ZSR+19] backbone for semantic segmentation. This network is initially trained on a set of 17 classes, which we will extend by a novel class. The already trained classes are the Cityscapes training classes except pedestrian and rider, i.e., we exclude any human in the training process of our initial semantic segmentation network $\mathbf {F}$.

The initial model was trained on the Cityscapes [COR+16] training data. For the incremental learning process, we use a portion of those data and combine them with our generated disjoint training set $\mathcal {D}^{S+1}$ containing previously unseen images and pseudo labels on novel objects. Here, the images from $\mathcal {D}^{S+1}$ are drawn from the Cityscapes test data. For evaluation purposes, we use the Cityscapes validation data. Hence, during the incremental learning process, the only unknown objects presented to the model are humans and a few instances belonging to the Cityscapes void category, such as the ego-car or mountains in an image background.

We use the idea of meta classification, similar to the approach introduced in Sect. 4.3, to rate the quality of predicted semantic segmentation masks. Here, the meta task is to first estimate the segment-wise IoU, see Fig. 4, to which we then apply thresholding (at $\tau =0.5$) to determine potential anomalies, cf. [ORF20]. We employ gradient boosting as the meta model, which achieves a coefficient of determination of $\mathrm {R}^2 = 82.51\%$ in estimating the segment-wise IoU on the Cityscapes validation split.

In accordance with Sect. 2 and as already observed in Sect. 4.3, the softmax entropy is again one of the main metrics included in the meta model to identify anomalous predictions. The entropy thus has a great impact on meta classification performance, which has similarly been observed in [CRH+20, CRG21].

Given anomaly segmentation masks, we perform image embedding using the encoder of the image classification network DenseNet201 [HLMW17], which is pretrained on ImageNet [DDS+09]. Next, we reduce the dimensionality of the resulting feature vectors to 50 via PCA and further to 2 by applying t-SNE. In [ORF20], a qualitative and quantitative evaluation of different embedding approaches is provided. Note that t-SNE is non-deterministic, i.e., we obtain slightly different embedding spaces for different runs. In our experiment, employing DBSCAN with parameters $\varepsilon =2.5$ and $\delta =15$ produces a human cluster including 91 components from 76 different images. The most frequently predicted classes of these components are car, motorcycle, and bicycle with $\nu _{11}^\mathrm {rel} = 24.84\%$, $\nu _{15}^\mathrm {rel} = 26.69\%$ and $\nu _{16}^\mathrm {rel} = 33.53\%$, respectively, see Fig. 5.
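The reduction and clustering chain with the stated parameters (50 PCA components, two t-SNE dimensions, $\varepsilon =2.5$, $\delta =15$) could be assembled with scikit-learn as sketched below. Using scikit-learn is an assumption, since the chapter does not name the library; the random features merely stand in for DenseNet201 feature vectors, and the function name `embed_and_cluster` is hypothetical. Note that scikit-learn's `min_samples` counts the point itself, matching the definition of $\delta $ above, while its neighborhood test uses a non-strict inequality.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def embed_and_cluster(features, eps=2.5, min_samples=15, seed=0):
    """PCA to 50 dimensions, t-SNE to 2 dimensions, then DBSCAN."""
    reduced = PCA(n_components=50, random_state=seed).fit_transform(features)
    # Fixing random_state makes a single run reproducible; t-SNE results
    # still differ between seeds.
    embedding = TSNE(n_components=2, random_state=seed).fit_transform(reduced)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embedding)
    return embedding, labels


# Synthetic stand-in for feature vectors of 200 anomaly components.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 128))
embedding, labels = embed_and_cluster(features)
# labels[i] == -1 marks noise; non-negative values are cluster indices.
```

On real features, suitable values for `eps` and `min_samples` depend on the scale of the t-SNE embedding, so the values from the experiment may not transfer to other data.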

We train the extended model $\mathbf {F}^+$ as described in Sect. 5.2 for 70 epochs, weighting the loss functions in (33) equally, i.e., $\lambda =0.5$. The extended model shows the ability to retain its initial knowledge by achieving an $\mathrm {mIoU}$ score of $68.24\%$ on the old classes when evaluating on the Cityscapes validation data. This corresponds to a marginal loss of only $0.39\%$ compared to the initial model $\mathbf {F}$. At the same time, $\mathbf {F}^+$ predicts the novel human class with a class $\mathrm {IoU}$ of $41.42\%$, without a single annotated human instance in the training data $\mathcal {D}^{S+1}$. A visual example of our unsupervised novelty segmentation approach is provided in Fig. 6, and more details on the numerical evaluation are given in Table 3.
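For reference, the class-wise IoU and the mIoU can be computed as in the following sketch, assuming the standard definition (intersection over union of predicted and ground-truth pixels per class, averaged over classes). Masks are flattened lists here, and the function names are illustrative.

```python
def class_iou(prediction, ground_truth, cls):
    """Intersection over union for one class; None if the class is absent."""
    intersection = sum(1 for p, g in zip(prediction, ground_truth)
                       if p == cls and g == cls)
    union = sum(1 for p, g in zip(prediction, ground_truth)
                if p == cls or g == cls)
    return intersection / union if union > 0 else None


def mean_iou(prediction, ground_truth, classes):
    """Average the class-wise IoU over all classes that occur."""
    scores = [class_iou(prediction, ground_truth, c) for c in classes]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores)


predicted = [1, 1, 2, 2, 3]
annotated = [1, 2, 2, 2, 3]
iou_class_1 = class_iou(predicted, annotated, 1)
miou = mean_iou(predicted, annotated, classes=[1, 2, 3])
```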

Table 3

Evaluation of the Cityscapes validation split before and after incrementally learning the novel class human (highlighted in gray) with knowledge distillation and rehearsal. The classes pedestrian and rider are aggregated to the novel class human. For the data from $\mathcal {D}^{S+1}$, all other classes are treated as background, i.e., they are ignored during training

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-031-01233-4_10/MediaObjects/514228_1_En_10_Tab3_HTML.png

5.4 Outlook on Improving Unsupervised Learning of Novel Classes

In the preliminary experiments presented in this section, we demonstrated that a semantic segmentation network can be extended by a novel class in an unsupervised fashion. As a starting point, our unsupervised learning approach requires anomaly segmentation masks. Currently, these are obtained by meta classification [RCH+20], which is, however, not a method specifically tailored for the task of anomaly segmentation. In other words, the obtained masks are possibly inaccurate. To be more precise about the limitation of plain meta classification, this method is only able to find anomalies when the segmentation model produces a (false positive) object prediction on those anomalies. By design, meta classifiers cannot find overlooked instances, e.g., obstacles on the road that have themselves been classified as road. As an illustration of this issue, we refer to Fig. 7.

With several methods now at hand, e.g., those introduced in Sect. 4.1, it seems natural to replace the underlying anomaly segmentation method by a more sophisticated one in future work. In particular, given the decent performance of our unsupervised learning approach relying only on meta classification, and given that the entropy measure is a highly beneficial metric for meta classification, combining entropy maximization and meta classification is a promising way to improve the presented novelty training approach.

Conclusions

Semantic segmentation as a supervised learning task is typically performed by models that operate on a given set containing a fixed number of classes. This is in clear contrast to the open-world scenarios in which practitioners contemplate using segmentation models. There are important capabilities that standard segmentation models do not exhibit. Among them is the capability to know when they face an object of a class they have not learned, i.e., to perform anomaly segmentation, as well as the capability to realize that similar objects, presumably of the same (yet unknown) class, appear frequently and should be learned either as a new class or be attributed to an existing one. In this chapter, we have seen first promising results for two tasks: anomaly segmentation as well as the detection and unsupervised learning of new classes.

For anomaly segmentation, we considered a number of generic baseline methods stemming from image classification as well as some recent anomaly segmentation methods. Since the latter clearly outperform the former, this stresses the need for the development of methods specifically designed for anomaly segmentation. We have demonstrated with our entropy maximization method, empirically as well as theoretically, that good proxies in combination with training on anomaly examples for high entropy are key to equipping standard semantic segmentation models with anomaly segmentation capabilities. Particularly on the challenging RoadObstacle21 dataset with diverse street scenarios, entropy maximization yields a performance that is not reached by any other method so far. While there exists a moderate number of datasets for anomaly segmentation, there is clearly still a need for additional datasets. The number of possible unknown object classes not covered by these datasets is evidently enormous. Furthermore, the vast variety of possible environmental conditions and further domain shifts that may occur, possibly in combination with unknown objects, calls for continued exploration.

For detection and unsupervised learning of new classes, we demonstrated in preliminary experiments that a combination of well-established dimensionality reduction and clustering methods along with the advanced uncertainty quantification method for semantic segmentation called MetaSeg is well able to detect unknown classes whose objects appear relatively frequently in a given test set. Indeed, MetaSeg can also be used to define segmentation proposals for pseudo ground truth of new classes, which can then be learned incrementally by the segmentation model. For the considered scenario of subsequently learning humans within the Cityscapes dataset, this approach yields an $\mathrm {IoU}$ of $41.42\%$ on the novel class with only a marginal loss of performance on the original classes. The proposed methodology may help to incorporate new classes into existing models with low human labeling effort. The need for this will arise repeatedly in the future. One example of a global phenomenon is the electric scooters that recently appeared in several metropolitan areas across the globe. However, local phenomena, such as boat trailers at the coast, could also be of interest. Such classes can be initially incorporated into an existing model using our methodology. Afterwards, the initial performance could be further improved with active learning approaches, such as [CRGR21], still requiring only a small amount of human labeling effort. It is also an open question to what extent the proposed method can be used iteratively to improve the performance on a new class. For this line of research as well, the lack of data is currently a limiting factor.

Acknowledgements

The research leading to these results is funded by the German Federal Ministry for Economic Affairs and Energy within the projects “KI Absicherung - Safe AI for Automated Driving”, grant no. 19A19005R, and “KI Delta Learning - Scalable AI for Automated Driving”, grant no. 19A19013Q, respectively. The authors would like to thank the consortia for the successful cooperation. The authors also gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this project by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUWELS at Jülich Supercomputing Centre (JSC).

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

www.segmentmeifyoucan.com.

[AAM+19]

A. Atanov, A. Ashukha, D. Molchanov, K. Neklyudov, D. Vetrov, Uncertainty estimation via stochastic batch normalization, in Proceedings of the International Symposium on Neural Networks (ISNN) (Moscow, Russia, 2019), pp. 261–269

[ABE+18]

R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, T. Tuytelaars, Memory Aware Synapses: Learning what (not) to forget, in Proceedings of the European Conference on Computer Vision (ECCV) (2018), pp. 139–154

[ACS19]

M. Angus, K. Czarnecki, R. Salay, Efficacy of Pixel-Level OOD Detection for Semantic Segmentation (November 2019), pp. 1–13. arXiv:1911.02897

[ACT17]

R. Aljundi, P. Chakravarty, T. Tuytelaars, Expert gate: lifelong learning with a network of experts, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Honolulu, HI, USA, July 2017), pp. 7120–7129

[AGLR21]

H. Asatryan, H. Gottschalk, M. Lippert, M. Rottmann, A Convenient Infinite Dimensional Framework for Generative Adversarial Learning (January 2021), pp. 1–29. arXiv:2011.12087

[AHK01]

C. Aggarwal, A. Hinneburg, D. Keim, On the surprising behavior of distance metrics in high dimensional spaces, in Proceedings of the International Conference on Database Theory (London, UK, January 2001), pp. 420–434

[AM19]

M. Amer, T. Maul, Reducing Catastrophic Forgetting in Modular Neural Networks by Dynamic Information Balancing (December 2019), pp. 1–15. arXiv:1912.04508

[AR05]

W. Abraham, A. Robins, Memory retention – the synaptic stability versus plasticity dilemma. Trends Neurosci. 28(2), 73–78 (2005)

[ARC20]

S. Agarwal, A. Rattani, C.R. Chowdary, A-iLearn: An adaptive incremental learning model for spoof fingerprint detection. Machine Learning with Applications, 7, 100210 (2020)

[AY01]

C.C. Aggarwal, P.S. Yu, Outlier detection for high dimensional data, in Proceedings of the ACM SIGMOD International Conference on Management of Data (MOD) (May 2001), pp. 37–46

[BBSC21]

G. Di Biase, H. Blum, R. Siegwart, C. Cadena, Pixel-Wise Anomaly Detection in Complex Driving Scenes, in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2021), pp. 16918–16927

[BKC17]

V. Badrinarayanan, A. Kendall, R. Cipolla, Bayesian SegNet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding, in Proceedings of the British Machine Vision Conference (BMVC) (2017), pp. 1–12

[BKNS00]

M.M. Breunig, H.-P. Kriegel, R.T. Ng, J. Sander, LOF: identifying density-based local outliers. ACM SIGMOD Rec. 29(2), 93–104 (2000)

[BKOŠ19]

P. Bevandić, I. Krešo, M. Oršić, S. Šegvić, Simultaneous semantic segmentation and outlier detection in presence of domain shift, in Proceedings of the German Conference on Pattern Recognition (GCPR) (Dortmund, Germany, 2019), pp. 33–47

[BSN+19]

H. Blum, P.-E. Sarlin, J. Nieto, R. Siegwart, C. Cadena, Fishyscapes: a benchmark for safe semantic segmentation in autonomous driving, in Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops (Seoul, Korea, October 2019), pp. 2403–2412

[CDAT18]

A. Chaudhry, P.K. Dokania, T. Ajanthan, P.H.S. Torr, Riemannian walk for incremental learning: understanding forgetting and intransigence, in Proceedings of the European Conference on Computer Vision (ECCV) (Munich, Germany, September 2018), pp. 556–572

[CLU+21]

R. Chan, K. Lis, S. Uhlemeyer, H. Blum, S. Honari, R. Siegwart, P. Fua, M. Salzmann, M. Rottmann, SegmentMeIfYouCan: a benchmark for anomaly segmentation, in Proceedings of the Conference on Neural Information Processing Systems (NIPS/NeurIPS) – Datasets and Benchmarks Track, virtual conference (December 2021), pp. 1–35

[CM15]

C. Creusot, A. Munawar, Real-time small obstacle detection on highways using compressive RBM road reconstruction, in Proceedings of the IEEE Intelligent Vehicles Symposium (IV) (Seoul, Korea, June 2015), pp. 162–167

[CMB+20]

F. Cermelli, M. Mancini, S. Rota Bulo, E. Ricci, B. Caputo, Modeling the background for incremental learning in semantic segmentation, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), virtual conference (June 2020), pp. 9233–9242

[CMJM+18]

F.M. Castro, M.J. Marín-Jiménez, N. Guil, C. Schmid, K. Alahari, End-to-End Incremental Learning, in Proceedings of the European Conference on Computer Vision (ECCV) (2018), pp. 233–248

[COR+16]

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The Cityscapes dataset for semantic urban scene understanding, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Las Vegas, NV, USA, June 2016), pp. 3213–3223

[CRG21]

R. Chan, M. Rottmann, H. Gottschalk, Entropy maximization and meta classification for out-of-distribution detection in semantic segmentation, in Proceedings of the IEEE International Conference on Computer Vision (ICCV), virtual conference (October 2021), pp. 5128–5137

[CRGR21]

P. Colling, L. Roese-Koerner, H. Gottschalk, M. Rottmann, MetaBox+: a new region based active learning method for semantic segmentation using priority maps, in Proceedings of the International Conference on Pattern Recognition Applications and Methods (ICPRAM), virtual conference (February 2021), pp. 51–62

[CRH+20]

R. Chan, M. Rottmann, F. Hüger, P. Schlicht, H. Gottschalk, Controlled false negative reduction of minority classes in semantic segmentation, in Proceedings of the International Joint Conference on Neural Networks (IJCNN), virtual conference (July 2020), pp. 1–8

[CSMDB19]

A. Chefrour, L. Souici-Meslati, I. Difi, N. Bakkouche, A novel incremental learning algorithm based on incremental vector support machina and incremental neural network learn++. Revue d’Intelligence Artificielle 33(3), 181–188 (2019)

[Cyb89]

G. Cybenko, Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314 (1989)

[CZP+18]

L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation, in Proceedings of the European Conference on Computer Vision (ECCV) (Munich, Germany, September 2018), pp. 833–851

[DDS+09]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, F.-F. Li, ImageNet: a large-scale hierarchical image database, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Miami, FL, USA, June 2009), pp. 248–255

[DHF07]

L. De Haan, A. Ferreira, Extreme Value Theory: An Introduction (Springer, 2007)

[DT18]

T. DeVries, G.W. Taylor, Learning Confidence for Out-of-Distribution Detection in Neural Networks (February 2018), pp. 1–12. arXiv:1802.04865

[EKSX96]

M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (August 1996), pp. 226–231

[FAML19]

M. Farajtabar, N. Azizan, A. Mott, A. Li, Orthogonal gradient descent for continual learning, in International Conference on Artificial Intelligence and Statistics (2020, June), pp. 3762–3773. PMLR

[GDS20]

F.K. Gustafsson, M. Danelljan, T. Bo Schön, Evaluating scalable Bayesian deep learning methods for robust computer vision, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, virtual conference (June 2020), pp. 1289–1298

[GG16]

Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: representing model uncertainty in deep learning, in Proceedings of the International Conference on Machine Learning (ICML) (New York, NY, USA, June 2016), pp. 1050–1059

[GK16]

A. Gepperth, C. Karaoguz, A bio-inspired incremental learning architecture for applied perceptual problems. Cogn. Comput. 8(5), 924–934 (2016)

[HAB19]

M. Hein, M. Andriushchenko, J. Bitterwolf, Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Long Beach, CA, USA, June 2019), pp. 41–50

[HBM+20]

D. Hendrycks, S. Basart, M. Mazeika, M. Mostajabi, J. Steinhardt, D. Song, Scaling Out-of-Distribution Detection for Real-World Settings (December 2020), pp. 1–12. arXiv:1911.11132

[HCP+19]

C. Hu, Y. Chen, X. Peng, H. Yu, C. Gao, L. Hu, A novel feature incremental learning method for sensor-based activity recognition. IEEE Trans. Knowl. Data Eng. 31(6), 1038–1050 (2019)

[HG17]

D. Hendrycks, K. Gimpel, A baseline for detecting misclassified and out-of-distribution examples in neural networks, in Proceedings of the International Conference on Learning Representations (ICLR) (Toulon, France, April 2017), pp. 1–12

[HLMW17]

G. Huang, Z. Liu, L. van der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Honolulu, HI, USA, July 2017), pp. 4700–4708

[HMD19]

D. Hendrycks, M. Mazeika, T. Dietterich, Deep anomaly detection with outlier exposure, in Proceedings of the International Conference on Learning Representations (ICLR) (New Orleans, LA, USA, May 2019), pp. 1–18

[HMKS19]

D. Hendrycks, M. Mazeika, S. Kadavath, D. Song, Using self-supervised learning can improve model robustness and uncertainty, in Proceedings of the Conference on Neural Information Processing Systems (NIPS/NeurIPS) (Vancouver, BC, Canada, December 2019), pp. 15637–15648

[HPL+18]

S. Hou, X. Pan, C. Change Loy, Z. Wang, D. Lin, Lifelong learning via progressive distillation and retrospection, in Proceedings of the European Conference on Computer Vision (ECCV) (Munich, Germany, August 2018), pp. 452–467

[HTF07]

T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer, 2007)

[HVD14]

G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, in Proceedings of the Conference on Neural Information Processing Systems (NIPS/NeurIPS) Workshops (Montréal, QC, Canada, December 2014), pp. 1–9

[HZ21]

J. He, F. Zhu, Unsupervised Continual Learning via Pseudo Labels (July 2021), pp. 1–9. arXiv:2104.07164

[JJJK18]

H. Jung, J. Ju, M. Jung, J. Kim, Less-forgetful learning for domain expansion in deep neural networks, in Proceedings of the AAAI Conference on Artificial Intelligence (New Orleans, LA, USA, February 2018), pp. 3358–3365

[JRF20]

N. Jourdan, E. Rehder, U. Franke, Identification of uncertainty in artificial neural networks, in Proceedings of the Uni-DAS e.V. Fahrerassistenz und automatisiertes Fahren (FAS) Workshop, virtual conference (July 2020), pp. 1–11

[KBDFs20]

M. Klingner, A. Bär, P. Donn, T. Fingscheidt, Class-incremental learning for semantic segmentation re-using neither old data nor old labels, in Proceedings of the IEEE Intelligent Transportation Systems Conference (ITSC), virtual conference (September 2020), pp. 1–8

[KBJC19]

D. Kim, J. Bae, Y. Jo, J. Choi, Incremental Learning With Maximum Entropy Regularization: Rethinking Forgetting and Intransigence (February 2019), pp. 1–10. arXiv:1902.00829

[KG17]

A. Kendall, Y. Gal, What uncertainties do we need in Bayesian deep learning for computer vision? in Proceedings of the Conference on Neural Information Processing Systems (NIPS/NeurIPS) (Long Beach, CA, USA, December 2017), pp. 5574–5584

[KKSZ09]

H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek, Outlier detection in axis-parallel subspaces of high dimensional data, in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) (Bangkok, Thailand, April 2009), pp. 831–838

[Kle09]

J. Sakari Klemelä, Smoothing of Multivariate Data: Density Estimation and Visualization (Wiley, 2009)

[KNT00]

E.M. Knorr, R.T. Ng, V. Tucakov, Distance-based outliers: algorithms and applications. Int. J. Very Larg. Data Bases 8(3–4), 237–253 (2000)

[KPR+17]

J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A.A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, R. Hadsell, Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. 114(13), 3521–3526 (2017)

[KSZ08]

H.-P. Kriegel, M. Schubert, A. Zimek, Angle-based outlier detection in high-dimensional data, in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (Las Vegas, NV, USA, August 2008), pp. 444–452

[LH18]

Z. Li, D. Hoiem, Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 40, 2935–2947 (2018)

[LHFS21]

K. Lis, S. Honari, P. Fua, M. Salzmann, Detecting road obstacles by erasing them (April 2021), pp. 1–18. arXiv:2012.13633

[LKJ+17]

S.-W. Lee, J.-H. Kim, J. Jun, J.-W. Ha, B.-T. Zhang, Overcoming Catastrophic Forgetting by Incremental Moment matching. Advances in neural information processing systems, 30 (2017)

[LLLS18]

K. Lee, K. Lee, H. Lee, J. Shin, A simple unified framework for detecting out-of-distribution samples and adversarial attacks, in Proceedings of the Conference on Neural Information Processing Systems (NIPS/NeurIPS) (Montréal, QC, Canada, December 2018), pp. 7167–7177

[LLS18]

S. Liang, Y. Li, R. Srikant, Enhancing the reliability of out-of-distribution image detection in neural networks, in Proceedings of the International Conference on Learning Representations (ICLR) (Vancouver, BC, Canada, April 2018), pp. 1–15

[LLSL19]

K. Lee, K. Lee, J. Shin, H. Lee, Overcoming catastrophic forgetting with unlabeled data in the wild, in Proceedings of the IEEE International Conference on Computer Vision (ICCV) (Seoul, Korea, October 2019), pp. 312–321

[LMB+14]

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. Lawrence Zitnick, Microsoft COCO: common objects in context, in Proceedings of the European Conference on Computer Vision (ECCV) (Zurich, Switzerland, September 2014), pp. 740–755

[LNFS19]

K. Lis, K. Nakka, P. Fua, M. Salzmann, Detecting the unexpected via image resynthesis, in Proceedings of the IEEE International Conference on Computer Vision (ICCV) (Seoul, Korea, October 2019), pp. 2152–2161

[LPB17]

B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, in Proceedings of the Conference on Neural Information Processing Systems (NIPS/NeurIPS) (Long Beach, CA, USA, December 2017), pp. 6402–6413

[Mac92]

D.J.C. MacKay, A practical Bayesian framework for backpropagation networks. Neural Comput. 4(3), 448–472 (1992)

[Mah36]

P. Chandra Mahalanobis, On the generalized distance in statistics. Proc. Natl. Inst. Sci. India 2(1), 49–55 (1936)

[MCE20]

J.K. Mandivarapu, B. Camp, R. Estrada, Self-net: lifelong learning via continual self-modeling. Front. Artif. Intell. 3, 19 (2020)

[MC89]

M. McCloskey, N. Cohen, Catastrophic interference in connectionist networks: the sequential learning problem. Psychol. Learn. Motiv. 24, 109–165 (1989)

[MG19]

J. Mukhoti, Y. Gal, Evaluating Bayesian deep learning methods for semantic segmentation (March 2019), pp. 1–13. arXiv:1811.12709

[MH20]

A. Meinke, M. Hein, Towards neural networks that provably know when they don’t know, in Proceedings of the International Conference on Learning Representations (ICLR), virtual conference (April 2020), pp. 1–18

[MSC+19]

D. Mellado, C. Saavedra, S. Chabert, R. Torres, R.F. Salas, Self-improving generative artificial neural network for pseudorehearsal incremental class learning. Algorithms 12(10), 1–17 (2019)

[MVD17]

A. Munawar, P. Vinayavekhin, G. De Magistris, Limiting the reconstruction capability of generative neural network using negative learning, in Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP) (Tokyo, Japan, September 2017), pp. 1–6

[MZ19]

U. Michieli, P. Zanuttigh, Incremental learning techniques for semantic segmentation, in Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops (Seoul, Korea, October 2019), pp. 3205–3212

[MZ21]

U. Michieli, P. Zanuttigh, Knowledge distillation for incremental learning in semantic segmentation. Comput. Vis. Image Underst. 205, 103167 (2021)

[Nea96]

R.M. Neal, Bayesian Learning for Neural Networks (Springer, 1996)

[OOS17]

A. Odena, C. Olah, J. Shlens, Conditional image synthesis with auxiliary classifier GANs, in Proceedings of the International Conference on Machine Learning (ICML) (Sydney, NSW, Australia, August 2017), pp. 2642–2651

[OPK+19]

O. Ostapenko, M. Puscas, T. Klein, P. Jähnichen, M. Nabi, Learning to remember: a synaptic plasticity driven framework for continual learning, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Long Beach, CA, USA, June 2019), pp. 11313–11321

[ORF20]

P. Oberdiek, M. Rottmann, G.A. Fink, Detection and retrieval of out-of-distribution objects in semantic segmentation, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), virtual conference (June 2020), pp. 1331–1340

[ORG18]

P. Oberdiek, M. Rottmann, H. Gottschalk, Classification uncertainty of deep neural networks based on gradient information, in Proceedings of the IAPR TC3 Workshop on Artificial Neural Networks in Pattern Recognition (ANNPR) (Siena, Italy, September 2018), pp. 113–125

[Pea01]

K. Pearson, On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 2(11), 559–572 (1901)

[PKGF03]

S. Papadimitriou, H. Kitagawa, P.B. Gibbons, C. Faloutsos, LOCI: fast outlier detection using the local correlation integral, in Proceedings of the International Conference on Data Engineering (Bangalore, India, March 2003), pp. 315–326

[PRG+16]

P. Pinggera, S. Ramos, S. Gehrig, U. Franke, C. Rother, R. Mester, Lost and found: detecting small road hazards for self-driving vehicles, in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (Daejeon, Korea, October 2016), pp. 1099–1106

[PTJ+21]

J. Peng, B. Tang, H. Jiang, Z. Li, Y. Lei, T. Lin, H. Li, Overcoming long-term catastrophic forgetting through adversarial neural pruning and synaptic consolidation, in IEEE Transactions on Neural Networks and Learning Systems (TNNLS) (February 2021) early access, pp. 1–14

[PUUH01]

R. Polikar, L. Udpa, S.S. Udpa, V. Honavar, Learn++: an incremental learning algorithm for supervised neural networks. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 31(4), 497–508 (2001)

[RCH+20]

M. Rottmann, P. Colling, T.-P. Hack, R. Chan, F. Hüger, P. Schlicht, H. Gottschalk, Prediction error meta classification in semantic segmentation: detection via aggregated dispersion measures of softmax probabilities, in Proceedings of the International Joint Conference on Neural Networks (IJCNN), virtual conference (July 2020), pp. 1–9

[RKSL17]

S.-A. Rebuffi, A. Kolesnikov, G. Sperl, C.H. Lampert, iCaRL: incremental classifier and representation learning, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Honolulu, HI, USA, July 2017), pp. 5533–5542

[Rob95]

A.V. Robins, Catastrophic forgetting, rehearsal and pseudorehearsal. Connect. Sci. 7(2), 123–146 (1995)

[Ros52]

M. Rosenblatt, Remarks on a multivariate transformation. Ann. Math. Stat. 23(3), 470–472 (1952)

[RPR20]

D. Roy, P. Panda, K. Roy, Tree-CNN: a hierarchical deep convolutional neural network for incremental learning. Neural Netw. 121, 148–160 (2020)

[RRD+16]

A.A. Rusu, N.C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, R. Hadsell, Progressive neural networks (September 2016), pp. 1–14. arXiv:1606.04671

[RRS00]

S. Ramaswamy, R. Rastogi, K. Shim, Efficient algorithms for mining outliers from large data sets, in Proceedings of the ACM SIGMOD International Conference on Management of Data (MOD) (May 2000), pp. 427–438

[Rud87]

W. Rudin, Real and Complex Analysis (McGraw-Hill Inc., 1987)

[SAR20]

S. Shakib Sarwar, A. Ankit, K. Roy, Incremental learning in deep convolutional neural networks using partial network sharing. IEEE Access 8, 4615–4628 (2020)

[SCL+18]

J. Schwarz, W. Czarnecki, J. Luketina, A. Grabska-Barwinska, Y.W. Teh, R. Pascanu, R. Hadsell, Progress & compress: a scalable framework for continual learning, in Proceedings of the International Conference on Machine Learning (ICML) (Stockholm, Sweden, July 2018), pp. 4528–4537

[SKGK20]

A. Singh, A. Kamireddypalli, V. Gandhi, K.M. Krishna, LiDAR guided small obstacle segmentation, in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2020), pp. 8513–8520

[SLKK17]

H. Shin, J.K. Lee, J. Kim, J. Kim, Continual learning with deep generative replay, in Proceedings of the Conference on Neural Information Processing Systems (NIPS/NeurIPS) (Long Beach, CA, USA, December 2017), pp. 2990–2999

[SY14]

M. Sakurada, T. Yairi, Anomaly detection using autoencoders with nonlinear dimensionality reduction, in Proceedings of the Workshop on Machine Learning for Sensory Data Analysis (MLSDA) (Gold Coast, QLD, Australia, December 2014), pp. 4–11

[TTA19]

O. Tasar, Y. Tarabalka, P. Alliez, Incremental learning for semantic segmentation of large-scale remote sensing data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 12(9), 3524–3537 (2019)

[vdMH08]

L. van der Maaten, G. Hinton, Visualizing data using t-SNE. J. Mach. Learn. Res. 9(86), 2579–2605 (2008)

[vdVT19]

G.M. van de Ven, A.S. Tolias, Three scenarios for continual learning (April 2019), pp. 1–18. arXiv:1904.07734

[VGV+21]

S. Vandenhende, S. Georgoulis, W. Van Gansbeke, M. Proesmans, D. Dai, L. Van Gool, Multi-task learning for dense prediction tasks: a survey, in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (January 2021), early access, pp. 1–20

[WCW+18]

Y. Wu, Y. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, Z. Zhang, Y. Fu, Incremental classifier learning with generative adversarial networks (February 2018), pp. 1–10. arXiv:1802.00853

[WCW+19]

Y. Wu, Y.-J. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, Y. Fu, Large scale incremental learning, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Long Beach, CA, USA, June 2019), pp. 374–382

[XZL+20]

Y. Xia, Y. Zhang, F. Liu, W. Shen, A. Yuille, Synthesize then compare: detecting failures and anomalies for semantic segmentation, in Proceedings of the European Conference on Computer Vision (ECCV), virtual conference (August 2020), pp. 145–161

[YCW+20]

F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, T. Darrell, BDD100K: a diverse driving dataset for heterogeneous multitask learning, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) virtual conference (June 2020), pp. 2636–2645

[YHW+19]

X. Yao, T. Huang, C. Wu, R. Zhang, L. Sun, Adversarial feature alignment: avoid catastrophic forgetting in incremental task lifelong learning. Neural Comput. 31(11), 2266–2291 (2019)

[YYLH18]

J. Yoon, E. Yang, J. Lee, S. Ju Hwang, Lifelong learning with dynamically expandable networks (June 2018), pp. 1–11. arXiv:1708.01547

[YZZ+19]

Y. Yang, D. Wei Zhou, D. Chuan Zhan, H. Xiong, Y. Jiang, Adaptive deep models for incremental learning: considering capacity scalability and sustainability, in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (Anchorage, AK, USA, July 2019), pp. 74–82

[ZCCY21]

G. Zeng, Y. Chen, B. Cui, S. Yu, Continuous learning of context-dependent processing in neural networks (June 2021), pp. 1–18. arXiv:1810.01256

[ZPG17]

F. Zenke, B. Poole, S. Ganguli, Continual learning through synaptic intelligence, in Proceedings of the International Conference on Machine Learning (ICML) (Sydney, NSW, Australia, August 2017), pp. 3987–3995

[ZSR+19]

Y. Zhu, K. Sapra, F.A. Reda, K.J. Shih, S. Newsam, A. Tao, B. Catanzaro, Improving semantic segmentation via video propagation and label relaxation, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Long Beach, CA, USA, June 2019), pp. 8856–8865

Title: Detecting and Learning the Unknown in Semantic Segmentation
Authors: Robin Chan, Svenja Uhlemeyer, Matthias Rottmann, Hanno Gottschalk
Publisher: Springer International Publishing
Book: Deep Neural Networks and Data for Automated Driving
Print ISBN: 978-3-031-01232-7

Electronic ISBN: 978-3-031-01233-4

Copyright Year: 2022
DOI: https://doi.org/10.1007/978-3-031-01233-4_10


	LostAndFoundNoKnown			RoadObstacle21
Method	AuPRC \(\uparrow\)	AuROC \(\uparrow\)	FPR\(_{95\text{TPR}}\) \(\downarrow\)	AuPRC \(\uparrow\)	AuROC \(\uparrow\)	FPR\(_{95\text{TPR}}\) \(\downarrow\)
Maximum Softmax	30.1	93.0	33.2	10.0	95.5	17.9
ODIN	52.9	95.1	30.0	11.9	96.0	16.4
Mahalanobis	55.0	97.5	12.9	19.5	95.1	21.7
Monte Carlo Dropout	36.8	92.2	35.5	4.9	83.5	50.3
Void classifier	4.8	79.5	47.0	10.4	89.7	41.5
Embedding density	61.7	98.0	10.4	0.8	81.0	46.4
Image resynthesis	42.7	96.4	17.4	37.5	98.6	4.7
SynBoost	81.7	98.3	4.6	71.3	99.4	3.2
Entropy maximization	77.9	98.0	9.7	76.0	99.7	1.3
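
The FPR\(_{95\text{TPR}}\) columns report the false positive rate at the score threshold where 95% of the anomaly pixels are detected. The following is a minimal sketch of that computation, not the authors' evaluation code. It assumes flat lists of per-pixel anomaly scores (higher means more anomalous) and binary labels, and the function name `fpr_at_tpr` is hypothetical.

```python
def fpr_at_tpr(scores, labels, target_tpr=0.95):
    """False positive rate at the highest threshold reaching target_tpr.

    scores: per-pixel anomaly scores, higher = more anomalous (assumption)
    labels: 1 for anomaly pixels, 0 for in-distribution pixels
    """
    if not 0.0 < target_tpr <= 1.0:
        raise ValueError("target_tpr must lie in (0, 1]")
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one anomaly and one normal pixel")

    # Sweep the threshold from the highest score downwards.
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    tp = fp = 0
    i = 0
    while i < len(order):
        current = scores[order[i]]
        # Handle tied scores together: a threshold cannot separate them.
        while i < len(order) and scores[order[i]] == current:
            if labels[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if tp / positives >= target_tpr:
            return fp / negatives
    return 1.0  # not reached for valid input; kept as a safe fallback


# Toy usage with made-up scores: one normal pixel outranks an anomaly pixel.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_fpr = fpr_at_tpr(toy_scores, toy_labels)
```

In the toy example, all three anomaly pixels are only recovered once the threshold drops to 0.6, by which point one of the three normal pixels has been flagged as well.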

Anomaly score/entropy threshold	Entropy maximization \(+\) thresholding				Entropy maximization \(+\) thresholding \(+\) meta classifier
\(a_i \ge \tau,\ i \in \mathcal{I}\)	FP \(\downarrow\)	FN \(\downarrow\)	F\(_1\) \(\uparrow\)	\(\delta\) in % \(\downarrow\)	FP \(\downarrow\)	FN \(\downarrow\)	F\(_1\) \(\uparrow\)	\(\delta\) in % \(\downarrow\)
\(\tau =0.30\)	8,068	191	0.26	0.30	290	308	0.82	0.06
\(\tau =0.40\)	4,035	289	0.39	0.11	251	359	0.81	0.03
\(\tau =0.50\)	1,215	415	0.60	0.04	145	447	0.80	0.02
\(\tau =0.60\)	327	613	0.69	0.02	49	619	0.76	0.02
\(\tau =0.70\)	135	879	0.61	0.01	21	881	0.63	0.01
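
The table lists false positives (FP), false negatives (FN) and the F\(_1\) score per threshold, but not the number of true positives (TP). As a reminder of how these quantities relate, here is a minimal sketch assuming the standard definition F\(_1\) = 2TP / (2TP + FP + FN). The function name `f1_from_counts` and the counts in the usage line are hypothetical and are not taken from the table.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from true positive, false positive and false negative counts."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        # No predictions and no ground truth: F1 is undefined, report 0.0.
        return 0.0
    return 2 * tp / denominator


# Hypothetical counts for illustration only.
example_f1 = f1_from_counts(tp=8, fp=2, fn=2)
```

Under this definition, lowering FP at a fixed TP raises F\(_1\) even if FN grows slightly, which matches the direction of the change in the table when the meta classifier is added at low thresholds.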
