
Open Access 2022 | OriginalPaper | Chapter

Optimized Data Synthesis for DNN Training and Validation by Sensor Artifact Simulation

Authors: Korbinian Hagn, Oliver Grau

Published in: Deep Neural Networks and Data for Automated Driving

Publisher: Springer International Publishing


Abstract

Synthetic, i.e., computer-generated imagery (CGI) data is a key component for training and validating deep-learning-based perceptive functions due to its ability to simulate rare cases, avoidance of privacy issues, and generation of pixel-accurate ground truth data. Today, physical-based rendering (PBR) engines already simulate a wealth of realistic optical effects, but they are mainly focused on the human perception system, whereas perceptive functions require realistic images whose sensor artifacts are modeled as closely as possible to those of the sensor with which the training data has been recorded. This chapter proposes a way to improve the data synthesis process by the application of realistic sensor artifacts. To do this, one has to overcome the domain distance between real-world and synthetic imagery. Therefore, we propose a measure which captures the generalization distance of two distinct datasets on which the same model has been trained. With this measure the data synthesis pipeline can be improved to produce realistic sensor-simulated images which are closer to the real-world domain. The proposed measure is based on the Wasserstein distance (earth mover’s distance, EMD) over the performance metric mean intersection-over-union (mIoU) on a per-image basis, comparing synthetic and real datasets using deep neural networks (DNNs) for semantic segmentation. This measure is subsequently used to match the characteristics of a real-world camera in the image synthesis pipeline, which considers realistic sensor noise and lens artifacts. Comparing the measure with the well-established Fréchet inception distance (FID) on real and artificial datasets demonstrates its ability to capture the generalization distance, which is inherently asymmetric and more informative than a simple distance measure. Furthermore, we use the metric as an optimization criterion to adapt a synthetic dataset to a real dataset, decreasing the EMD between a synthetic dataset and the Cityscapes dataset from 32.67 to 27.48 and increasing the mIoU of our test algorithm (DeeplabV3+) from 40.36 to \(47.63\%\).

1 Introduction

Validation of deep neural networks (DNNs) is increasingly resorting to computer-generated imagery (CGI) because it mitigates several issues. First, synthetic data can avoid the privacy issues that come with recordings of members of the public and can automatically produce vast amounts of high-quality data with pixel-accurate ground truth more reliably than costly manual labeling. Moreover, simulations allow synthesis of rare cases and the systematic variation and explanation of critical constellations [SGH20], a requirement for validation of products targeting safety-critical applications, such as automated driving. Here, the creation of corner cases and scenarios which otherwise could not be recorded in a real-world scenario without endangering other traffic participants is the key argument for the validation of perceptive AI with synthetic images.
Despite the advantages of CGI methods, training and validation with synthetic images still have challenges: Training with these images does not guarantee a similar performance on real-world images and validation is only valid if one can verify that the found weaknesses in the validation do not stem from the synthetic-to-real distribution shift seen in the input.
To measure and mitigate this domain shift, metrics have been introduced with various applications in the fields of domain adaptation and transfer learning. In domain adaptation, metrics such as the FID, the kernel inception distance (KID), and the maximum mean discrepancy (MMD) are applied to train generative adversarial networks (GANs) to adapt to a target feature space [PTKY09] or to re-create the visual properties of a dataset [SGZ+16]. However, the problem of training and validation with synthetic imagery is directly related to the predictive performance of a perception algorithm on the target data, and these kinds of metrics struggle to correlate with predictive performance [RV19]. Additionally, applications of domain adaptation methods often resort to specifically trained DNNs, e.g., GANs, which adapt one domain to the other and therefore add an extra layer of complexity and uncontrollability. This is especially unwanted if a validation goal is tested, e.g., to detect all pedestrians, and the domain adaptation by a GAN would add additional objects into the scene (e.g., see [HTP+18]), making it even harder to attribute detected faults of the model to certain specifics of the tested scene. Here, the creation of images via a synthesis process allows domain distance influence factors to be understood more directly, as all parameters are under direct control.
Camera-recorded images inherently show visual imperfections or artifacts, such as sensor noise, blur, chromatic aberration, or image saturation, as can be seen in an image example from the A2D2 [GKM+20] dataset in Fig. 1. CGI methods, on the other hand, are usually based on idealized models; for example, the pinhole camera model [Stu14] which is free of sensor artifacts.
In this chapter, we present an approach to decrease the domain divergence of synthetic to real-world imagery for perceptive DNNs by realistically modeling sensor and lens artifacts, thereby increasing the viability of CGI for training and validation. To achieve this, we first introduce a model of sensor artifacts whose parameters are extracted from a real-world dataset and then apply it to a synthetic dataset for training, measuring the remaining domain divergence via validation. To this end, we present a new interpretation of the domain divergence as a generalization distance between two datasets, obtained by comparing per-image performance over a dataset using the Wasserstein or earth mover’s distance (EMD). Next, we demonstrate how this model is able to decrease the domain divergence further by optimizing the initially extracted camera sensor simulation parameters, as depicted in Fig. 6. Additionally, we compare our results with randomly chosen parameters as well as with randomly chosen and subsequently optimized parameters. Last, we strengthen the case for the usability of our EMD domain divergence measure by comparison with the well-known Fréchet inception distance (FID) on a set of real-world and synthetic datasets and highlight the advantage of our asymmetric domain divergence over the symmetric distance.
2 Related Work

This chapter is related to two areas: domain distance measures, as used in the field of domain adaptation, and synthetic data generation for training and validation.
Domain distance measures: A key challenge in domain adaptation approaches is the expression of a distance measure between datasets, also called domain shift. A number of methods were developed to mitigate this shift (e.g., see [LCWJ15, GL15, THSD17, THS+18]).
To measure the domain shift or domain distance, the inception score (IS) has been proposed [SGZ+16], where the classification output of an InceptionV3-based [SVI+16] discriminator network trained on the ImageNet dataset [DDS+09] is used. The works of [HRU+17, BSAG21] rely on features extracted from the InceptionV3 network to tune domain adaptation approaches, i.e., the FID and KID. However, these metrics cannot predict if the classification performance increases when adapted data is applied as training data for a discriminator [RV19].
Therefore, to measure performance directly, it is essential to train with the adapted or synthetic data and validate on the target data, i.e., cross-evaluation as done by [RSM+16, WU18, SAS+18].
Performance metrics: The mean intersection-over-union (mIoU) is a widely used performance metric for benchmarking semantic segmentation [COR+16, VSN+18]. Adaptations and improvements of the mIoU have been proposed which put more weight on the segmentation contour, as in [FMWR18, RTG+19]. Performance metrics such as the mIoU are computed over the whole validation dataset, i.e., over the whole confusion matrix, but there are propositions to apply the mIoU calculation on a per-image basis and compare the resulting empirical distributions [CLP13].
A per-image comparison mitigates several shortcomings of a single evaluation metric on the whole dataset when used for comparison of classifiers on the same dataset. First, one can distinguish multimodal and unimodal distributions, i.e., strong classification on one half and weak classification on the other half of a set can lead to the same mean as an average classification on all samples. Second, unimodal distributions with the same mean but different shape are also indiscernible under a single dataset averaged metric. This justification led to our choice of a per-image-based mIoU metric as it allows for deeper investigations which are especially helpful when one wants to understand the characteristics that increase or decrease a domain divergence.
Sensor simulation for synthetic image training: The use of synthesized data for development and validation is an accepted technique and has also been suggested for computer vision applications (e.g., [BB95]). Recently, specifically for the domain of driving scenarios, game engines have been adapted [RHK17, DRC+17].
Although game engines provide a good starting point to simulate environments, they usually only offer a closed rendering set-up with many trade-offs between real-time constraints and a subjectively good visual appearance for human observers. In particular, the lighting computation in their rendering pipelines is not transparent. They therefore do not produce physically correct imagery but only a fixed rendering quality (as a function of lighting computation and tone mapping), resulting in output images with a low dynamic range (LDR), typically 8 bit per RGB color channel.
Recently, physical-based rendering techniques have been applied to the generation of data for training and validation, like Synscapes [WU18]. For our chapter we use a dataset in high dynamic range (HDR) created with the physical-based Blender Cycles renderer.1 We implemented a customized tone mapping to 8-bit per color channel and sensor simulation, as described in the next section.
While there is great interest in understanding the domain distance in the area of domain adaptation via generative strategies, i.e., GANs, there has been little research regarding sensor artifact influence on training and validation with synthetic images. Other works [dCCN+16, NdCCP18] add different kinds of sensor noise to their training set and report a degradation of performance, compared to a model trained with no noise in the training set, due to training of a harder, i.e., noisier, visual task. Adding noise in training is a common technique for image augmentation and can be seen as a regularization technique [Bis95] to prevent overfitting.
Our task of modeling sensor artifacts for synthetic images extracted from camera images is not aimed at improving the generalization through random noise, but to tune the parameters of our sensor model to closely replicate the real-world images and improve generalization on the target data.
First results of modeling camera effects to improve learning from synthetic data on the perceptive task of bounding box detection have been presented by [CSVJR18, LLFW20]. Liu et al. [LLFW20] additionally state that generalization is an asymmetric measure, which should be considered when comparing with symmetric dataset distance measures from the literature. Furthermore, Carlson et al. [CSVJR19] learned sensor artifact parameters from a real-world dataset and applied the learned parameters of their noise sources as image augmentation during training with synthetic data on the task of bounding box detection. However, in contrast to our approach, they apply their optimization as a style loss on a latent feature vector extracted from a VGG-16 network trained on ImageNet and evaluate the performance on the task of 2D object detection.

3 Methods

Given a synthetic (CGI) dataset of urban street scenes, our goal is to decrease the domain gap to a real-world dataset for semantic segmentation by realistic sensor artifact simulation. Therefore, we systematically analyze the image sensor artifacts of the real-world dataset and use this extracted parametrization for our sensor artifact simulation. To compare our synthetic dataset with the real-world dataset, we contrive a novel per-image performance-based metric that measures the generalization distance between the datasets. We utilize a DeeplabV3+ [CZP+18] semantic segmentation model with a ResNet101 [HZRS16] backbone to train and evaluate on the different datasets throughout this chapter. To show the valuable properties of our measure, we compare it with an established domain distance measure, the Fréchet inception distance (FID). Lastly, we use our measure as the optimization criterion for adapting the parameters of our sensor artifact simulation, with the extracted parameters as starting point, and show that we can further decrease the domain distance from synthetic images to real-world images.

3.1 Sensor Simulation

We implemented a simple sensor model with the principle blocks depicted in Fig. 2: The module expects images in linear RGB space. Rendering engines like Blender Cycles2 can provide these images as results in OpenEXR format.3
We simulate a simple model by applying chromatic aberration, blur, and sensor noise modeled as additive Gaussian noise (zero mean, variance as a free parameter), followed by a simple exposure control (linear tone mapping) and a final non-linear gamma correction.
First, we apply blur by a simple box filter with filter size \(F \times F\) and a chromatic aberration (CA). The CA is approximated using radial distortions (k1, second order), e.g., [CV14], as defined in OpenCV. The CA is implemented as a per channel (red, green, blue) variation of the k1 radial distortion, i.e., we introduce an incremental parameter ca that affects the radial distortions: \(\text {k1}(\text {blue}) = - {ca}; \text {k1}(\text {green}) = 0; \text {k1}(\text {red}) = + {ca}\). As the next step, we apply Gaussian noise to the input image.
Applying a linear function, the pixel values are then mapped and rounded to the target output byte range [0, ..., 255].
The two parameters of the linear mapping are determined by a histogram evaluation of the input RGB values of the respective image, imitating an auto exposure of a real camera. In our experiments we have set it to saturate \(2\%\) (initially) of the brightest pixel values, as these are usually values of very high brightness, induced by sky or even the sun. Values below the minimum or above the set maximum are mapped to 0 or 255, respectively.
In the last step we apply gamma correction to achieve the final processed synthetic image:
$$\begin{aligned} \mathbf {x} = (\tilde{\mathbf {x}}) ^ {\gamma } \end{aligned}$$
(1)
The parameter \(\gamma \) is an approximation of the sensor non-linear mapping function. For media applications this is usually \(\gamma =2.2\) for the sRGB color space [RD14]. However, for industrial cameras, this is not yet standardized and some vendors do not reveal it.4 We therefore estimate the parameter as an approximation. Figure 3 depicts the difference of an image with and without simulated sensor artifacts.
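The following Python sketch illustrates one possible implementation of this pipeline with NumPy and OpenCV. It is not the authors' implementation: the camera matrix used for the per-channel radial warp, the scaling of the noise variance, and the percentile-based exposure heuristic are assumptions made for illustration only.

```python
# Illustrative sketch of the sensor artifact pipeline of Sect. 3.1: chromatic
# aberration, box blur, additive Gaussian noise, linear tone mapping with
# saturation, and gamma correction. Not the authors' code; the camera matrix,
# the noise scaling, and the exposure heuristic are assumptions.
import numpy as np
import cv2

def simulate_sensor(rgb_linear, ca=0.08, F=4, noise_sigma=3.0,
                    saturation=2.0, gamma=0.8):
    """rgb_linear: HxWx3 float array in linear RGB (e.g., read from OpenEXR)."""
    img = rgb_linear.astype(np.float32)
    h, w = img.shape[:2]
    # Assumed pinhole camera matrix for the radial warp (focal length ~ width).
    K = np.array([[w, 0, w / 2], [0, w, h / 2], [0, 0, 1]], dtype=np.float64)

    # Chromatic aberration: per-channel radial warp with k1 = +ca / 0 / -ca for
    # red / green / blue (RGB channel order assumed); cv2.undistort serves as
    # an approximate radial warp here.
    out = np.empty_like(img)
    for c, k1 in zip(range(3), (+ca, 0.0, -ca)):
        dist = np.array([k1, 0.0, 0.0, 0.0, 0.0])
        out[..., c] = cv2.undistort(np.ascontiguousarray(img[..., c]), K, dist)

    # F x F box blur, then zero-mean Gaussian sensor noise (sigma interpreted
    # on an 8-bit scale, hence the division by 255).
    out = cv2.blur(out, (F, F))
    out = out + np.random.normal(0.0, noise_sigma / 255.0, out.shape)

    # Linear tone mapping imitating auto exposure: saturate the brightest
    # `saturation` percent of values and map the rest linearly to [0, 1].
    hi = np.percentile(out, 100.0 - saturation)
    lo = float(out.min())
    out = np.clip((out - lo) / max(hi - lo, 1e-8), 0.0, 1.0)

    # Non-linear gamma correction and quantization to 8 bit per channel.
    return np.clip(255.0 * out ** gamma, 0.0, 255.0).astype(np.uint8)
```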

3.2 Dataset Divergence Measure

Our proposed measure quantifies per-image performance between models trained on different datasets but evaluated on the same dataset. Considering the task of semantic segmentation, we chose the mIoU as our base metric. We then modify the mIoU to be calculated per image instead of over the confusion matrix of the whole evaluated dataset. Next, we introduce the Wasserstein-1 or earth mover’s distance (EMD) as our divergence measure between the per-image mIoU distributions of two classifiers trained on distinct datasets, i.e., a synthetic and a real-world dataset, but evaluated on the same real-world dataset the second classifier has been trained on.
The mIoU is defined as follows:
$$\begin{aligned} mIoU = \frac{1}{S}\sum _{s\in \mathcal {S}}\frac{TP_s}{TP_s + FP_s + FN_s} \times 100\%, \end{aligned}$$
(2)
with \(TP_s\), \(FN_s\), and \(FP_s\) being the numbers of true positives, false negatives, and false positives of the sth class, accumulated over all images of the evaluated dataset.
Table 1 Due to differences in label definition of real-world datasets, the class mapping for training and evaluation is decreased to 11 classes that are common in all considered datasets: A2D2 [GKM+20], Cityscapes [COR+16], Berkeley Deep Drive (BDD100K) [YCW+20], Mapillary Vistas (MV) [NOBK17], India Driving Dataset (IDD) [VSN+18], GTAV [RVRK16], our synthetic dataset [KI 20], and Synscapes [WU18]

s      0     1         2         3     4              5             6           7    8      9    10
Label  road  sidewalk  building  pole  traffic light  traffic sign  vegetation  sky  human  car  truck
Here, \(\mathcal {S}=\{0,1,...,S-1\}\), with \(S=11\), as we use the 11 classes defined in Table 1. These classes are the maximal overlap of common classes in the real and synthetic datasets considered for cross-evaluation and comparison of our measure with the Fréchet inception distance (FID), as can be seen later in Sect. 4.3, Tables 3 and 4.
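As a generic illustration (not the authors' evaluation code), (2) can be computed from a confusion matrix accumulated over the whole validation set:

```python
# Generic sketch of (2): mIoU from a confusion matrix accumulated over the
# whole validation set for the S=11 classes of Table 1.
import numpy as np

def miou_from_confusion(conf):
    """conf: SxS matrix with conf[g, p] = number of pixels of ground-truth
    class g predicted as class p, summed over all images of the dataset."""
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp              # predicted as class s, but wrong
    fn = conf.sum(axis=1) - tp              # labeled as class s, but missed
    iou = tp / np.maximum(tp + fp + fn, 1)  # guard against empty classes
    return 100.0 * float(iou.mean())
```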
A distribution over the per-image IoU takes the following form:
$$\begin{aligned} IoU_{n} = \frac{1}{S}\sum _{s\in \mathcal {S}}\frac{TP_{s,n}}{TP_{s,n} + FP_{s,n} + FN_{s,n}} \times 100\%, \end{aligned}$$
(3)
where n denotes the nth image in the validation dataset. Here, \(IoU_{n}\) is measured in \(\%\). We want to compare the distributions of per-image IoU values from two different models; therefore, we apply the Wasserstein distance. The Wasserstein distance, an optimal mass transport metric [KPT+17], is defined for density distributions p and q as follows, where \(\inf \) denotes the infimum, i.e., the lowest transportation cost, and \( \Gamma (p, q)\) denotes the set containing all joint distributions \(\pi \), i.e., transportation maps, for (X, Y) which have p and q as marginals:
$$\begin{aligned} W_r (p, q) = \left( \inf _{\pi \in \Gamma (p, q)} \int _{\mathbb {R} \times \mathbb {R}} |X-Y|^{r} \mathrm {d} \pi \right) ^{1/r}. \end{aligned}$$
(4)
This distance formulation is equivalent to the following [RTC17]:
$$\begin{aligned} W_r (p, q) = \left( \int _{-\infty }^{\infty } |P(t)-Q(t)|^r \, \mathrm {d}t\right) ^{1/r} . \end{aligned}$$
(5)
Here P and Q denote the respective cumulative distribution functions (CDFs) of p and q.
In our application we compute the empirical distributions of p and q, for which the distance simplifies to a function of the order statistics:
$$\begin{aligned} W_r(\hat{p},\hat{q}) = \left( \frac{1}{n}\sum ^{n}_{i=1}|\hat{p}_{i}-\hat{q}_{i}|^r\right) ^{1/r} , \end{aligned}$$
(6)
where \(\hat{p}\) and \(\hat{q}\) are the empirical distributions of the marginals p and q sorted in ascending order. With \(r=1\) and equal weight distributions we get the earth mover’s distance (EMD) which, in other words, measures the area between the respective CDFs with \(L_{1}\) as ground distance.
We assume a sample size of at least 100 to be enough for the EMD calculation to be valid, as fewer samples might not guarantee a sufficient sampling of the domains. In our experiments we use sample sizes \(\ge 500\).
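A minimal sketch of (3) and (6) with \(r=1\) is given below; scipy.stats.wasserstein_distance computes exactly this sample-normalized 1-Wasserstein distance. The ignore index and the skipping of classes absent from an image are implementation assumptions, not prescribed by the equations.

```python
# Sketch of the per-image mIoU of (3) and the EMD of (6) with r=1; function
# names and the ignore index are illustrative assumptions.
import numpy as np
from scipy.stats import wasserstein_distance

def per_image_miou(pred, label, num_classes=11, ignore_index=255):
    """pred, label: HxW integer class maps of one image; returns mIoU in %."""
    valid = label != ignore_index
    ious = []
    for s in range(num_classes):
        tp = np.sum((pred == s) & (label == s) & valid)
        fp = np.sum((pred == s) & (label != s) & valid)
        fn = np.sum((pred != s) & (label == s) & valid)
        denom = tp + fp + fn
        if denom > 0:                 # skip classes absent from this image
            ious.append(tp / denom)
    return 100.0 * float(np.mean(ious)) if ious else 0.0

def emd_divergence(miou_model_a, miou_model_b):
    """1-Wasserstein distance between two per-image mIoU samples (in %), e.g.,
    a synthetically trained and a Cityscapes-trained model evaluated on the
    same >= 500 validation images; the result lies in [0, 100]."""
    return wasserstein_distance(miou_model_a, miou_model_b)
```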
The FID is a special case of the Wasserstein-2 distance derived from (4) with \(r=2\) and \(\hat{p}\) and \(\hat{q}\) being normally distributed, leading to the following definition:
$$\begin{aligned} \text {FID}= ||\boldsymbol{\mu } -\boldsymbol{\mu }_{w}||^{2}+{\text {tr}} (\mathbf {\Sigma } + \mathbf {\Sigma }_{w}-2(\mathbf {\Sigma } \mathbf {\Sigma }_{w})^{1/2}), \end{aligned}$$
(7)
where \(\boldsymbol{\mu }\) and \(\boldsymbol{\mu }_{w}\) are the means, and \(\mathbf {\Sigma }\) and \(\mathbf {\Sigma }_{w}\) are the covariance matrices of the multivariate Gaussian-distributed feature vectors of synthetic and real-world datasets, respectively.
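For comparison, (7) can be evaluated directly from the Gaussian statistics of the InceptionV3 features; the sketch below assumes the means and covariances have already been estimated from the two datasets and omits feature extraction.

```python
# Sketch of (7): FID from precomputed InceptionV3 feature statistics
# (mean and covariance) of two datasets; feature extraction is omitted.
import numpy as np
from scipy import linalg

def fid(mu1, sigma1, mu2, sigma2):
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):   # discard tiny imaginary parts caused by
        covmean = covmean.real     # numerical noise in the matrix square root
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```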
Compared to distance metrics such as the FID, which is symmetric by definition, our measure is a divergence, i.e., the distance from dataset A to dataset B can be different from the distance from dataset B to dataset A. Being a divergence also reflects that a classifier generally has a different generalization distance when trained on dataset A and evaluated on dataset B than the other way around.
Because the ground measure of the signatures, i.e., the per-image IoU, is bounded to \(0\le IoU_{n}\le 100\), the resulting Wasserstein-r distance is bounded as well. For \(r=1\), the measure satisfies \(0\le EMD\le 100\).
To verify whether the per-image IoU distribution is a good proxy of a dataset’s domain distribution, we need to verify that the distribution stays (nearly) constant when training from different starting conditions. Therefore, we trained six DeeplabV3+ models with the same hyperparameters but different random initializations on the Cityscapes dataset and evaluated them on the validation set, calculating the mIoU per image. The resulting distributions of each model in the ensemble are converted into a CDF, as shown in Fig. 4. To provide stronger empirical evidence that the per-image mIoU performance distribution is constant for a dataset, we apply the two-sample Kolmogorov-Smirnov test on each pair of distributions in the ensemble. The resulting p-values are all \(>0.95\), hence supporting our hypothesis.
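This stability check can be implemented, for instance, as pairwise two-sample Kolmogorov-Smirnov tests with scipy.stats.ks_2samp; the sketch below is illustrative and not the authors' code.

```python
# Sketch of the stability check: pairwise two-sample Kolmogorov-Smirnov tests
# over the per-image mIoU samples of an ensemble of identically configured but
# differently initialized models.
from itertools import combinations
from scipy.stats import ks_2samp

def ensemble_ks_pvalues(per_image_miou_runs):
    """per_image_miou_runs: list of 1D arrays, one per trained model."""
    return [ks_2samp(a, b).pvalue
            for a, b in combinations(per_image_miou_runs, 2)]
```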

3.3 Datasets

For our sensor parameter optimization experiments we consider two datasets. First, the real-world Cityscapes dataset, which consists of 2,975 annotated images for training and 500 annotated images for validation; all images were captured in urban street scenes in German cities. Second, the synthetic dataset provided by the KI-A project [KI 20], which consists of 21,802 annotated training images and 5,164 validation images. The KI-A synthetic dataset comprises urban street scenes, similar to Cityscapes, as well as suburban to rural street scenes, which are characterized by less traffic and less dense housing and therefore contain more vegetation and terrain objects.

4 Results and Discussion

4.1 Sensor Parameter Extraction

As a baseline for our sensor simulation, we analyzed images from the Cityscapes training data and measured the parameters. Sensor noise was extracted from about 10 images with uniformly colored areas ranging from dark to light colors. Chromatic aberration was extracted from 10 images with traffic signs at the outermost edges of the image, as can be seen in Fig. 5. The extracted values were averaged over the images. The starting parameters of our optimization approach are then as follows: \(\text {saturation}=2.0\%\), noise \(\sim \mathcal {N}(0,\,3)\,\), \(\gamma =0.8\), \(F=4\), and \(ca=0.08\).

4.2 Sensor Artifact Optimization Experiment

Utilizing the EMD as dataset divergence measure and the sensor parameters extracted from Cityscapes camera images, we apply an optimization strategy to iteratively decrease the gap between Cityscapes and the synthetic dataset [KI 20]. For the optimization, we chose the trust region reflective (trf) method [SLA+15] as implemented in SciPy [VGO+20]. The trf method is a least-squares minimization method that finds a local minimum of a cost function given certain input variables. The cost function is the EMD between the predictions of the synthetically trained model and of the real-world-trained model on the same real-world validation dataset. The variables of the cost function are the parameters of the sensor artifact simulation. The trf method has the capability of bounding the variables to meaningful ranges. The stop criterion is met when the change in parameter step size or the decrease of the cost function falls below \(10^{-6}\).
The overall optimization method is depicted in Fig. 6. Step 1: Initial parameters from the optimization method are applied in the sensor artifact simulation to the synthetic images. Step 2: The DeeplabV3+ model with ResNet101 backbone is pre-trained for 15 epochs on the original unmodified synthetic dataset and fine-tuned for one epoch on the synthetic dataset with applied sensor artifacts, using a learning rate of 0.1. Step 3: The model parameters are frozen and the model is set to evaluation mode. Step 4: The model predicts on the validation set of the Cityscapes dataset. Step 5: The remaining domain divergence is measured by evaluating the mIoU per image and calculating the EMD to the evaluations of a model trained on Cityscapes. Step 6: The resulting EMD is fed as cost to the optimization method. Step 7: New parameters are set for the sensor artifact simulation, or the optimization ends if the stop criteria are met.
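A compact sketch of this loop with SciPy's trust region reflective solver is shown below. The two callables stand in for steps 1 to 5 of Fig. 6 and the parameter bounds are assumptions; only the solver choice (method='trf') and the 1e-6 tolerances follow the description above.

```python
# Sketch of the optimization loop of Fig. 6 with scipy.optimize.least_squares
# (method='trf'). The two callables stand in for steps 1-5 of the pipeline and
# the parameter bounds are assumptions.
import numpy as np
from scipy.optimize import least_squares
from scipy.stats import wasserstein_distance

def make_emd_cost(apply_artifacts, finetune_and_eval, miou_real):
    """apply_artifacts(params) -> sensor-augmented synthetic dataset;
    finetune_and_eval(dataset) -> per-image mIoU on the Cityscapes val set;
    miou_real: per-image mIoU of the Cityscapes-trained reference model."""
    def cost(params):
        miou_synth = finetune_and_eval(apply_artifacts(params))
        # least_squares expects a residual vector; one EMD residual suffices.
        return np.array([wasserstein_distance(miou_synth, miou_real)])
    return cost

# Extracted starting parameters: saturation, noise sigma, gamma, F, ca.
x0 = np.array([2.0, 3.0, 0.8, 4.0, 0.08])
bounds = ([0.0, 0.0, 0.1, 1.0, 0.0],      # assumed meaningful parameter ranges
          [10.0, 20.0, 3.0, 9.0, 0.5])
# result = least_squares(make_emd_cost(apply_artifacts, finetune_and_eval,
#                                      miou_real),
#                        x0, bounds=bounds, method='trf', xtol=1e-6, ftol=1e-6)
```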
After iterating the parameter optimization with the trf method, we compare the model trained with optimized parameters and the model trained on the unmodified synthetic dataset by their per-image mIoU distributions on the Cityscapes dataset. Figure 7 depicts the distributions resulting from this evaluation. The DeeplabV3+ model trained with the optimized sensor artifact simulation applied to the synthetic dataset outperforms the baseline and achieves an EMD score of 26.48, decreasing the domain gap by 6.19. The resulting parameters are \(\text {saturation}=2.11\%\), noise \(\sim \mathcal {N}(0,\,3.0000005)\,\), \(\gamma =0.800001\), \(F=4\), and \(ca=0.008000005\). The parameters changed only slightly from the starting point, indicating that the extracted parameters are a good initial choice.
An exemplary visual inspection of the results in Fig. 8 helps to understand the distribution shift and therefore the decreased EMD. While the best prediction performance image (top row) increased only slightly from the synthetic trained model (c) to the sensor artifact optimized model (d), the worst prediction case (bottom row) shows improved segmentation performance for the sensor-artifact-optimized model (d), in this case even better than the Cityscapes trained model (b).
Table 2 Performance results as per-class IoU, overall mIoU, and EMD domain divergence evaluated on Cityscapes with models trained on Cityscapes, our synthetic dataset only, synthetic with randomly parameterized lens artifacts, and synthetic with extracted parameterized lens artifacts. For the latter two, there are models evaluated on Cityscapes with and without optimization of the parameters for the sensor lens artifact simulation. The model trained with optimized extracted parameters achieves the highest performance on the Cityscapes dataset

Model trained on        Optimized  road   sidewalk  building  pole   traffic light  traffic sign  vegetation  sky    human  car    truck  mIoU ↑  EMD ↓
Cityscapes              no         97.50  81.21     92.22     54.54  57.28          70.26         92.33       93.98  82.55  93.67  81.62  81.56   –
w/o artifacts           no         60.46  23.51     71.99     10.70  26.00          13.47         75.72       74.70  51.27  34.46  1.74   40.37   32.67
w/ random artifacts     no         77.86  20.56     60.66     8.21   17.83          7.36          72.18       68.21  50.53  74.01  4.68   41.58   30.03
w/ extracted artifacts  no         80.65  27.17     66.54     10.43  21.64          11.87         68.82       67.99  59.22  70.06  7.39   44.71   29.84
w/ random artifacts     yes        82.89  34.69     71.11     11.29  16.66          9.12          69.81       76.77  61.48  79.50  5.30   45.76   26.58
w/ extracted artifacts  yes        79.62  30.30     75.74     16.30  28.31          13.86         78.75       70.88  59.74  64.06  6.42   47.63   26.48
We compare the overall mIoU performance on the Cityscapes datasets between models trained with the initial unmodified synthetic dataset, the synthetic dataset with random initialized lens artifact parameters, and the synthetic dataset with extracted parameters from Cityscapes with the baseline of a model trained on the Cityscapes dataset. Results are listed in Table 2 (rows 1–4). Additionally, for the random and the extracted parameters, we evaluate the performance with initial and optimized parameters, where the parameters have been optimized by our EMD minimization (rows 5 and 6). While the model without any sensor simulation achieves the lowest overall performance (row 2), the model with random parameter initialization achieves a slightly higher performance (row 3) and is surpassed by the model with the Cityscapes extracted parameters (row 4). Next, we take the models trained with optimized parameters into account (rows 5 and 6). Both models outperform all non-optimized experiment settings in terms of overall mIoU, with the model using optimized extracted parameters from Cityscapes showing the best overall mIoU (row 6). Concretely, the model trained with optimized random starting parameters achieves higher performance on classes road, sidewalk, human, and even significantly on the car class but still falls behind on five of the remaining classes and the overall performance on the Cityscapes dataset (row 5). Further, the random parameter optimized model took over 22 iterations to converge to its local minimum, whereas the optimization of extracted starting parameters only took six iterations until reaching a local minimum, making it more than three times faster to converge. Furthermore, it is shown that all models with applied sensor lens artifacts outperform the model trained without additional lens artifacts.

4.3 EMD Cross-evaluation

To get a deeper understanding of the implications of our EMD score, we evaluate it on a range of real-world and synthetic datasets for semantic segmentation: the real-world datasets A2D2 [GKM+20], Cityscapes (CS) [COR+16], Berkeley Deep Drive (BDD100K) [YCW+20], Mapillary Vistas (MV) [NOBK17], and India Driving Dataset (IDD) [VSN+18], as well as the synthetic GTAV [RVRK16], our synthetic (Synth and SynthOpt) [KI 20], and Synscapes (SYNS) [WU18] datasets. Table 3 depicts the results of the cross-domain analysis measured with the EMD score. The columns denote the dataset a DeeplabV3+ model has been trained on, i.e., the source dataset, whereas the rows denote the datasets it was evaluated on, i.e., the target datasets. Our optimized synthetic dataset achieves lower EMD scores, shown in boldface, than the synthetic baseline. While the decrease of domain divergence is large on the real datasets, the divergence decreases only marginally for the other synthetic datasets. Inspecting the EMD results on all datasets, with the lowest divergence values indicated by underline, the MV dataset proves to be closest to all the other evaluated datasets.
Table 3 Cross-domain divergence results of models trained on different real-world and synthetic datasets and evaluated on various validation or test sets of an average size of 1000 images. The domain divergence is measured with our proposed EMD measure; boldface values indicate the lowest divergence comparing our synthetic (Synth) and synthetic-optimized (SynthOpt) datasets, whereas underlined values indicate the lowest divergence values over all the datasets. The model trained with optimized lens artifacts applied to the synthetic images exhibits a smaller domain divergence than the model trained without lens artifacts

EMD ↓ (columns: trained on; rows: evaluated on)
          A2D2   BDD100K  CS     GTAV   IDD    MV     SYNS   Synth  SynthOpt
A2D2      –      18.70    23.46  37.84  20.03  10.72  46.78  34.95  29.32
BDD100K   6.36   –        9.45   22.14  7.26   1.42   36.33  26.43  21.54
CS        10.90  12.09    –      36.42  13.01  4.08   20.62  32.66  26.48
GTAV      33.28  28.37    29.72  –      30.55  23.30  37.53  36.08  32.94
IDD       24.37  19.83    24.71  34.71  –      12.81  46.23  41.64  36.95
MV        10.63  10.36    14.34  28.35  9.20   –      35.97  30.35  27.03
SYNS      25.45  31.46    23.64  45.16  25.12  23.45  –      43.76  43.56
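Such an asymmetric cross-evaluation matrix can be assembled, for example, as in the following sketch: for each target (row) dataset the reference is the model trained on that target itself, and every other source (column) model is compared against it. All names here are illustrative, not from the authors' code.

```python
# Sketch of how an asymmetric EMD cross-evaluation matrix such as Table 3 can
# be assembled: rows are target datasets (evaluated on), columns are source
# datasets (trained on).
from scipy.stats import wasserstein_distance

def emd_cross_matrix(per_image_miou):
    """per_image_miou[(source, target)]: per-image mIoU of the model trained
    on `source`, evaluated on the validation set of `target`. An entry
    (t, t) is required for every target t as the reference."""
    sources = sorted({s for s, _ in per_image_miou})
    targets = sorted({t for _, t in per_image_miou})
    matrix = {}
    for t in targets:
        reference = per_image_miou[(t, t)]   # model trained on the target itself
        for s in sources:
            if s != t:
                matrix[(t, s)] = wasserstein_distance(
                    per_image_miou[(s, t)], reference)
    return matrix
```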
To set our measure in relation to established domain distance measures, we calculated the FID from each of our considered datasets to one another. The results are shown in Table 4. The FID, defined in (7), is the Wasserstein-2 distance of feature vectors from the InceptionV3 [SVI+16] network sampled on the two datasets to be compared with each other.
Table 4 Cross-domain distance results measured with the Fréchet inception distance (FID). Lowest FID between synthetic (Synth) and synthetic-optimized (SynthOpt) datasets are in boldface, whereas the lowest FID values over all datasets are underlined

FID ↓
          A2D2    BDD100K  CS      GTAV    IDD     MV      SYNS    Synth   SynthOpt
A2D2      –       60.16    98.46   78.16   58.75   41.84   109.35  116.54  121.55
BDD100K   60.16   –        59.90   62.42   52.15   29.66   74.87   115.51  109.08
CS        98.46   59.90    –       85.81   68.92   59.69   43.87   119.97  112.42
GTAV      78.16   62.42    85.81   –       74.08   51.00   89.62   92.51   92.24
IDD       58.75   52.15    68.92   74.08   –       37.36   64.09   118.06  125.30
MV        41.84   29.66    59.69   51.00   37.36   –       70.24   74.46   78.66
SYNS      109.35  74.87    43.87   89.62   64.09   70.24   –       113.77  108.30
Again, boldface values indicate the lowest FID values between the synthetic (Synth) and synthetic-optimized (SynthOpt) datasets, whereas underlined values indicate the lowest values over all datasets. Here, only 4 out of the 7 datasets are closer, as measured by the FID, to the synthetic-optimized dataset than to the original synthetic dataset. Furthermore, the FID considers the CS and the SYNS datasets closer to one another than the EMD divergence measure does, while the MV dataset shows the lowest FID to the other evaluated datasets.
If we evaluate the distance as the minimum per row in both tables, FID and EMD somewhat agree that the Mapillary Vistas dataset is in most cases the dataset closest to all other datasets.
When calculating the minimum per column in both tables, however, the benefit of our asymmetric EMD comes to light. The minimum per-column values of the FID are unchanged due to the diagonal symmetry of the cross-evaluation matrix, which stems from the inherent symmetry of the measure. The EMD, however, regards BDD100K as the closest dataset. An intuitive explanation for the different minimum observations of the EMD is as follows: Training with the many images of the Mapillary Vistas dataset, which exhibit diverse geospatial and sensor properties, covers a very broad domain and results in good generalization capability and therefore evaluation performance. Training with any of the other datasets cannot generalize well to the vast domain of Mapillary Vistas, but it can generalize to the rather constrained domain of BDD100K, which consists of lower-resolution images with heavy compression artifacts, on which even a model trained on BDD100K does not generalize well.
The asymmetric nature of our EMD allows for a more thorough analysis of dataset discrepancies when applied to visual understanding tasks, e.g., semantic segmentation, which cannot be captured by inherently symmetric distance metrics such as the FID. In contrast to [LLFW20], with our evaluation method we could not identify a consistency between the FID and the generalization divergence, i.e., our EMD measure.

5 Conclusions

In this chapter, we demonstrated that by utilizing the per-image performance metric as a proxy distribution for a dataset and the earth mover’s distance (EMD) as a divergence measure between such distributions, one can decrease the visual differences of a synthetic dataset through optimization and increase the viability of CGI for training and validation of perceptive AI. To reinforce our argument for per-image performance measures as proxy distributions, we showed that training an ensemble of a fixed model with different random starting conditions but the same hyperparameters leads to the same per-image performance distributions when these ensemble models are evaluated on the validation set of the training dataset. When utilizing synthetic imagery for validation, the domain gap caused by visual differences between real and computer-generated images hinders the applicability of these datasets. As a step toward decreasing the visual differences, we apply the proposed divergence measure as the cost function of an optimization which varies the parameters of the sensor artifact simulation while trying to re-create the sensor artifacts that the real-world dataset exhibits. As starting point for the sensor artifact parameters, we empirically extracted the values from selected images of the real-world dataset. The optimization measurably reduced the difference between the real-world and the optimized synthetic dataset in terms of the EMD, and we showed that even when starting with randomly initialized parameters we can decrease the EMD and increase the mIoU on the target dataset. When measuring the divergence to other real-world and synthetic datasets after parameter optimization, we showed that the EMD decreases for all considered datasets, whereas, measured by the FID, only four of the datasets are closer. As the EMD is derived from the per-image mIoU, it is an indicator of performance on the target dataset, whereas the FID fails to relate to performance. Effective minimization of the visual difference between synthetic and real-world datasets with the EMD domain divergence measure is one step further toward fully utilizing CGI for the validation of perceptive AI functions.

Acknowledgements

The research leading to these results is funded by the German Federal Ministry for Economic Affairs and Energy within the project “Methoden und Maßnahmen zur Absicherung von KI-basierten Wahrnehmungsfunktionen für das automatisierte Fahren (KI Absicherung)”. The authors would like to thank the consortium for the successful cooperation.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Footnotes
1. Provided by the KI Absicherung project [KI 20].
4. The providers of the Cityscapes dataset don’t document the exact mapping.
Literature
[BB95] W. Burger, M.J. Barth, Virtual reality for enhanced computer vision, in J. Rix, S. Haas, J. Teixeira (eds.), Virtual Prototyping: Virtual Environments and the Product Design Process (Springer, 1995), pp. 247–257
[Bis95] C.M. Bishop, Training with noise is equivalent to Tikhonov regularization. Neural Comput. 7(1), 108–116 (1995)
[BSAG21]
[CLP13] G. Csurka, D. Larlus, F. Perronnin, What is a good evaluation measure for semantic segmentation? in Proceedings of the British Machine Vision Conference (BMVC), Bristol, UK, Sept. 2013, pp. 1–11
[COR+16] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The Cityscapes dataset for semantic urban scene understanding, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 2016, pp. 3213–3223
[CSVJR18] A. Carlson, K.A. Skinner, R. Vasudevan, M. Johnson-Roberson, Modeling camera effects to improve visual learning from synthetic data, in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, Aug. 2018, pp. 505–520
[CSVJR19] A. Carlson, K.A. Skinner, R. Vasudevan, M. Johnson-Roberson, Sensor Transfer: Learning Optimal Sensor Effect Image Augmentation for Sim-to-Real Domain Adaptation, Jan. 2019, pp. 1–8. arXiv:1809.06256
[CV14] V. Chari, A. Veeraraghavan, Radial distortion, in Computer Vision: A Reference Guide, ed. by K. Ikeuchi (Springer, Boston, MA, 2014), pp. 443–445
[CZP+18] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation, in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, Sept. 2018, pp. 833–851
[dCCN+16] G.B.P. da Costa, W.A. Contato, T.S. Nazaré, J.E.S. do Batista Neto, M. Ponti, An Empirical Study on the Effects of Different Types of Noise in Image Classification Tasks, Sept. 2016, pp. 1–6. arXiv:1609.02781
[DDS+09] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, F.-F. Li, ImageNet: a large-scale hierarchical image database, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, June 2009, pp. 248–255
[DRC+17] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, V. Koltun, CARLA: an open urban driving simulator, in Proceedings of the Conference on Robot Learning (CoRL), Mountain View, CA, USA, Nov. 2017, pp. 1–16
[FMWR18] E. Fernandez-Moral, R. Martins, D. Wolf, P. Rives, A new metric for evaluating semantic segmentation: leveraging global and contour accuracy, in Proceedings of the IEEE Intelligent Vehicles Symposium (IV), Changshu, China, June 2018, pp. 1051–1056
[GKM+20] J. Geyer, Y. Kassahun, M. Mahmudi, X. Ricou, R. Durgesh, A.S. Chung, L. Hauswald, V.H. Pham, M. Mühlegg, S. Dorn, T. Fernandez, M. Jänicke, S. Mirashi, C. Savani, M. Sturm, O. Vorobiov, M. Oelker, S. Garreis, P. Schuberth, A2D2: Audi Autonomous Driving Dataset, April 2020, pp. 1–10. arXiv:2004.06320
[GL15] Y. Ganin, V. Lempitsky, Unsupervised domain adaptation by backpropagation, in Proceedings of the International Conference on Machine Learning (ICML), Lille, France, July 2015, pp. 1180–1189
[HRU+17] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, GANs trained by a two time-scale update rule converge to a local Nash equilibrium, in Proceedings of the Conference on Neural Information Processing Systems (NIPS/NeurIPS), Long Beach, CA, USA, Dec. 2017, pp. 6626–6637
[HTP+18] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, T. Darrell, CyCADA: cycle-consistent adversarial domain adaptation, in Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, July 2018, pp. 1989–1998
[HZRS16] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 2016, pp. 770–778
[KI 20] KI Absicherung Consortium, KI Absicherung: Safe AI for Automated Driving (2020). Accessed 18 Nov. 2021
[KPT+17] S. Kolouri, S.R. Park, M. Thorpe, D. Slepcev, G.K. Rohde, Optimal mass transport: signal processing and machine-learning applications. IEEE Signal Process. Mag. 34(4), 43–59 (2017)
[LCWJ15] M. Long, Y. Cao, J. Wang, M.I. Jordan, Learning transferable features with deep adaptation networks, in Proceedings of the International Conference on Machine Learning (ICML), July 2015, pp. 97–105
[LLFW20] Z. Liu, T. Lian, J. Farrell, B. Wandell, Neural network generalization: the impact of camera parameters. IEEE Access 8, 10443–10454 (2020)
[NdCCP18] T.S. Nazaré, G.B.P. da Costa, W.A. Contato, M. Ponti, Deep convolutional neural networks and noisy images, in Proceedings of the Iberoamerican Congress on Pattern Recognition (CIARP), Madrid, Spain, Nov. 2018, pp. 416–424
[NOBK17] G. Neuhold, T. Ollmann, S. Rota Bulò, P. Kontschieder, The Mapillary Vistas dataset for semantic understanding of street scenes, in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, Oct. 2017, pp. 4990–4999
[PTKY09] S.J. Pan, I.W. Tsang, J.T. Kwok, Q. Yang, Domain adaptation via transfer component analysis, in Proceedings of the International Joint Conferences on Artificial Intelligence (IJCAI), Pasadena, CA, USA, July 2009, pp. 1187–1192
[RD14] R. Ramanath, M.S. Drew, Color spaces, in Computer Vision: A Reference Guide, ed. by K. Ikeuchi (Springer, Boston, MA, 2014), pp. 123–132
[RHK17] S.R. Richter, Z. Hayder, V. Koltun, Playing for benchmarks, in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, Oct. 2017, pp. 2232–2241
[RSM+16] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, A.M. Lopez, The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 2016, pp. 3234–3243
[RTC17] A. Ramdas, N. García Trillos, M. Cuturi, On Wasserstein two sample testing and related families of nonparametric tests. Entropy 19(2), 47 (2017)
[RTG+19] H. Rezatofighi, N. Tsoi, J.Y. Gwak, A. Sadeghian, I. Reid, S. Savarese, Generalized intersection over union: a metric and a loss for bounding box regression, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, June 2019, pp. 658–666
[RV19] S.V. Ravuri, O. Vinyals, Seeing is not necessarily believing: limitations of BigGANs for data augmentation, in Proceedings of the International Conference on Learning Representations (ICLR) Workshops, New Orleans, LA, USA, June 2019, pp. 1–5
[RVRK16] S.R. Richter, V. Vineet, S. Roth, V. Koltun, Playing for data: ground truth from computer games, in Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, Oct. 2016, pp. 102–118
[SAS+18] F.S. Saleh, M.S. Aliakbarian, M. Salzmann, L. Petersson, J.M. Alvarez, Effective use of synthetic data for urban scene semantic segmentation, in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, Sept. 2018, pp. 84–100
[SGH20] Q.S. Sha, O. Grau, K. Hagn, DNN analysis through synthetic data variation, in Proceedings of the ACM Computer Science in Cars Symposium (CSCS), virtual conference, Dec. 2020, pp. 1–10
[SGZ+16] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved Techniques for Training GANs, June 2016, pp. 1–10. arXiv:1606.03498
[SLA+15] J. Schulman, S. Levine, P. Abbeel, M. Jordan, P. Moritz, Trust region policy optimization, in Proceedings of the International Conference on Machine Learning (ICML), Lille, France, July 2015, pp. 1889–1897
[Stu14] P. Sturm, Pinhole camera model, in Computer Vision: A Reference Guide, ed. by K. Ikeuchi (Springer, Boston, MA, 2014), pp. 610–613
[SVI+16] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 2016, pp. 2818–2826
[THS+18] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, M. Chandraker, Learning to adapt structured output space for semantic segmentation, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, June 2018, pp. 7472–7481
[THSD17] E. Tzeng, J. Hoffman, K. Saenko, T. Darrell, Adversarial discriminative domain adaptation, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, July 2017, pp. 2962–2971
[VGO+20] P. Virtanen, R. Gommers, T.E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S.J. van der Walt, M. Brett, J. Wilson, K.J. Millman, N. Mayorov, A.R.J. Nelson, E. Jones, R. Kern, E. Larson, C.J. Carey, İ. Polat, Y. Feng, E.W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E.A. Quintero, C.R. Harris, A.M. Archibald, A.H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors, SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020)
[VSN+18] G. Varma, A. Subramanian, A. Namboodiri, M. Chandraker, C.V. Jawahar, IDD: A Dataset for Exploring Problems of Autonomous Navigation in Unconstrained Environments, Nov. 2018, pp. 1–9. arXiv:1811.10200
[WU18] M. Wrenninge, J. Unger, Synscapes: A Photorealistic Synthetic Dataset for Street Scene Parsing, Oct. 2018, pp. 1–13. arXiv:1810.08705
[YCW+20] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, T. Darrell, BDD100K: a diverse driving dataset for heterogeneous multitask learning, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), virtual conference, June 2020, pp. 2636–2645
Metadata
Title
Optimized Data Synthesis for DNN Training and Validation by Sensor Artifact Simulation
Authors
Korbinian Hagn
Oliver Grau
Copyright Year
2022
DOI
https://doi.org/10.1007/978-3-031-01233-4_4
