
Open Access 2022 | OriginalPaper | Chapter

A Variational Deep Synthesis Approach for Perception Validation

Authors: Oliver Grau, Korbinian Hagn, Qutub Syed Sha

Published in: Deep Neural Networks and Data for Automated Driving

Publisher: Springer International Publishing


Abstract

This chapter introduces a novel data synthesis framework for the validation of perception functions based on machine learning, with the aim of ensuring the safety and functionality of these systems, specifically in the context of automated driving. The main contributions are the introduction of a generative, parametric description of three-dimensional scenarios in a validation parameter space and a layered scene generation process that reduces the computational effort. Specifically, we combine a module for probabilistic scene generation, a variation engine for scene parameters, and a more realistic simulation of sensor artifacts. The work demonstrates the effectiveness of the framework for the perception of pedestrians in urban environments based on various deep neural networks (DNNs) for semantic segmentation and object detection. Our approach allows a systematic evaluation of a high number of different objects and, combined with our variational approach, effectively simulates and tests a wide range of additional conditions, e.g., various illuminations. We demonstrate that our generative approach produces a better approximation of the spatial object distribution of real datasets than hand-crafted 3D scenes.

1 Introduction

This chapter introduces an automated data synthesis approach for the validation of perception functions, based on generative and parameterized synthetic data generation. We introduce a multi-stage strategy to sample the input domain of the possible generative scenario and sensor space and discuss techniques to reduce the vast computational effort required. This concept extends and generalizes our previous work on the parameterization of the scene parameters of concrete scenarios, called the validation parameter space (VPS) [SGH20]. We extend this parameterization by a probabilistic scene generator, which widens the coverage of the generated scenarios, and by a more realistic sensor simulation, which also allows varying and simulating different sensor characteristics. This ‘deep’ synthesis concept goes beyond currently available systems (as discussed in the next section) and beyond manually, i.e., human-operator-generated, synthetic data. We describe how our synthetic data validation engine makes use of the parameterized, generative content to implement a tool supporting complex and effective validation strategies.
Perception is one of the hardest problems to solve in any automated system. Recently, great progress has been made in applying machine learning techniques to deep neural networks to solve perception problems. Automated vehicles (AVs) are a recent focus as an important application of perception from cameras and other sensors, such as LiDAR and RaDAR [YLCT20]. Although the current main effort is on developing the hardware and software to implement the functionality of AVs, it will be equally important to demonstrate that this technology is safe. Universally accepted methodologies for validating the safety of machine learning-based systems are still an open research topic.
Techniques to capture and render models of the real world have matured significantly over the last decades and are now able to synthesize virtual scenes in a visual quality that is hard to distinguish from real photographs for human observers. Computer-generated imagery (CGI) is increasingly popular for training and validation of deep neural networks (DNNs) (see, e.g., [RHK17, Nik19]). Synthetic data can avoid privacy issues found with recordings of members of the public and can automatically produce ground truth data at higher quality and reliability than costly manually labeled data. Moreover, simulations allow the synthesis of rare scene constellations, which helps the validation of products targeting safety-critical applications, specifically automated driving.
Due to the progress in visual and multi-sensor synthesis, building data-center-based systems for the validation of these complex systems is now becoming feasible and offers more possibilities for integrating intelligent techniques into the engineering process of complex applications. We compare our approach with methods and strategies targeting the testing of automated driving [JWKW18].
The remainder of this chapter is structured as follows: The next section gives an outline of related work in the field. In Sect. 3 we give an overview of our approach. Section 4 outlines our synthetic data validation engine and our parameterization, including a realistic sensor simulation and the efficient computation of the required variations. In Sect. 5 we present evaluation results, followed by some concluding remarks in Sect. 6.

2 Related Work

The use of synthesized data for development and validation is an established technique and has also been suggested for computer vision applications (e.g., [BB95]). Several methodologies for the verification and validation of AVs have been developed [KP16, JWKW18, DG18], and commercial options exist.1 These tools were originally designed for virtual testing of automotive functions, such as braking systems, and were then extended to provide simulation and management tools for virtual test drives in virtual environments. They provide real-time-capable models for vehicles, roads, drivers, and traffic, which are used to generate test (sensor) data, as well as APIs for users to integrate the virtual simulation into their own validation systems.
The work presented in this chapter focuses on the validation of perception functions, which are an essential module of automated systems. By separating the perception as a component, the validation problem can be decoupled from the validation of the full driving stack. Moreover, this separation allows, on the one hand, the implementation of various more specialized validation strategies; on the other hand, there is no need to simulate dynamic actors and the connected problem of interrelations between them and the ego-vehicle. The full interaction of objects is targeted by upcoming standards like OpenScenario.2
Recently, specifically in the domain of driving scenarios, game engines have been adopted for synthetic data generation by extracting in-game images and labels from the rendering pipeline [WEG+00, RVRK16]. Another virtual simulator system, which gained popularity in the research community, is CARLA [DRC+17], also based on a commercial game engine (Unreal 4 [Epi04]). Although game engines provide a good starting point to simulate environments, they usually only offer a closed rendering setup with many trade-offs balancing real-time constraints against a subjectively good visual appearance for human observers. Specifically, the lighting computation in these rendering pipelines is limited and does not produce physically correct imagery. Instead, game engines deliver a fixed rendering quality, typically with 8 bits per RGB color channel and only basic shadow computation.
In contrast, physically based rendering techniques have been applied to the generation of data for training and validation, as in the Synscapes dataset [WU18]. For our experimental deep synthesis work, we use the physically based open-source Blender Cycles renderer3 in high dynamic range (HDR) resolution, which allows a realistic simulation of illumination and sensor characteristics, increasing the coverage of our synthetic data in terms of scene situations and optical phenomena occurring in real-world scenarios.
The influence of sensor and lens effects on perception performance has received little study so far. In [CSVJR18, LLFW20], the authors model camera effects to improve synthetic data for the task of bounding box detection. Metrics and parameter estimation of the effects from real camera images are suggested by [LLFW20] and [CSVJR19]. A sensor model including sensor noise, lens blur, and chromatic aberration was developed based on real datasets [HG21] and integrated into our validation framework.
Looking at virtual scene content, the most recent simulation systems for the validation of a complete AD system include simulation and testing of the ego-motion of a virtual vehicle and its behavior. The test content or scenarios used are therefore aimed at simulating environments spanning a huge virtual space, in which a high number of test miles (or km) are then driven virtually [MBM18, WPC20, DG18]. Although this might be a good strategy to validate full AD stacks, one remaining problem for the validation of perception systems is the limited coverage of critical scene constellations (sometimes called ‘corner cases’) and of parameters that lead to a drop in the performance of the DNN perception.
A more suitable approach is to use probabilistic grammar systems [DKF20, WU18] to generate 3D scenarios, which draw from a catalog of different object classes and place objects relative to each other to cover the complexity of the input domain. In this chapter we demonstrate the effectiveness of a simple probabilistic grammar system, combined with our previous scene parameter variation [SGH20], in a novel multi-stage strategy. This approach allows systematically testing conditions and relevant parameters for the validation of perception functions in a structured way.

3 Concept and Overview

The novelty of the framework introduced in this chapter is the combination of modules for parameterized generation and testing of a wide range of scenarios and scene parameters as well as sensor parameters. It is tailored towards exploration of factors that (hypothetically) define and limit the performance of perception modules.
A core design feature of the framework is the consistent parameterization of the scene composition, scene parameters, and sensor parameters into a validation parameter space (VPS), as outlined in Sect. 4.2. This parameterization only considers the near proximity of the ego-car or sensor; in other words, only the objects visible to the sensor are generated. This allows a much better-defined test of constellations involving a specific number of object types, the environment topology (e.g., types and dimensions of streets), and the relation of objects, usually an implicit function of where objects are positioned relative to each other in the scene.
This leads to a different data production and simulator pipeline than for conventional AV validation which typically provides a virtual world with a large extent to simulate and test the driving functions down to a physical level, inspired by real-world validation and test procedures [KP16, MBM18, DG18, JWKW18, WPC20].
Figure 1 shows the building blocks of our VALERIE system. The system runs an expansion of the VPS specified in the ‘validation task’ description. Our current implementation is based on a probabilistic description of how to generate the scene and defines the parameter variations in the parameter space. In the future, the validation task should also include a more abstract target description of the evaluation metrics.
The data synthesis block consists of three sub-components: The probabilistic scene generator generates a scene constellation, including a street layout, and places three-dimensional objects from the asset database according to the probabilistic placement rules laid out in the scenario preparation. The parameter variation generator produces variations of that scene, including sun and light settings and variations of the placement of objects (see Fig. 2 for some examples). The sensor & environment simulation uses a rendering engine to compute a realistic simulation of the sensor impressions.
Further, ground truth data is provided through the rendering process, for example a pixel-accurate depth map (distance from the camera to the scene object) or metadata such as pixel-wise label identifiers of classes or object instances. Depending on the perception task, this information is used for training and evaluation, e.g., of semantic segmentation (see Sect. 4.5).
The output of the sensor simulation is passed to the perception function under test, and the response to that data is computed. An evaluation metric specific to the validation task is then computed from the perception response. The ground truth data generated by the rendering process is usually required here, e.g., to compute the similarity to the known appearance of objects. In the experiments presented in this chapter we used established performance metrics for DNNs, such as the mean intersection-over-union (mIoU) metric, as introduced by [EVGW+15].
The parameterization along with the computation flow are described in detail in the next section.

4 VALERIE: Computational Deep Validation

The goal of validation is usually to demonstrate that the perception function (DNN) is performing to a level defined by the validation goal for all cases included and specified in the ODD (operational design domain) [SAE18].
The framework presented in this contribution supports the validation of perception functions with the data synthesis modules outlined above. Further, we suggest considering three levels or layers, which are supported in our framework:
1. The scene variation generates 3D scenes using a probabilistic grammar.
2. The scene parameter variation generates variations of scene parameters, such as moving objects or changing the illumination of the scene by changing the time of the day.
3. The sensor variation generates sensor faults and artifacts.
For an actual DNN validation, an engineer or team would build various scenarios and specify variations within these scenarios. The variations can contain lists of alternative objects (assets), different topologies and poses (including the position and orientation of objects), expansions of streets, and object and global parameters such as the direction of the sun. Our modular multi-level approach enables strategies to sample the input space, typically by applying scene variation, which can then be combined with more in-depth parameter variation runs and with variations of sensor parameters, as required.
Specifically, the ability to combine either two or all three levels of our framework allows covering a wider range of object and scene constellations. In particular, with our integrated asset management approach, a validation engineer can ensure that certain, e.g., known critical, object classes are included in the validation runs, i.e., the coverage of these object classes can be controlled explicitly. In combination with our parameter variation sub-system, local changes such as the relative positioning of critical objects and the positioning of the ego-vehicle or camera, as well as global scene parameters such as the sun angle, can then be varied.

4.1 Scene Generator

In computer graphics, a scene is considered a collection \(\mathcal {O}=\{\mathbf {o}_1,\mathbf {o}_2,\ldots \}\) of objects \(\mathbf {o}_i\), which are usually organized in scene graphs (see, e.g., [Wer94]). This model is also the basis for file format specifications to exchange 3D models and scenes, e.g., VRML4 or glTF.5
Each object in this graph can have a position, orientation, and scale in a (world) coordinate system. These are usually combined into a transformation matrix \(\mathbf {T}\). Several parameterizations for position and orientation are possible: the position is usually given as a Cartesian 3D vector, while for the orientation notations such as Euler angles or quaternions are common.
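As a reminder of the standard convention (not specific to this chapter), position \(\mathbf {t}\), rotation \(\mathbf {R}\) (e.g., built from Euler angles or a quaternion), and a uniform scale \(s\) can be combined in homogeneous coordinates as
$$\begin{aligned} \mathbf {T} = \begin{pmatrix} s\,\mathbf {R} & \mathbf {t} \\ \mathbf {0}^\top & 1 \end{pmatrix}, \end{aligned}$$
a \(4\times 4\) matrix that can be concatenated along the scene graph hierarchy.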
Objects \(\mathbf {o}_i\) are described by their geometry, e.g., as a triangular mesh, and their appearance (material).
Sensors, such as a camera, can also be represented in a scene graph, and so can light sources. Both also have a position and orientation, and accordingly the same transformation matrices as for objects can be applied (except for scaling).
The probabilistic scene generator (depicted in Fig. 1) places objects \(\mathbf {o}_i\) according to a grammar file that specifies rules for placements and orientations in specific areas of the 3D scene. The example json file in Table 1 specifies the placement of tree objects in a rectangular area of the scene: the tree objects are randomly drawn from a database, and the field assets_list specifies a list of possible assets in universally unique identifier (UUID) notation. The 3D models in the database are tagged with a class_id, which specifies the type of object, e.g., humans or buildings. The class information is used in the generation of meta-data, semantic class labels, and an instance object identifier, which allows determining the originating 3D object at pixel level.
Table 1
Example placement description (json format)
https://static-content.springer.com/image/chp%3A10.1007%2F978-3-031-01233-4_13/MediaObjects/514228_1_En_13_Tab1_HTML.png
The placement and orientation are determined by a pseudo-random number generator with a controlled seed, which makes it possible to exactly re-run experiments if required. The scene generator handles further constraints, such as specific target densities in specific areas and distance ranges between objects, and it finally checks that a placed object neither collides nor intersects with other objects in the scene.
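To make this placement logic concrete, the following is a minimal Python sketch of a seeded, collision-checked placement in a rectangular area; the field names (assets_list, count, area bounds) and the distance-based collision test are illustrative assumptions, not the actual generator code.

```python
import math
import random

def collides(a, b, min_dist=1.0):
    # Simple distance-based overlap test; a stand-in for a real mesh intersection check.
    (ax, ay), (bx, by) = a["pos"], b["pos"]
    return math.hypot(ax - bx, ay - by) < min_dist

def place_objects(rule, seed=42, max_attempts=1000):
    # The controlled seed makes every generated scene exactly reproducible.
    rng = random.Random(seed)
    placed, attempts = [], 0
    while len(placed) < rule["count"] and attempts < max_attempts:
        attempts += 1
        candidate = {
            "asset": rng.choice(rule["assets_list"]),                        # draw asset UUID
            "pos": (rng.uniform(*rule["x_range"]), rng.uniform(*rule["y_range"])),
            "yaw_deg": rng.uniform(0.0, 360.0),                              # random orientation
        }
        # Reject placements violating the minimum distance to already placed objects.
        if all(not collides(candidate, other) for other in placed):
            placed.append(candidate)
    return placed

# Example rule: place three trees inside a 20 m x 5 m strip next to the street.
rule = {"assets_list": ["uuid-tree-a", "uuid-tree-b"], "count": 3,
        "x_range": (0.0, 20.0), "y_range": (0.0, 5.0)}
print(place_objects(rule))
```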
The street base is also part of the input description for the scene generator. A street can be varied in width, type of crossings, and textures for street and sidewalk. In the current simplistic implementation, the street base is limited to a flat ‘lego world’, i.e., only rectangular structures are implemented. Each call of the scene generator generates a different randomized scene according to the rules in the generation description. Figure 2 shows scenes generated by a number of runs of the scene generator.

4.2 Validation Parameter Space (VPS)

Another core aspect of our validation approach is to parameterize all possible variations of scene, sensor parameters, and states in a unified validation parameter space (VPS): Objects in a scene graph can be manipulated by considering their properties or attributes as a list of variable parameters. A qualitative overview of those parameters is given in Table 2. Most attributes are of geometrical nature, but also materials or properties of light sources can be varied, as depicted in Fig. 3.
Table 2
Overview of parameters to vary in a scene (object class: variable parameters)
Static objects, e.g., buildings: limited to position, orientation, and size
Streets, roads: geometry (e.g., position, size of lanes, etc.), friction (as a function of weather conditions)
Vehicles: \(\mathbf {T}_\text {v}\) = (position, orientation), trajectory \(\mathcal {T}_\text {v}=(\mathbf {T}_\text {v}(t))\)
Humans (pedestrians): \(\mathbf {T}_\text {p}\) = (position, orientation), trajectory \(\mathcal {T}_\text {p}=(\mathbf {T}_\text {p}(t))\)
Environment: light, weather conditions
Sensors: \(\mathbf {T}_\text {s}\) = (position, orientation), trajectory \(\mathcal {T}_\text {s}=(\mathbf {T}_\text {s}(t))\), sensor attributes
In addition to static properties, a scene graph can include object properties that vary over time. Some of them are already included in Table 2, such as the trajectories of objects and sensors, indicated as \(\mathcal {T}=(\mathbf {T}(t))\), with discrete time instants t. Computer graphics systems handle these temporal variations, also known as animations, and in principle any attribute can be varied over time by these systems.
We introduce an important restriction in the current implementation of our validation and simulation engine: our animations are fixed, i.e., they do not change during the runtime of the simulation, in order to allow deterministic and repeatable object appearance, such as the poses of characters. This would be different, for example, when a complete autonomous system is simulated, as the actions of the system might change the way other scene agents react. We return to these aspects in the discussion and outlook and discuss how they could be mitigated.
For use in our validation engine, as described in the next section, we augment the description of the scene in a scene graph (the asset), as outlined above, with an explicit description of those parameters that are variable in a validation run. Currently, our engine considers a list of numerical parameters, each described by a unique identifier and further attributes.
A specific example describing variations of the position of a person in a 2D plane in pseudo markup notation is given below.
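The original markup listing is not reproduced in this version; the following Python-style sketch is an assumed reconstruction, using only the names that appear in the surrounding text (p1, person-1, pos.x, pos.y):

```python
# Assumed reconstruction of a variable-parameter declaration; the exact syntax of the
# original pseudo markup is not known, only the identifiers are taken from the text.
variable_parameters = {
    "p1": {"path": "scene.person-1.pos.x", "type": "float", "unit": "m"},
    "p2": {"path": "scene.person-1.pos.y", "type": "float", "unit": "m"},
}
```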
Parameters, such as p1, are unique parameter identifiers used in the validation engine to produce and test variations of the scene.

4.3 Computation of Synthetic Data

Synthetic data is generated using computer graphics methods. Specifically for color (RGB) images, many software systems are available, both commercially and as open source. For our experiments in this chapter, we use Blender,6 as this tool allows importing, editing, and rendering of 3D content, including scripting.
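As a hedged sketch of such scripted rendering (file names and settings are placeholders, not taken from the chapter), a Blender Python script could look as follows:

```python
import bpy  # Blender's Python API; run inside Blender, e.g., blender -b -P render_scene.py

# Load a generated scene file and render it with the physically based Cycles engine.
bpy.ops.wm.open_mainfile(filepath="generated_scene.blend")  # placeholder scene file
scene = bpy.context.scene
scene.render.engine = "CYCLES"                              # physically based path tracer
scene.render.image_settings.file_format = "OPEN_EXR"        # keep linear HDR output
scene.render.filepath = "/tmp/frame_0001"                   # placeholder output path
bpy.ops.render.render(write_still=True)
```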
The generation of synthetic data involves the following steps: First, a 3D scene model with a city model and pedestrians is generated using the probabilistic scene generator and is stored in one or more files.
The scene files are loaded into one scene graph; each object has a unique identifier and can be addressed by a path-like naming convention.
For the example used in Sect. 4.2, scene.person-1.pos.x refers to a path from the root object scene to the object person-1 and addresses the attribute pos.x in person-1. The object names are composed as ObjectClass-ObjectInstanceID. These conventions are used to assign class or instance labels during ground truth generation.
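A small Python sketch of resolving such a path against a nested scene-graph structure (the dictionary layout is an assumption; only the dotted naming convention follows the text):

```python
def resolve(scene_graph, path):
    # Walk the dotted path, skipping the root name "scene".
    node = scene_graph
    for key in path.split(".")[1:]:
        node = node[key]
    return node

scene_graph = {"person-1": {"pos": {"x": 2.0, "y": 0.5}}}
assert resolve(scene_graph, "scene.person-1.pos.x") == 2.0
```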
The labels for object classes are mapped to a convention used in annotation formats (i.e., as used with the Cityscapes dataset [COR+16]) for training and evaluation of the perception function. The 2D image of a scene is computed along with the ground truth extracted from the rendering engine of the modeling software.
Using a second parameter pos.y, as included in the example in Sect. 4.2, would allow the positioning of the person in a plane, spanned by x- and y-axis of the coordinate system defined by the scene graph.

4.4 Sensor Simulation

We implemented a sensor model with the function blocks described in the chapter ‘Optimized Data Synthesis for DNN Training and Validation by Sensor Artifact Simulation’ [HG22], Sect. 3.1 ‘Sensor Simulation’, and depicted in Fig. 2. The module expects images in linear RGB space and floating point resolution as provided by the state-of-the-art rendering software.
We simulate a camera error model by applying sensor noise as additive Gaussian noise (with zero mean and freely selectable variance) and an automatic, histogram-based exposure control (linear tone-mapping), followed by a non-linear gamma correction. Further, we simulate the lens artifacts chromatic aberration and blur. Figure 4 shows a comparison of the standard tone-mapped 8-bit RGB output of Blender (left) with our sensor simulation. The parameters were adapted to match the camera characteristics of the Cityscapes images. The images not only look more realistic to the human eye, they also close the domain gap between synthetic and real data (for details see the chapter ‘Optimized Data Synthesis for DNN Training and Validation by Sensor Artifact Simulation’ [HG22]).
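The following numpy sketch illustrates this processing chain in a strongly simplified form (parameter values are placeholders; lens blur and chromatic aberration are omitted, and the calibrated model is described in [HG22]):

```python
import numpy as np

def simulate_sensor(rgb_linear, noise_var=0.001, gamma=2.2, percentile=99.0):
    # Additive zero-mean Gaussian noise on the linear HDR image.
    noisy = rgb_linear + np.random.normal(0.0, np.sqrt(noise_var), rgb_linear.shape)
    # Crude histogram-based auto-exposure: normalize by a high percentile (linear tone-mapping).
    exposure = max(np.percentile(noisy, percentile), 1e-8)
    tone_mapped = np.clip(noisy / exposure, 0.0, 1.0)
    # Non-linear gamma correction and quantization to 8 bit.
    return (tone_mapped ** (1.0 / gamma) * 255.0).astype(np.uint8)

hdr = np.random.rand(256, 256, 3).astype(np.float32)  # stand-in for a rendered linear image
img_8bit = simulate_sensor(hdr)
```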

4.5 Computation and Evaluation of Perceptional Functions

Perception functions comprise a multitude of different approaches, considering the wide range of different tasks. For the experiments presented in this chapter, we consider the tasks of semantic segmentation and 2D bounding box detection. In the first task, the perception function segments an input image into different objects by assigning a semantic label to each pixel of the input image. One of the main advantages of semantic segmentation is the visual representation of the task, which can easily be understood and analyzed for flaws by a human.
For semantic segmentation we consider two different topologies: DeeplabV3+, as proposed in [CPK+18], and Detectron2 [WKM+19]; both utilize ResNet101 [HZRS16] backbones.
These algorithms are trained on three different datasets to create three different models for evaluation. The first dataset is the Cityscapes dataset [COR+16], a collection of European urban street scenes recorded during daytime with good to medium weather conditions. The second dataset is A2D2 [GKM+20]; similar to the Cityscapes dataset, it is a collection of European urban street scenes and additionally contains sequences from driving on a motorway. The last dataset, KI-A tranche 3, is a synthetic dataset provided by BIT-TS, a project partner of the KI-Absicherung project,7 consisting of urban street scenes inspired by the preceding two real-world datasets. All of these datasets are labeled on a subset of 11 classes that are shared across the datasets, to provide comparability between the results of the different trained and evaluated models.
For the second task, 2D bounding box detection, we utilize the single-shot multibox detector (SSD) by [LAE+16], a 2D bounding box detector trained on the synthetic data for pedestrian detection. This bounding box detector is applied to our variational data in Sect. 5.
To measure the performance on the task of semantic segmentation, the mean intersection-over-union (mIoU) from the COCO semantic segmentation benchmark task is used [LSD15]. The mIoU is defined as the intersection between the predicted semantic label classes and their corresponding ground truth, divided by the union of the same, averaged over all classes. Another performance measure utilized is the pixel accuracy (pAcc), which is defined as follows:
$$\begin{aligned} pAcc = \frac{TP+TN}{TP+FP+FN+TN}. \end{aligned}$$
(1)
The numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are used to calculate pAcc, which can also be seen as a measure of correctly predicted pixels over all pixels considered for evaluation.
For the 2D bounding box detection, we are interested in cases where, according to our definition, the performance-limiting factors are within bounds for which the network should still be able to correctly predict a reasonable bounding box for each object to detect. For each synthesized and inferred image, the true positive rate (TPR) is calculated. The TPR is defined as the number of correctly detected objects (TP) over the sum of correctly detected and undetected objects (TP+FN). As we are interested in prediction failure cases, we can then filter out all images with a TPR of 1 and are left with images in which the detector has missed objects.
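A compact numpy sketch of these measures, operating on per-pixel class-ID maps and on object-level detection counts, could look as follows (a simplified illustration, not the exact evaluation code):

```python
import numpy as np

def miou_and_pacc(pred, gt, num_classes):
    # mIoU: per-class intersection over union, averaged over classes that occur at all.
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    # pAcc: correctly predicted pixels over all pixels.
    pacc = float((pred == gt).mean())
    return float(np.mean(ious)), pacc

def true_positive_rate(tp, fn):
    # TPR over objects to detect: TP / (TP + FN).
    return tp / (tp + fn) if (tp + fn) > 0 else 1.0
```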

4.6 Controller

The VALERIE controller (depicted in Fig. 1 as validation flow control) executes the validation run. This run can be configured in multiple ways, depending on how much synthetic data is generated and evaluated. Two aspects have a major influence on this: first, the specification of the parameters to be varied, and second, the sampling strategy used, which also depends on the validation goal. Both aspects are briefly described in the following.
Specification of variable validation parameters: As outlined in Sect. 4.2, the approach depends on the provision of a generative scene model. This consists of a parameterized 3D scene model and includes 3D assets in the form of static and dynamic objects. On top of this, we define variable parameters in this scene as an explicit list, as explained in Sect. 4.2.
For the specification of a validation run, all or a subset of these parameters are selected, and a range and a sampling distribution for each selected parameter are added. For example, to vary the x-position of a person in the scene along a line with a uniform (homogeneous) distribution and a step size of 1 m, we define the specification shown below.
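The original listing is not reproduced here; an assumed reconstruction, consistent with the explanation in the following paragraph, could read:

```python
# Assumed reconstruction of a sampling specification (the syntax is illustrative):
# p1 is sampled uniformly in the range [1.5, 5.5] m with an initial step size of 1.0 m.
sampling_spec = {"p1": ("UNIFORM", 1.5, 5.5, 1.0)}
```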
The entries refer to the following: parameter p1 refers to the parameter declaration of the x position of person-1 in the example of Sect. 4.2. The field UNIFORM selects a uniform sampling distribution; other modes include GAUSSIAN (Gaussian distribution). The values 1.5, 5.5, 1.0 denote the parameter range [1.5, 5.5] and the initial step size of 1 m.
Sampling of variable validation parameters: The actual expansion or sampling of the validation parameter space can be further configured and influenced in the VALERIE controller by selecting a sampler and validation strategy or goal.
The sampler object provides an interface to the controller to the validation parameter space, considering the parameter ranges and optionally the expected parameter distribution. We support uniform and Gaussian distributions.
In our current implementation, the controller can be configured to either sample the validation parameter space by a full grid search, or by a Monte-Carlo random sampling.
However, the step size can be iteratively adapted depending on the validation goal. One option is to automatically refine the search for edge cases (or corner cases) in the parameter space: as an edge case, we consider a parameter instance at which the evaluation function changes from an ‘acceptable’ state to a ‘failed’ state (using a continuous performance metric). For our use case of person detection, that means a drop of the performance metric below a threshold.
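A rough Python sketch of such an adaptive, one-dimensional edge-case search is given below; the threshold handling and the step-halving scheme are assumptions for illustration, not the actual controller logic.

```python
def refine_edge_case(evaluate, lo, hi, step, threshold=0.5, min_step=0.05):
    # Grid-sample one parameter, then zoom into the first transition between an
    # 'acceptable' sample (metric >= threshold) and a 'failed' one (metric < threshold).
    while step >= min_step:
        xs = [lo + i * step for i in range(int(round((hi - lo) / step)) + 1)]
        scores = [evaluate(x) for x in xs]   # e.g., render + run the DNN + compute mIoU
        edges = [(xs[i], xs[i + 1]) for i in range(len(xs) - 1)
                 if (scores[i] >= threshold) != (scores[i + 1] >= threshold)]
        if not edges:
            return None                      # no performance drop inside the range
        lo, hi = edges[0]
        step /= 2.0
    return (lo + hi) / 2.0

# Toy example: detection quality drops once the person moves past x = 3.2 m.
print(refine_edge_case(lambda x: 0.9 if x < 3.2 else 0.2, lo=1.5, hi=5.5, step=1.0))
```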
Other validation goals we are planning to implement could be the automated determination of sensitive parameters or (ultimately) more intelligent search through high-dimensional validation parameter spaces.

4.7 Computational Aspects and System Scalability

Our approach is designed for execution in data centers. The implementation of the components described above is modular and makes use of containerized modules using Docker.8 For the actual execution of the modules, we use the Slurm9 scheduling tool, which allows running our validation engine with a high number of variants in parallel, enabling the exploration of many states in the validation parameter space.
The results presented here were produced on an experimental setup using six dual-Xeon server nodes, each equipped with 380 GB RAM. The runtime of the pipeline outlined above is mainly determined by the rendering and is in the order of 10 to 15 min per frame, using the high-quality physically based rendering (PBR) Cycles render engine.

5 Evaluation Results and Discussion

To evaluate the effectiveness of our data synthesis approach, we conducted experiments in which we generated scenes, varied a few important parameters, and then evaluated the perception performance, including an analysis of performance-limiting factors such as occlusions and the distance to objects.
We used our scene generator to generate variations of street crossings, as depicted in Fig. 2. For these examples, a base ground is generated first, with flexible topology (crossings, T-junctions) and dimensions of streets, sidewalks, etc. In the next step, buildings, persons, and other objects, including cars, traffic signs, etc., are selected from a database and randomly placed by the scene generator, taking into account the probabilistic description and rules. The approach can handle any number of object assets. The current experimental setup includes a total of about 500 assets, with about 60 different buildings, 180 different person models, and other objects, including vegetation, vehicles, and so on.
Scene parameter variation: Within the generated scenes, we vary the position and orientations of persons and some occluding objects.
Further, we change the illumination by changing the time of day. This has two main effects: first, it changes the illumination intensity and color (dominant at sunset and sunrise), and second, it generates a variation of the shadows cast into the scene. In our experience, particularly the latter creates challenging situations for the perception.
Comparison of object distributions: Figure 5 shows the spatial distribution of persons in (a) the Cityscapes dataset, (b) the KI-A tranche 3 dataset, and (c) a dataset created using our generative scene generator, as depicted in Figs. 2 and 3. The diagrams present a top view of the respective sensor (viewing cone), and the color encodes the frequency of persons within the sensor viewing cone, i.e., they give a representation of the distance and direction of persons over all considered frames of the dataset. The real-world Cityscapes dataset has a distribution in which most persons are located left and right of the center, i.e., of the street. There are slightly more persons on the right side, which can be explained by the fact that sidewalks on the left-hand side are often occluded by vehicles on the other side of the road. The distribution of our dataset resembles, as expected, this distribution, with slightly less occupation in the distance. In contrast, the distribution of the KI-A tranche 3 dataset shows a very sharp concentration in what corresponds to a narrow band on the sidewalks of their 3D simulation.
Influence of different occluding objects on detection performance: A number of object and attribute variations are depicted in Fig. 6. On the left side, the SSD bounding box detector [LAE+16] is applied to three images with different occluding objects in front of a pedestrian. In all three images, two bounding boxes are predicted for the same pedestrian: while one bounding box includes the whole body, the second bounding box only covers the non-occluded upper part of the pedestrian. On the right side, the DeeplabV3+ model trained on KI-A tranche 3 is used to create a semantic map of the same three images. Apart from the arms, the pedestrian is detected, even partially through the occluding fence. However, another interesting observation can be made: the ground the pedestrian stands on is always labeled as sidewalk. We interpret this as an indication of a bias in the training data, as the training data does not include enough images of pedestrians on the road, only on the sidewalk. This hypothesis is further strengthened when we inspect the pedestrian distributions in Fig. 5b, where the pedestrians are distributed narrowly left and right of the street in the middle. Additionally, neither the bounding box prediction nor the semantic segmentation includes the pedestrian’s arms. This can also be attributed to a bias in the training data.
Influence of noise on detection performance: An experiment demonstrating our sensor simulation determines the influence of sensor noise on the predictive performance. In Fig. 7, Gaussian noise with increasing variance is applied to an image, and three DeeplabV3+ models, trained on A2D2, Cityscapes, and a synthetic dataset, respectively, are used to predict on the data. While the image color pixels are represented in the range \(x_i\in [0,255]\), the noise variance is in the range \(\sigma ^2\in [0,20]\) with a step size of 1. For each noise variance step, the mIoU performance metric of the prediction on the image is calculated per model. While the models trained on Cityscapes and the synthetic dataset initially increase in performance, the predictive performance of all models ultimately decreases with an increasing level of sensor noise. The initial increase can be explained by the domain shift from training to validation data, as a small noise variance can be observed in the training data.
Analysis of performance-limiting factors: Some scene parameters have a major influence on the perception performance. These include the occlusion rate of objects, with totally occluded objects obviously not being detectable, and the object size (in pixels) in the images, also with a natural boundary at which detection breaks down if the object size is too small. Other performance-limiting factors include contrast and other physically observable parameters.
To measure the influence of, or the sensitivity of perception functions to, performance-limiting factors, we designed an experiment using about 30,000 frames containing one person each. The person is moved and rotated on the sidewalk and on the street. The occlusion is determined by rendering a mask of the person and comparing it with the actual instance mask that takes occluding objects into account; a degree of 100% represents a fully occluded object. Figure 8 shows the results of this experiment, with each gray dot representing one frame and the colored curves showing regression plots for differently clothed persons.
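A simplified sketch of this occlusion computation from two binary masks (the names are illustrative):

```python
import numpy as np

def occlusion_rate(person_only_mask, visible_instance_mask):
    # person_only_mask: person rendered alone (all pixels it would cover without occluders).
    # visible_instance_mask: the person's pixels in the full scene, i.e., after occlusion.
    total = person_only_mask.sum()
    visible = np.logical_and(person_only_mask, visible_instance_mask).sum()
    return 1.0 - visible / total if total > 0 else 1.0
```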
The figure shows a downwards trend of the pAcc with increasing occlusion rate. The Detectron2 model (trained on Cityscapes) is comparatively more robust than DeeplabV3: Detectron2 offers stable detection for occlusion rates below 35%, after which the performance drops, whereas the performance of DeeplabV3 (also trained on Cityscapes) already drops beyond an occlusion rate of 15%. The curves do not follow a linear trend because other scene parameters (sunlight, shadow, direction of the pedestrian) are not constant across the rendered images.
What can also be seen in the figure is that, despite the trend of the regression curves, there is a great variation in the data, visible in the widely scattered gray points. This means that the performance also depends on factors other than the occlusion rate. Figure 8 shows one example of the analyses possible with the meta-data provided by our framework; more parameters are considered in our previous work [SGH20].
Data bias analysis: Another experiment we conducted considers failure cases, i.e., false negatives (FN) of the SSD 2D bounding box detector for pedestrian detection. To this end, we rendered 2640 images with our variational data synthesis engine. These images are then inferred by the SSD model and evaluated. Only pedestrians with a bounding box width greater than 0.1 \(\times \) image width and a height greater than 0.1 \(\times \) image height are considered valid for evaluation. Additionally, only objects with an occlusion rate below 25% are considered valid. These restrictions guarantee that the pedestrians in the validation are of sufficient size, i.e., close to the camera, and clearly visible due to little occlusion, and would therefore be easy to detect.
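A compact sketch of this validity filter (the structure of the per-object records is an assumption):

```python
def is_valid_for_evaluation(obj, img_w, img_h):
    # Keep only pedestrians that are large enough and only lightly occluded,
    # following the thresholds given in the text.
    return (obj["bbox_w"] > 0.1 * img_w and
            obj["bbox_h"] > 0.1 * img_h and
            obj["occlusion_rate"] < 0.25)
```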
With these restrictions in place, we found that of all the pedestrian assets the synthesis engine placed in the scenes, six assets were missed by the SSD model, as can be seen in Fig. 9.
Asset ID 1 is an Arabian woman wearing traditional clothes that effectively veil the person. Asset ID 2 is a Caucasian woman in casual summer clothing, i.e., short pants and short sleeves, revealing parts of her skin. The second asset of Arabian ethnicity, with ID 3, is, similar to asset 1, clothed in traditional veiling clothes but of male gender. The remaining assets 4, 5, and 6 are of male gender and Caucasian ethnicity and wear different work clothes, i.e., a blue paramedic outfit for ID 4, business casual jeans and jacket for ID 5, and white physician clothes for ID 6.
Asset ID 2, the woman in casual summer clothes, is mis-detected only a few times; in most cases the detection worked well, indicating no data bias for this asset. In contrast, the pedestrian asset ID 6, a physician dressed in white hospital clothing, has not been detected at all. Additionally, two of the assets that were overlooked relatively most often by the network are the Arabian-clothed woman with asset ID 1 and the Arabian-clothed man with ID 3. This result suggests that these kinds of pedestrian assets, i.e., IDs 1, 3, and 6, were not present in the data used for training the model, and that adding them would mitigate this exact failure case.

6 Outlook and Conclusions

This chapter has introduced a new generative data synthesis framework for the validation of machine learning-based perception functions. The approach allows a very flexible description of the scenes and parameters to be varied and a systematic testing of parameter variations in our unified validation parameter space.
The conducted experiments demonstrate the benefits of splitting the validation process into a scene variation that covers the randomized placement of objects, a variation of scene parameters, and a sensor simulation. Our simple probabilistic scene generator is scalable and able to produce scenes with a high number of different objects, as provided by an asset database. The spatial distribution of the positioned objects, as demonstrated for persons in Fig. 5, is more realistic compared to manually crafted 3D scenes. Together with our sensor simulation (results discussed in the chapter ‘Optimized Data Synthesis for DNN Training and Validation by Sensor Artifact Simulation’ [HG22]), we present a step towards closing the domain gap between synthetic and real data. Future work will continue to analyze the influence of other factors, such as rendering fidelity, scene complexity, and composition, to further improve the capabilities of the framework and make it even more applicable to the validation of real-world AI functions.
Our experiments with performance-limiting factors, as shown for occlusion rates and object size (as a function of the distance to the camera) in the previous section, give clear evidence that the performance of perception functions cannot be characterized by only a few factors. It is, rather, a complex function of many parameters and aspects, including scene complexity, scene lighting and weather conditions, and the sensor characteristics. The deep validation approach described in this chapter addresses this multi-dimensional complexity problem, and we designed a system and methodology for flexible validation strategies to span all these parameters at once.
Our validation parameterization, as demonstrated in the results section, is an effective way to detect performance problems in perception functions. Moreover, its flexible design allows sampling and practical computation at scale, enabling a deep exploration of the multivariate validation parameter space. We therefore see our system as a valuable tool for the validation of perception functions.
Moving forward we are looking into using the deep synthesis approach to implement sophisticated algorithms to support more complex validation strategies. As another key direction we target improvements in the computational efficiency of our validation approach, allowing coverage of more complexity and parameter dimensions.

Acknowledgements

The research leading to these results is funded by the German Federal Ministry for Economic Affairs and Energy within the project ‘Methoden und Maßnahmen zur Absicherung von KI-basierten Wahrnehmungsfunktionen für das automatisierte Fahren (KI Absicherung)’. The authors would like to thank the consortium for the successful cooperation.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Literature
[BB95]
W. Burger, M.J. Barth, Virtual reality for enhanced computer vision, in Virtual Prototyping: Virtual Environments and the Product Design Process, ed. by J. Rix, S. Haas, J. Teixeira (Springer, 1995), pp. 247–257
[COR+16]
M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The Cityscapes dataset for semantic urban scene understanding, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Las Vegas, NV, USA, 2016), pp. 3213–3223
[CPK+18]
L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, DeepLab: semantic image segmentation with deep convolutional nets, Atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 40(4), 834–848 (2018)
[CSVJR18]
A. Carlson, K.A. Skinner, R. Vasudevan, M. Johnson-Roberson, Modeling camera effects to improve visual learning from synthetic data, in Proceedings of the European Conference on Computer Vision (ECCV) Workshops (Munich, Germany, 2018), pp. 505–520
[CSVJR19]
A. Carlson, K.A. Skinner, R. Vasudevan, M. Johnson-Roberson, Sensor transfer: learning optimal sensor effect image augmentation for sim-to-real domain adaptation, pp. 1–8 (2019). arXiv:1809.06256
[DG18]
W. Damm, R. Galbas, Exploiting learning and scenario-based specification languages for the verification and validation of highly automated driving, in Proceedings of the IEEE/ACM International Workshop on Software Engineering for AI in Autonomous Systems (SEFAIAS) (Gothenburg, Sweden, 2018), pp. 39–46
[DKF20]
J. Devaranjan, A. Kar, S. Fidler, Meta-Sim2: learning to generate synthetic datasets, in Proceedings of the European Conference on Computer Vision (ECCV) (Virtual conference, 2020), pp. 715–733
[DRC+17]
A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, V. Koltun, CARLA: an open urban driving simulator, in Proceedings of the Conference on Robot Learning (CORL) (Mountain View, CA, USA, 2017), pp. 1–16
[Epi04]
Epic Games, Inc., Unreal Engine Homepage (2004). [Online; accessed 2021-11-18]
[EVGW+15]
M. Everingham, L.V. Gool, C.K.I. Williams, J. Winn, A. Zisserman, The pascal visual object classes challenge: a retrospective. Int. J. Comput. Vis. (IJCV) 111(1), 98–136 (2015)
[GKM+20]
J. Geyer, Y. Kassahun, M. Mahmudi, X. Ricou, R. Durgesh, A.S. Chung, L. Hauswald, V.H. Pham, M. Mühlegg, S. Dorn, T. Fernandez, M. Jänicke, S. Mirashi, C. Savani, M. Sturm, O. Vorobiov, M. Oelker, S. Garreis, P. Schuberth, A2D2: Audi autonomous driving dataset, pp. 1–10 (2020). arXiv:2004.06320
[HG21]
K. Hagn, O. Grau, Improved sensor model for realistic synthetic data generation, in Proceedings of the ACM Computer Science in Cars Symposium (CSCS) (Virtual Conference, 2021), pp. 1–9
[HG22]
K. Hagn, O. Grau, Optimized data synthesis for DNN training and validation by sensor artifact simulation, in Deep Neural Networks and Data for Automated Driving—Robustness, Uncertainty Quantification, and Insights Towards Safety, ed. by T. Fingscheidt, H. Gottschalk, S. Houben (Springer, 2022), pp. 149–170
[HZRS16]
K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Las Vegas, NV, USA, 2016), pp. 770–778
[JWKW18]
P. Junietz, W. Wachenfeld, K. Klonecki, H. Winner, Evaluation of different approaches to address safety validation of automated driving, in Proceedings of the IEEE Intelligent Transportation Systems Conference (ITSC) (Maui, HI, USA, 2018), pp. 491–496
[KP16]
N. Kalra, S.M. Paddock, Driving to safety: how many miles of driving would it take to demonstrate autonomous vehicle reliability? Transp. Res. Part A: Policy Pract. 94, 182–193 (2016)
[LAE+16]
W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A.C. Berg, SSD: single shot multibox detector, in Proceedings of the European Conference on Computer Vision (ECCV) (Amsterdam, The Netherlands, 2016), pp. 21–37
[LLFW20]
Z. Liu, T. Lian, J. Farrell, B. Wandell, Neural network generalization: the impact of camera parameters. IEEE Access 8, 10443–10454 (2020)
[LSD15]
J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Boston, MA, USA, 2015), pp. 3431–3440
[MBM18]
T. Menzel, G. Bagschik, M. Maurer, Scenarios for development, test and validation of automated vehicles, in Proceedings of the IEEE Intelligent Vehicles Symposium (IV) (Changshu, China, 2018), pp. 1821–1827
[RHK17]
S.R. Richter, Z. Hayder, V. Koltun, Playing for benchmarks, in Proceedings of the IEEE International Conference on Computer Vision (ICCV) (Venice, Italy, 2017), pp. 2232–2241
[RVRK16]
S.R. Richter, V. Vineet, S. Roth, V. Koltun, Playing for data: ground truth from computer games, in Proceedings of the European Conference on Computer Vision (ECCV) (Amsterdam, The Netherlands, 2016), pp. 102–118
[SAE18]
SAE International, SAE J3016: surface vehicle recommended practice—taxonomy and definitions for terms related to driving automation systems for on-road motor vehicles. SAE Int. (2018)
[SGH20]
Q.S. Sha, O. Grau, K. Hagn, DNN analysis through synthetic data variation, in Proceedings of the ACM Computer Science in Cars Symposium (CSCS) (Virtual Conference, 2020), pp. 1–10
[WEG+00]
B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, A. Sumner, et al., TORCS—The Open Racing Car Simulator (2000). [Online; accessed 2021-11-18]
[Wer94]
J. Wernecke, The Inventor Mentor: Programming Object-Oriented 3D Graphics With Open Inventor (Addison-Wesley, 1994)
[WKM+19]
Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, R. Girshick, Detectron2 (2019). [Online; accessed 2021-11-18]
[WPC20]
M. Wen, J. Park, K. Cho, A scenario generation pipeline for autonomous vehicle simulators. HCIS 10, 1–15 (2020)
[WU18]
M. Wrenninge, J. Unger, Synscapes: a photorealistic synthetic dataset for street scene parsing (2018)
[YLCT20]
E. Yurtsever, J. Lambert, A. Carballo, K. Takeda, A survey of autonomous driving: common practices and emerging technologies. IEEE Access 8, 58443–58469 (2020)
Metadata
Title
A Variational Deep Synthesis Approach for Perception Validation
Authors
Oliver Grau
Korbinian Hagn
Qutub Syed Sha
Copyright Year
2022
DOI
https://doi.org/10.1007/978-3-031-01233-4_13
