1 Introduction
1.1 Problem Statement and Research Questions
- RQ1: To what extent is it possible to identify safe and unsafe SDC test cases before executing them? Answering RQ1 is important to understand whether, and to what extent, test cases for SDCs can be classified before execution using only static input features (referred to as Road Characteristics). We investigate the use of ML models for classifying test cases and study their application in the context of lane keeping, a fundamental requirement in autonomous driving. Specifically, in testing lane-keeping systems, unsafe scenarios cause self-driving cars to depart from their lane (Gambi et al. 2019; Birchler et al. 2022, 2022c), and the input features describe the geometry of a road as a whole (i.e., Road Features).
- RQ2: Does SDC-Scissor improve the cost-effectiveness of simulation-based testing of SDCs? RQ2 investigates whether SDC-Scissor improves the cost-effectiveness of simulation-based testing of SDCs compared to baseline approaches. Specifically, we investigated whether SDC-Scissor reduces the time spent executing irrelevant (safe) tests without reducing testing effectiveness.
- RQ3: What is the actual upper bound on the precision and recall of ML techniques in identifying safe and unsafe SDC test cases when using static SDC features? In RQ1 and RQ2, we focused on the feasibility and cost-effectiveness of using SDC Road Characteristics as features for classifying SDC test cases before executing them. In RQ3, we explore a complementary aspect: whether there is an actual upper bound on the precision and recall of ML techniques when only static SDC features (available before executing the tests) are used. Having identified the best ML models for classifying safe and unsafe test cases against baseline approaches (in RQ1 and RQ2), we answer RQ3 by (i) designing additional SDC test case features, called Diversity Metrics, which are more complex than the simple road characteristics used to train the ML models in RQ1 and RQ2; and (ii) leveraging hyperparameter tuning strategies to find the optimal configurations of the most promising ML models observed in RQ1 and RQ2.
1.2 Summary of Results & Paper Contributions
- Selection of SDC test cases (RQ1): We investigated new methods for SDC test case selection. We first computed SDC features that characterize safe and unsafe test cases before executing them. We then introduced SDC-Scissor, which leverages ML models to support test case selection for SDCs and enhance testing cost-effectiveness.
- SDC-Scissor's Cost-effectiveness (RQ2): We compared the proposed approach against two distinct baseline approaches to demonstrate the testing cost-effectiveness of SDC-Scissor. The first is a random baseline that selects tests randomly. The second selects tests based on their road length, preferring test cases with long roads under the intuitive assumption that long roads have a higher probability of being unsafe.
- Offline vs. Real-time Training (RQ2): We investigated two opposite setups for SDC test case selection, leveraging ML models trained on offline data (i.e., a large static dataset) and on real-time data (i.e., dynamically generated tests).
- Upper bound of SDC static features (RQ3): We empirically investigated whether there is an actual upper bound on the precision and recall of ML techniques in identifying safe and unsafe SDC test cases when using static SDC features (available before executing the tests).
- Integration of SDC-Scissor in an Industrial Use Case (analysis detailed in Section 6): We integrated SDC-Scissor into the development context of the AICAS use case, demonstrating that the proposed tool can automate the testing process of such a large automotive company.
2 Background
2.1 CPS Simulation Technologies
2.2 Simulation-Based Testing of Lane Keeping Systems
2.3 Article Terminology
3 The SDC-Scissor Approach
3.1 SDC-Scissor Architecture Overview
- SDC-Test Generator: generates SDC simulation-based test cases.
- SDC-Test Executor: executes the tests and stores the test results (i.e., safe or unsafe labels) to allow training of the ML models.
- SDC-Features Extractor: extracts the input features from the SDC simulation-based test cases.
- SDC-Benchmarker: uses these features and the collected labels to train the selected ML models and determines which ML model best predicts the tests that are more likely to detect faults.
- SDC-Predictor: uses the trained ML models to classify newly generated test cases, thus achieving cost-effective SDC simulation-based testing via test selection.

3.2 SDC Test Case Features
| Feature | Description | Range |
|---|---|---|
| Direct Distance | Euclidean distance between start and finish (meters) | [0 – 489.9] |
| Length | Total length of the driving path (meters) | [50.6 – 3317.9] |
| Num L Turns | Number of left turns on the driving path | [0 – 18] |
| Num R Turns | Number of right turns on the driving path | [0 – 17] |
| Num Straight | Number of straight segments on the driving path | [0 – 11] |
| Total Angle | Cumulative turn angle on the driving path | [105 – 6420] |
| Median Angle | Median turn angle on the driving path | [30 – 330] |
| Std Angle | Standard deviation of turn angles on the driving path | [0 – 150] |
| Max Angle | Maximum turn angle on the driving path | [60 – 345] |
| Min Angle | Minimum turn angle on the driving path | [15 – 285] |
| Mean Angle | Average turn angle on the driving path | [52.5 – 307.5] |
| Median Radius | Median turn radius on the driving path | [7 – 47] |
| Std Radius | Standard deviation of turn radii on the driving path | [0 – 22.5] |
| Max Radius | Maximum turn radius on the driving path | [7 – 47] |
| Min Radius | Minimum turn radius on the driving path | [2 – 47] |
| Mean Radius | Average turn radius on the driving path | [5.3 – 47] |
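Several of the geometric features above can be derived directly from the sequence of road points. The following is a minimal pure-Python sketch, not the paper's implementation: the segmentation rules and the 5-degree threshold separating straight segments from turns are assumptions made for illustration.

```python
import math

def turn_angles(points):
    """Signed heading changes (degrees) between consecutive road points.

    Positive values correspond to left turns, negative values to right
    turns; angles are normalized to the interval [-180, 180).
    """
    angles = []
    for (x0, y0), (x1, y1), (x2, y2) in zip(points, points[1:], points[2:]):
        h1 = math.atan2(y1 - y0, x1 - x0)
        h2 = math.atan2(y2 - y1, x2 - x1)
        d = math.degrees(h2 - h1)
        angles.append((d + 180.0) % 360.0 - 180.0)
    return angles

def road_features(points, straight_threshold=5.0):
    """A few of the road characteristics listed above, computed from
    (x, y) road points; the threshold value is a hypothetical choice."""
    angles = turn_angles(points)
    return {
        "num_l_turns": sum(1 for a in angles if a > straight_threshold),
        "num_r_turns": sum(1 for a in angles if a < -straight_threshold),
        "total_angle": sum(abs(a) for a in angles),
        "length": sum(math.dist(p, q) for p, q in zip(points, points[1:])),
        "direct_distance": math.dist(points[0], points[-1]),
    }
```

A road bending once to the left, for instance, yields one left turn, zero right turns, and a direct distance shorter than the path length.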
To compute these features, we use Shapely (Sean 2022), an open-source Python library for geometric calculations. For each identified segment, we define a Shapely Polygon object that includes the road points and the line representing the direct segment line. All Shapely geometry classes provide a similar interface, including the computation of the area of a Shapely object: the previously constructed Polygon has a property called area. With this approach, we retrieve the area (referred to as diversity in our context) of each segment. On this basis, we calculate two additional features: (i) Full Road Diversity and (ii) Mean Road Diversity. As described in Table 3, Full Road Diversity is computed by summing the areas spanned by each segment of a road, whereas Mean Road Diversity is the mean of all segment areas of a single road. The main assumption behind these features is that the larger the spanned area, the more diverse the road and, therefore, the more likely it is to be unsafe.
| Feature | Description | Range |
|---|---|---|
| Full Road Diversity | The cumulative diversity of the full road, composed of all segments | \([0, \infty)\) |
| Mean Road Diversity | The mean diversity of the segments of a road | \([0, \infty)\) |
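The two diversity metrics can be sketched as follows. The paper obtains per-segment areas via Shapely's Polygon.area; this hedged sketch uses the equivalent shoelace formula in pure Python (closing the ring with the straight chord plays the role of the direct segment line), and it assumes the road has already been split into per-segment point lists.

```python
def polygon_area(points):
    """Shoelace formula: area of the ring formed by a segment's road points.

    The ring is implicitly closed by the straight line from the last point
    back to the first, i.e., the direct segment line described in the text.
    """
    n = len(points)
    if n < 3:
        return 0.0
    s = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def road_diversity(segments):
    """Full Road Diversity (sum of per-segment areas) and Mean Road Diversity."""
    areas = [polygon_area(seg) for seg in segments]
    full = sum(areas)
    return full, (full / len(areas) if areas else 0.0)
```

A perfectly straight segment encloses no area between curve and chord, so it contributes zero diversity, matching the intuition that curvy roads score higher.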
3.3 The SDC-Scissor’s Workflow
SDC-Scissor relies on an existing simulation infrastructure to execute the test cases (SDC-Test Executor). Likewise, it relies on existing test generation algorithms integrated with that infrastructure to automatically generate the test cases to optimize (SDC-Test Generator). Hence, SDC-Scissor can already be used to improve the cost-effectiveness of several test generators.

SDC-Scissor first uses SDC-Test Generator and SDC-Test Executor to collect the necessary data for training the ML models, i.e., labeled test cases; next, it relies on SDC-Benchmarker to determine the ML models that best classify the SDC test cases as safe or unsafe, as described below. Given a set of labeled test cases and the corresponding input features extracted by SDC-Features Extractor, SDC-Benchmarker trains and evaluates an ensemble of standard ML models using the well-established sklearn library. Next, it assesses each ML model's quality using K-fold cross-validation on the whole dataset. Finally, it identifies the best-performing ML models according to the Precision, Recall, and F-score metrics (Birchler et al. 2022) and outputs the best (trained) models as well as the features needed to operate them.

At prediction time, SDC-Scissor generates new test cases with SDC-Test Generator and utilizes SDC-Features Extractor to extract the necessary features. Finally, it invokes SDC-Predictor to classify test cases as safe or unsafe before executing them.

4 Study Design
4.1 SDC Test Cases Dataset Preparation
| Test Subject | Feature Set | Unsafe | Safe | Total |
|---|---|---|---|---|
| BeamNG.AI cautious | Full Road | 312 (26%) | 866 (74%) | 1'178 |
| BeamNG.AI moderate | Full Road | 2'543 (45%) | 3'095 (55%) | 5'638 |
| BeamNG.AI reckless | Full Road | 1'655 (96%) | 74 (4%) | 1'729 |
| Driver.AI | Full Road | 1'045 (19%) | 4'585 (81%) | 5'630 |
| Subtotal (Full Road) | | | | 14'175 |
| BeamNG.AI moderate | Road Segment | 2'543 (3%) | 72'433 (97%) | 74'976 |
| Driver.AI | Road Segment | 2'494 (3%) | 71'145 (97%) | 73'639 |
| Subtotal (Road Segment) | | | | 148'615 |
4.2 Research Method
- Machine Learning-based Experiments (RQ1): The first set of experiments investigates whether ML models trained with the selected SDC test case features can identify safe and unsafe test cases before their execution.
- Offline Experiments (RQ2): The second set of experiments investigates if, and by how much, SDC-Scissor improves the cost-effectiveness of SDC simulation-based testing compared to baseline approaches.
- Real-Time Experiments (RQ2): In these experiments, we train an adaptive model on data observed while executing the tests and compare it with a pre-trained model.
- Optimization Experiments (RQ3): The third set of experiments investigates how the performance of SDC-Scissor improves when adding new SDC features and tuning the ML models' hyperparameters. Specifically, in RQ3 we investigate whether there is an actual upper bound on the precision and recall achieved by ML techniques in identifying safe and unsafe SDC test cases when using static SDC features (available before executing the tests).
4.2.1 Machine Learning-based Experiments (RQ1)
| Dimension | Description | Dimension Configurations |
|---|---|---|
| Dataset | Using different datasets to train the model | BeamNG.AI (RF 1, 1.5, 2), Driver.AI, and combined datasets |
| Training Set | Changing the training set size by using different percentage splits for training and test sets | 40% training / 60% test; 50% / 50%; 60% / 40%; 80% / 20% |
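The percentage-split dimension above can be exercised with a small sklearn harness. This is a minimal sketch, assuming a feature matrix X and safe/unsafe labels y have already been extracted; the two models shown and the fixed random_state are illustrative choices, not the paper's exact configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

def benchmark(X, y, train_fractions=(0.4, 0.5, 0.6, 0.8)):
    """Evaluate models under the percentage splits from the table above.

    Returns precision/recall/F1 (unsafe = positive class, encoded as 1)
    for each (model, training fraction) combination.
    """
    models = {
        "Logistic": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(random_state=0),
    }
    results = {}
    for frac in train_fractions:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=frac, stratify=y, random_state=0)
        for name, model in models.items():
            model.fit(X_tr, y_tr)
            p, r, f1, _ = precision_recall_fscore_support(
                y_te, model.predict(X_te), average="binary")
            results[(name, frac)] = {"precision": p, "recall": r, "f1": f1}
    return results
```

Stratified splitting keeps the safe/unsafe ratio stable across training fractions, which matters given the class imbalance in some of the datasets above.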
4.2.2 Offline Experiments (RQ2)
Dataset | Number of safe tests | Number of unsafe tests |
---|---|---|
Complete Set | 3095 | 2543 |
Training Set | 2034 | 2034 |
Test Pool (95/5) | 1061 | 55 |
Test Pool (80/20) | 1061 | 265 |
Test Pool (60/40) | 763 | 509 |
Test Pool (30/70) | 218 | 509 |
4.2.3 Real-Time Experiments (RQ2)
- Pre-trained Model, for which we used the best-performing model identified during the Machine Learning-based Experiments (Section 5.1). We trained this model using the re-balanced dataset for BeamNG.AI RF 1.5, as this is the configuration of the test subject used for this experiment.
- Adaptive Model, for which we also used the best-performing model identified during the Machine Learning-based Experiments (Section 5.1), but trained with only 60 randomly generated test cases. After this initial training, we retrain the ML model after executing the predicted unsafe test cases, using the newly collected ground-truth labels for those test cases. Figure 6 illustrates this process. Notably, since the ML model may be inaccurate, this process collects both positive and negative labels.
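The adaptive retraining cycle described above can be sketched as follows. This is a hedged illustration, not the paper's implementation: execute_test stands in for the simulator as a hypothetical callback, and the 0/1 label encoding is an assumption.

```python
SAFE, UNSAFE = 0, 1  # assumed label encoding

def adaptive_loop(model, bootstrap_X, bootstrap_y, candidates, execute_test):
    """Train on a small bootstrap set, execute only predicted-unsafe
    candidates, and retrain after each execution with the observed label."""
    X, y = list(bootstrap_X), list(bootstrap_y)
    model.fit(X, y)
    executed = []
    for features in candidates:
        if model.predict([features])[0] == UNSAFE:
            label = execute_test(features)  # ground truth from the simulator
            X.append(features)
            y.append(label)
            model.fit(X, y)  # retrain with the newly collected label
            executed.append((features, label))
    return model, executed
```

Because mispredicted tests still get executed and labeled, the loop accumulates both safe and unsafe ground-truth examples over time, as noted in the text.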
| Metric | Description | Range |
|---|---|---|
| Number of Unsafe Test Executions | The number of unsafe tests the approach simulated during the experiment | 0 – N |
| Number of Safe Test Executions | The number of safe tests the approach simulated during the experiment | 0 – N |
| Time Allocation | The fraction of the total time spent on an action | 0 – 1 |
| True Positives/Negatives | Number of correct predictions for the safe and unsafe categories | 0 – number of predictions |
| False Positives/Negatives | Number of incorrect predictions for the safe and unsafe categories | 0 – number of predictions |
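The prediction-quality metrics in the table reduce to simple counts over the executed tests. A minimal sketch, assuming "unsafe" is treated as the positive class:

```python
def confusion_counts(y_true, y_pred, unsafe="unsafe"):
    """True/false positives and negatives, with unsafe as the positive class."""
    tp = sum(t == unsafe and p == unsafe for t, p in zip(y_true, y_pred))
    tn = sum(t != unsafe and p != unsafe for t, p in zip(y_true, y_pred))
    fp = sum(t != unsafe and p == unsafe for t, p in zip(y_true, y_pred))
    fn = sum(t == unsafe and p != unsafe for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def time_allocation(action_seconds, total_seconds):
    """Fraction of the total experiment time spent on one action (range 0-1)."""
    return action_seconds / total_seconds
```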
4.2.4 Optimization Experiments (RQ3)
- C (confidenceFactor): the confidence factor used for pruning; we experimented with the values [0.001, 0.01, 0.05, 0.1, 0.5].
- M (minNumObj): the minimum number of instances in a leaf; we experimented with the values [1, 10, 20, 50, 100].
- R (reducedErrorPruning): reduced error pruning is an alternative pruning algorithm that focuses on minimizing the statistical error of the tree; we experimented with the values [yes, no].
- S (subtreeRaising): a pruning method whereby a whole set of branches further down the tree is moved up to replace branches grown above it; we experimented with the values [yes, no].
- I (numIterations): the number of trees in the forest; we experimented with the values [5, 10, 100, 1000, 2000].
- K (numFeatures): the maximum number of features considered for splitting a node; we experimented with the values [0, 10, 100, 500, 1000].
- depth: the maximum depth of the tree (0 means unlimited); we experimented with the values [0, 5, 10, 20].
- M (minNumObj): the minimum number of instances in a leaf; we experimented with the values [1, 10, 20, 50, 100].
For Gradient Boosting:
- 'loss' = ['log_loss', 'deviance', 'exponential']
- 'learning_rate' = [0.01, 0.1, 0.2, 0.4]
- 'n_estimators' = [10, 100, 1000]
- 'criterion' = ['friedman_mse', 'squared_error', 'mse']

For Logistic Regression:
- 'penalty' = ['l1', 'l2', 'elasticnet', 'none']
- 'dual' = [True, False]
- 'max_iter' = [10, 100, 1000]
- 'solver' = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

For SVC:
- 'penalty' = ['l1', 'l2']
- 'loss' = ['hinge', 'squared_hinge']
- 'dual' = [True, False]
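Grids like those above can be explored exhaustively with sklearn's GridSearchCV. The following is a hedged sketch for the logistic-regression grid, assuming features X and labels y; the f1_weighted scoring choice is an assumption, and error_score=0.0 makes invalid penalty/solver combinations (e.g., l1 with lbfgs) score zero instead of aborting the search.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Grid taken from the values listed above ('none' is spelled None in
# recent sklearn releases).
logistic_grid = {
    "penalty": ["l1", "l2", "elasticnet", None],
    "dual": [True, False],
    "max_iter": [10, 100, 1000],
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
}

def tune(X, y, grid=None):
    """Cross-validated grid search returning the best parameters and score."""
    search = GridSearchCV(
        LogisticRegression(),
        grid if grid is not None else logistic_grid,
        scoring="f1_weighted",  # assumption: an F1-based selection criterion
        cv=5,
        error_score=0.0,  # skip invalid parameter combinations gracefully
    )
    search.fit(X, y)
    return search.best_params_, search.best_score_
```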
5 Results
5.1 Machine Learning-Based Experiments (RQ1)
5.1.1 Machine Learning-Based Experiments with Road Characteristics
| Model | Prec. (Unsafe) | Recall (Unsafe) | F1 (Unsafe) | Prec. (Safe) | Recall (Safe) | F1 (Safe) |
|---|---|---|---|---|---|---|
| BeamNG RF 1.5 | | | | | | |
| J48 | 69.2% | 67.4% | 68.2% | 61.5% | 63.5% | 62.5% |
| Naïve Bayes | 79.3% | 53.2% | 63.6% | 59.3% | 83.1% | 69.2% |
| Logistic | 78.1% | 65.3% | 71.1% | 64.8% | 77.8% | 70.7% |
| Random Forest | 75.8% | 62.7% | 68.6% | 62.5% | 75.6% | 68.4% |
| Driver.AI | | | | | | |
| J48 | 19.5% | 64.1% | 29.9% | 82.9% | 39.6% | 53.6% |
| Naïve Bayes | 20.3% | 78.5% | 32.3% | 85.8% | 29.8% | 44.2% |
| Logistic | 22.7% | 56.5% | 32.4% | 85.0% | 56.3% | 67.7% |
| Random Forest | 22.3% | 52.6% | 31.3% | 84.4% | 58.2% | 68.9% |
5.1.2 Analysis of Relevant Features
5.1.3 Impact of Risk Factor (RF)
5.1.4 Knowledge Transfer Between Different Driving Agents
5.2 Offline Experiments (RQ2)
5.2.1 FIX Experiment Results
Cost-effectiveness (percentage of failing tests in parentheses):

| Model | SDC-Scissor | Random baseline | RL baseline |
|---|---|---|---|
| Random Forest | 4.0 (80%) | 0.7419 (42.6%) | 1.5 (60%) |
| Gradient Boosting | 1.5 (60%) | 0.7419 (42.6%) | 1.5 (60%) |
| SVM | 0.6667 (40%) | 0.7419 (42.6%) | 1.5 (60%) |
| Naive Bayes | 0.6667 (40%) | 0.7419 (42.6%) | 1.5 (60%) |
| Logistic Regression | 4.0 (80%) | 0.7419 (42.6%) | 1.5 (60%) |
| Decision Tree | 0.4286 (30%) | 0.7419 (42.6%) | 1.5 (60%) |
5.2.2 REACH Experiment
| Model/Pool | Tests # (Safe) | Tests # (Unsafe) | Execution time |
|---|---|---|---|
| Smart Selector | | | |
| Test Pool (0.05/0.95) | 98.5 | 4664 | 375 |
| Test Pool (0.3/0.7) | 19 | 475 | 376 |
| Test Pool (0.5/0.5) | 14 | 214 | 389 |
| Test Pool (0.7/0.3) | 11 | 54 | 379 |
| Baseline | | | |
| Test Pool (0.05/0.95) | 171 | 8079 | 382 |
| Test Pool (0.3/0.7) | 35 | 1243 | 383 |
| Test Pool (0.5/0.5) | 18.5 | 439 | 391 |
| Test Pool (0.7/0.3) | 14 | 193 | 387 |
5.3 Real-Time Experiments (RQ2)
| Model | Acc. | Prec. (Unsafe) | Recall (Unsafe) | Prec. (Safe) | Recall (Safe) |
|---|---|---|---|---|---|
| Pre-trained Model | 72.1% | 65.2% | 82% | 81.2% | 64% |
| Real-time Model | 69% | 67.7% | 59.3% | 69.9% | 77% |
5.4 Optimization Experiments (RQ3)
| ML Technique | Param. Config. | F1 (Safe) | F1 (Unsafe) | Weighted avg. F1 |
|---|---|---|---|---|
| Random Forest | I = 5, K = 10, depth = 10, M = 50 | 35.1% | 72.4% | 57.8% |
| J48 | C = 0.5, M = 20 | 42.6% | 70.3% | 59.5% |
| Gradient Boosting | criterion = friedman_mse, learning_rate = 0.01, loss = log_loss, n_estimators = 10 | 77.0% | 0.0% | 48.0% |
| Logistic | dual = False, max_iter = 10, penalty = none, solver = saga | 76.0% | 12.0% | 52.0% |
| Naive Bayes | no parameters | 71.0% | 41.0% | 60.0% |
| SVC | dual = False, loss = squared_hinge, penalty = l2 | 76.0% | 28.0% | 58.0% |
| ML Technique | Prec. (Safe) | Prec. (Unsafe) | Recall (Safe) | Recall (Unsafe) | F1 (Safe) | F1 (Unsafe) |
|---|---|---|---|---|---|---|
| J48 | 49.8% | 65.4% | 76.0% | 37.1% | 42.6% | 70.3% |
| Naive Bayes | 66.0% | 47.0% | 75.0% | 37.0% | 71.0% | 41.0% |
6 Integration of SDC-Scissor in the Industrial Use Case
6.1 Experiments Involving an Industrial Use Case (AICAS)
- Increased level of test automation: Currently, AICAS inputs are manually generated or designed by testers and developers in the organization. Using an integrated framework such as SDC-Scissor enables the automatic generation of test cases, increasing the automation and diversity of the generated SDC scenarios.
- Increased level of realism: Most of the signals manually inserted into the CAN bus protocol by the testers and developers of the AICAS organization do not reflect a realistic set of driving signals (e.g., the provided acceleration and steering angle of the vehicle do not correspond to a real driving test scenario, which makes the inputs in most cases too random or unrealistic).
- SDC Test Case Generation and Storage (Steps 1-2): As visualized in Fig. 18, we first use SDC-Scissor to generate 3,559 SDC test cases (with BeamNG.AI, RF 1.5, i.e., moderate driving), execute them, and store the corresponding execution log in a JSON file (i.e., the simulation.full.json containing all information concerning the tests generated and executed by SDC-Scissor, see Fig. 18), which constitutes the dataset of our experiments.
- SDC Test Data Conversion & Generation of CAN Playback Data (Steps 3-5): In this stage, as visualized in Fig. 19, we convert the execution log from the JSON file (i.e., simulation.full.json generated by SDC-Scissor) to CAN Playback Data (i.e., the file simulation.canplayback.*).
- Transmission of CAN-based Signals (Step 6): The messages (i.e., the CAN Playback Data) generated in the previous step are then transmitted to the CAN device according to the defined timestamps, consistent with those generated by SDC-Scissor while executing the SDC test cases. Specifically, referring to the specified CAN database (i.e., < .dbc >), we converted SDC-Scissor test case data (i.e., < simulation.full.json >) to CAN messages (i.e., < simulation.canplayback.csv >). Using a specified CAN interface device, the logged CAN frames are played back to external CAN bus devices. These final steps allow us to send realistic SDC driving signals (i.e., SDC test cases generated by SDC-Scissor) to the CAN device in an automated fashion.
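At its core, the conversion in Steps 3-5 flattens the JSON execution log into timestamped rows that a CAN playback tool can replay. The following is a hedged, stdlib-only sketch; the record fields ("timestamp", "steering", "acceleration") and the row layout are hypothetical placeholders, since the actual simulation.full.json schema and the < .dbc >-based frame encoding are not reproduced here.

```python
import csv
import json

def json_to_canplayback(json_path, csv_path):
    """Flatten a list of simulation records into timestamped signal rows.

    Assumes the JSON file contains a list of dicts with a "timestamp" key
    and per-signal values; real CAN frame encoding via a .dbc file is out
    of scope for this sketch.
    """
    with open(json_path) as f:
        log = json.load(f)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "signal", "value"])
        for record in log:
            for signal in ("steering", "acceleration"):
                if signal in record:
                    writer.writerow([record["timestamp"], signal, record[signal]])
```

Preserving the original timestamps in the output is what lets the playback stage transmit the signals at a pace consistent with the simulated drive.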
6.2 Industrial Use Case (AICAS): Integration Results
| Property | Value |
|---|---|
| Nr. of SDC test cases generated by SDC-Scissor (BeamNG RF 1.5) | 3,559 |
| Total Simulation Time | 12 h 17 m 11 s |
| Average Simulation Time | 12.428 s |
| Max. Simulation Time | 21.4 s |

| Property | Value |
|---|---|
| Nr. of SDC test cases generated by SDC-Scissor (BeamNG RF 1.5) | 3,559 |
| Total time for conversion of messages + transmission of CAN signals | 52.391 s |
| Mean time for conversion + transmission (per SDC test case) | 14.721 ms |
| Min time for conversion + transmission (per SDC test case) | 7.892 ms |
| Max time for conversion + transmission (per SDC test case) | 30.006 ms |