
2020 | Book

Search-Based Software Engineering

12th International Symposium, SSBSE 2020, Bari, Italy, October 7–8, 2020, Proceedings


About this book

This book constitutes the refereed proceedings of the 12th International Symposium on Search-Based Software Engineering, SSBSE 2020, held in Bari, Italy, in October 2020.

The 13 research papers and 5 short papers presented together with 1 keynote were carefully reviewed and selected from 34 submissions. SBSE is a research area focused on the formulation of software engineering problems as search problems, and the subsequent use of complex heuristic techniques to attain optimal solutions to such problems. A wealth of engineering challenges, from test generation to design refactoring to process organization, can be solved efficiently through the application of automated optimization techniques. SBSE is a growing field, sitting at the crossroads between AI, machine learning, and software engineering, and SBSE techniques have begun to attain human-competitive results.

Due to the COVID-19 pandemic, SSBSE 2020 was held as a virtual event.

Table of Contents

Frontmatter

Keynote

Frontmatter
Search-Based Software Testing for Formal Software Verification – and Vice Versa
Abstract
“Testing can be quite effective for showing the presence of bugs, but is hopelessly inadequate for showing their absence”. This famous remark, which was made by Dijkstra, has often been used to indicate a dichotomy between testing and verification. From a practitioner’s point of view, however, there is not much difference in the ways testing and verification techniques may be used in practice. While engineers would try to demonstrate that their systems are correct (verification), they often find themselves in a situation where they have to prioritize bug finding (testing). As a result, the choice to go for formal verification versus testing is largely driven by practical needs and the specifics of the context. In this keynote, I will focus on search-based software testing (SBST) and review some recent research that combines ideas from the SBST and the formal verification communities to improve the analysis of models of cyber-physical systems (CPS). I will present an empirical study that compares CPS model testing and verification, a search-based testing approach for compute-intensive CPS models that builds on a well-known formal verification framework, and a technique to automatically generate formal environment assumptions for CPS models using search algorithms and genetic programming.
Shiva Nejati

Research Papers

Frontmatter
Automated Unit Test Generation for Python
Abstract
Automated unit test generation is an established research field, and mature test generation tools exist for statically typed programming languages such as Java. It is, however, substantially more difficult to automatically generate supportive tests for dynamically typed programming languages such as Python, due to the lack of type information and the dynamic nature of the language. In this paper, we describe a foray into the problem of unit test generation for dynamically typed languages. We introduce Pynguin, an automated unit test generation framework for Python. Using Pynguin, we aim to empirically shed light on two central questions: (1) Do well-established search-based test generation methods, previously evaluated only on statically typed languages, generalise to dynamically typed languages? (2) What is the influence of incomplete type information and dynamic typing on the problem of automated test generation? Our experiments confirm that evolutionary algorithms can outperform random test generation in the context of Python as well, and can even alleviate the problem of absent type information to some degree. However, our results demonstrate that dynamic typing nevertheless poses a fundamental issue for test generation, suggesting future work on integrating type inference.
Stephan Lukasczyk, Florian Kroiß, Gordon Fraser
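
To illustrate why missing type information complicates automated test generation for Python, here is a minimal, hypothetical sketch of a random test generator that has to guess argument types. The function `clamp` and the value generators are invented for illustration and are not part of Pynguin.

```python
import random

# Hypothetical function under test: no type annotations, as is common in Python.
def clamp(value, lo, hi):
    if value < lo:
        return lo
    if value > hi:
        return hi
    return value

# Without type information, a generator can only guess plausible argument types.
GENERATORS = [
    lambda: random.randint(-100, 100),
    lambda: random.uniform(-100.0, 100.0),
    lambda: random.choice(["", "abc", "0"]),
    lambda: None,
]

def random_test_case(func, arity):
    """Sample one call with randomly typed arguments; record result or exception."""
    args = [random.choice(GENERATORS)() for _ in range(arity)]
    try:
        return args, func(*args), None
    except Exception as exc:  # type errors are frequent when argument types are guessed
        return args, None, exc

if __name__ == "__main__":
    random.seed(0)
    for _ in range(5):
        print(random_test_case(clamp, 3))
```

Many of the sampled calls simply raise a TypeError, which is exactly the wasted effort that type inference or evolutionary guidance tries to reduce.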
Do Quality Indicators Prefer Particular Multi-objective Search Algorithms in Search-Based Software Engineering?
Abstract
In Search-Based Software Engineering (SBSE), users typically select a set of Multi-Objective Search Algorithms (MOSAs) for their experiments without any justification, or they simply choose an MOSA because of its popularity (e.g., NSGA-II). On the other hand, users know certain characteristics of solutions they are interested in. Such characteristics are typically measured with Quality Indicators (QIs) that are commonly used to evaluate the quality of solutions produced by an MOSA. Consequently, these QIs are often employed to empirically evaluate a set of MOSAs for a particular search problem to find the best MOSA. Thus, to guide SBSE users in choosing an MOSA that represents the solutions measured by a specific QI they are interested in, we present an empirical evaluation with a set of SBSE problems to study the relationships among commonly used QIs and MOSAs in SBSE. Our aim, by studying such relationships, is to identify whether a QI has certain characteristics that make it prefer a certain MOSA. Such preferences are then used to provide insights and suggestions to SBSE users in selecting an MOSA, given that they know which quality aspects of solutions they are looking for.
Shaukat Ali, Paolo Arcaini, Tao Yue
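
As a concrete example of a quality indicator, the sketch below computes the two-objective hypervolume of a front of minimization solutions with respect to a reference point. This is one commonly used QI, not necessarily one of the specific indicators studied in the paper, and the fronts are made up.

```python
def hypervolume_2d(front, reference):
    """Hypervolume of a two-objective minimization front w.r.t. a reference point.

    Larger values indicate a front that dominates more of the objective space.
    """
    # Keep only points strictly better than the reference, sorted by the first objective.
    points = sorted(p for p in front if p[0] < reference[0] and p[1] < reference[1])
    volume, prev_f2 = 0.0, reference[1]
    for f1, f2 in points:
        if f2 < prev_f2:  # skip points dominated by an earlier (smaller f1) point
            volume += (reference[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return volume

# Two candidate fronts produced by two hypothetical MOSAs on the same problem.
print(hypervolume_2d([(1, 3), (2, 1)], reference=(4, 4)))  # 7.0
print(hypervolume_2d([(3, 3)], reference=(4, 4)))          # 1.0
```

Comparing such indicator values across algorithms is how a QI ends up "preferring" one MOSA over another.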
It Is Not Only About Control Dependent Nodes: Basic Block Coverage for Search-Based Crash Reproduction
Abstract
Search-based techniques have been widely used for white-box test generation. Many of these approaches rely on the approach level and branch distance heuristics to guide the search process and generate test cases with high line and branch coverage. Despite the positive results achieved by these two heuristics, they only use the information related to the coverage of explicit branches (e.g., indicated by conditional and loop statements), but ignore potential implicit branchings within basic blocks of code. If such implicit branching happens at runtime (e.g., if an exception is thrown in a branchless method), the existing fitness functions cannot guide the search process. To address this issue, we introduce a new secondary objective, called Basic Block Coverage (BBC), which takes into account the coverage level of relevant basic blocks in the control flow graph. We evaluated the impact of BBC on search-based crash reproduction because implicit branches commonly occur when trying to reproduce a crash, and the search process needs to cover only a few basic blocks (i.e., blocks that are executed before the crash happens). We combined BBC with existing fitness functions (namely STDistance and WeightedSum) and ran our evaluation on 124 hard-to-reproduce crashes. Our results show that BBC, in combination with STDistance and WeightedSum, reproduces 6 and 1 new crashes, respectively. BBC significantly decreases the time required to reproduce 26.6% and 13.7% of the crashes using STDistance and WeightedSum, respectively. For these crashes, BBC reduces the consumed time by 44.3% (for STDistance) and 40.6% (for WeightedSum) on average.
Pouria Derakhshanfar, Xavier Devroey, Andy Zaidman
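
For context, the branch-distance heuristic mentioned above is typically defined per relational predicate. The sketch below shows a textbook formulation of that heuristic (not the BBC secondary objective itself): a distance of zero means the condition holds, and smaller positive values mean the search is closer to satisfying it.

```python
K = 1.0  # constant added when a condition is false, so the distance never reaches zero

def branch_distance(op, lhs, rhs):
    """Textbook branch-distance heuristic for a relational predicate `lhs op rhs`.

    Returns 0 when the condition already holds, otherwise a value that shrinks as
    the operands get closer to satisfying it, giving the search a gradient to follow.
    """
    if op == "==":
        return 0.0 if lhs == rhs else abs(lhs - rhs) + K
    if op == "!=":
        return 0.0 if lhs != rhs else K
    if op == "<":
        return 0.0 if lhs < rhs else (lhs - rhs) + K
    if op == "<=":
        return 0.0 if lhs <= rhs else (lhs - rhs) + K
    raise ValueError(f"unsupported operator: {op}")

# How far is x = 10 from taking the true branch of `if x < 3`?
print(branch_distance("<", 10, 3))  # 8.0
```

An implicit branch (e.g., an exception inside a branchless method) has no such predicate to measure, which is the gap BBC is designed to fill.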
Search-Based Testing for Scratch Programs
Abstract
Block-based programming languages enable young learners to quickly implement fun programs and games. The Scratch programming environment is particularly successful at this, with more than 50 million registered users at the time of this writing. Although Scratch simplifies creating syntactically correct programs, learners and educators nevertheless frequently require feedback and support. Dynamic program analysis could enable automation of this support, but the test suites necessary for dynamic analysis do not usually exist for Scratch programs. It is, however, possible to cast test generation for Scratch as a search problem. In this paper, we introduce an approach for automatically generating test suites for Scratch programs using grammatical evolution. The use of grammatical evolution clearly separates the search encoding from framework-specific implementation details, and allows us to use advanced test acceleration techniques. We implemented our approach as an extension of the Whisker test framework. Evaluation on sample Scratch programs demonstrates the potential of the approach.
Adina Deiner, Christoph Frädrich, Gordon Fraser, Sophia Geserer, Niklas Zantner
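
The core of grammatical evolution is the mapping from an integer genotype to a derivation of the grammar. The toy sketch below illustrates that mapping for a made-up Scratch-like test grammar; the grammar, action names, and wrapping rule are illustrative and not Whisker's actual encoding.

```python
import random

# Toy grammar: a test is a sequence of Scratch-like actions (names are illustrative).
GRAMMAR = {
    "<test>": [["<action>"], ["<action>", "<test>"]],
    "<action>": [["click_sprite()"], ["press_key('space')"], ["wait(1)"]],
}

def derive(genotype, start="<test>", max_wraps=2):
    """Grammatical-evolution mapping: consume integer codons to choose productions."""
    codons = list(genotype) * (max_wraps + 1)  # wrap the genotype if it runs out
    stack, output, i = [start], [], 0
    while stack and i < len(codons):
        symbol = stack.pop(0)
        if symbol in GRAMMAR:
            rules = GRAMMAR[symbol]
            chosen = rules[codons[i] % len(rules)]  # the codon selects the production
            i += 1
            stack = chosen + stack
        else:
            output.append(symbol)  # terminal symbol: emit it
    return output  # may be incomplete if all (wrapped) codons are used up

random.seed(1)
genotype = [random.randrange(256) for _ in range(10)]
print(derive(genotype))  # a derived sequence of Scratch-like test actions
```

Because the search operates on the integer genotype only, the encoding stays independent of the framework that eventually executes the derived actions.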
Solving Schedulability as a Search Space Problem with Decision Diagrams
Abstract
Real-time system design involves proving the schedulability of a set of tasks with hard timing and other constraints that should run on one or several cores. When those requirements are known at design time, it is possible to compute a fixed scheduling of tasks before deployment. This approach avoids the overhead induced by an online scheduler and allows the designer to verify the schedulability of the taskset design under normal and degraded conditions, such as core failures. In this context, we propose to solve the schedulability problem as a state space exploration problem. We represent the schedulings as partial functions that map each task to a core and a point in time. Partial functions can be efficiently encoded using a new variant of decision diagrams, called Map-Family Decision Diagrams (MFDDs). Our setting allows first to create the MFDD of all possible schedulings and then apply homomorphic operations directly on it, in order to obtain the schedulings that respect the constraints of the taskset.
Dimitri Racordon, Aurélien Coet, Emmanouela Stachtiari, Didier Buchs
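
To make the search-space view concrete, the sketch below enumerates schedulings as partial functions that map each task to a core and a start time, and filters out those violating deadlines or overlapping on a core. This explicit enumeration is what an MFDD would represent symbolically; the taskset and constraints are invented for illustration.

```python
from itertools import product

# Toy taskset: task -> (duration, deadline); two cores; discrete start times.
TASKS = {"t1": (2, 4), "t2": (1, 3), "t3": (2, 6)}
CORES = (0, 1)
STARTS = range(0, 6)

def feasible(schedule):
    """A scheduling maps each task to (core, start); check deadlines and overlaps."""
    for task, (core, start) in schedule.items():
        duration, deadline = TASKS[task]
        if start + duration > deadline:
            return False
        for other, (core2, start2) in schedule.items():
            if other != task and core2 == core:
                duration2 = TASKS[other][0]
                if start < start2 + duration2 and start2 < start + duration:
                    return False  # the two tasks overlap on the same core
    return True

# Explicit state-space exploration (an MFDD encodes this set of functions symbolically
# and applies the constraints as homomorphic operations instead of a Python filter).
candidates = product(*[[(t, (c, s)) for c in CORES for s in STARTS] for t in TASKS])
valid = [dict(c) for c in candidates if feasible(dict(c))]
print(f"{len(valid)} feasible schedulings out of "
      f"{(len(CORES) * len(STARTS)) ** len(TASKS)} candidates")
```

The exponential number of candidates is exactly why a symbolic encoding such as MFDDs pays off for realistic tasksets.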
Transforming Interactive Multi-objective Metamodel/Model Co-evolution into Mono-objective Search via Designer’s Preferences Extraction
Abstract
The simultaneous evolution of metamodels and models is called the metamodel/model co-evolution problem. While some interactive/automated metamodel/model co-evolution techniques have been proposed using multi-objective search, designers still need to explore a large number of possible revised models. In this paper, we propose an approach to convert multi-objective search into a mono-objective one after interacting with the designer to identify a set of model changes based on their preferences. The first step consists of using a multi-objective search to generate different possible model edit operations by finding a trade-off between three objectives. Then, the designer may give feedback on some proposed solutions. The extracted preferences are used to transform the multi-objective search into a mono-objective one by generating an evaluation function based on weights for the existing fitness functions that are automatically computed from the feedback. Thus, the designer will interact with only one solution generated by the mono-objective search. We evaluated our approach on a set of metamodel/model co-evolution case studies and compared it to existing fully automated and interactive metamodel/model co-evolution techniques. The results show that the mono-objective search after the interaction with the users significantly improved the co-evolution changes for several widely used metamodels.
Wael Kessentini, Vahid Alizadeh
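
The conversion described above amounts to a weighted scalarization of the objective vector, with weights derived from the designer's feedback. The following sketch shows that idea with a deliberately simple weighting scheme; the actual weight computation in the paper differs, and the objective values are made up.

```python
def weights_from_feedback(liked_solutions):
    """Toy preference extraction for minimization objectives: objectives on which the
    designer-approved solutions score well (low values) receive higher weight."""
    n_objectives = len(liked_solutions[0])
    averages = [sum(sol[i] for sol in liked_solutions) / len(liked_solutions)
                for i in range(n_objectives)]
    raw = [1.0 / (1.0 + avg) for avg in averages]
    total = sum(raw)
    return [r / total for r in raw]

def mono_objective(objectives, weights):
    """Collapse a multi-objective fitness vector into a single value to minimize."""
    return sum(w * f for w, f in zip(weights, objectives))

# The designer liked solutions with few model changes (objective 0) but tolerated a
# larger deviation (objective 1) and more inconsistencies (objective 2).
liked = [(2, 9, 5), (1, 8, 6)]
weights = weights_from_feedback(liked)
print(weights)                             # highest weight on objective 0
print(mono_objective((3, 4, 2), weights))  # scalar fitness of a new candidate
```

Once the weights are fixed, any single-objective search algorithm can continue the optimization and present one solution to the designer.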
Evolutionary Grammar-Based Fuzzing
Abstract
A fuzzer provides randomly generated inputs to a targeted software to expose erroneous behavior. To efficiently detect defects, generated inputs should conform to the structure of the input format, and thus grammars can be used to generate syntactically correct inputs. In this context, fuzzing can be guided by probabilities attached to competing rules in the grammar, leading to the idea of probabilistic grammar-based fuzzing. However, the optimal assignment of probabilities to individual grammar rules to effectively expose erroneous behavior for individual systems under test is an open research question. In this paper, we present EvoGFuzz, an evolutionary grammar-based fuzzing approach to optimize the probabilities to generate test inputs that may be more likely to trigger exceptional behavior. The evaluation shows the effectiveness of EvoGFuzz in detecting defects compared to probabilistic grammar-based fuzzing (baseline). Applied to ten real-world applications with common input formats (JSON, JavaScript, or CSS3), the evaluation shows that EvoGFuzz achieved a significantly larger median line coverage for all subjects, by up to 48% compared to the baseline. Moreover, EvoGFuzz managed to expose 11 unique defects, of which five were not detected by the baseline.
Martin Eberlein, Yannic Noller, Thomas Vogel, Lars Grunske
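
To show what probabilistic grammar-based generation looks like, here is a minimal sketch that samples JSON-like inputs from a tiny probabilistic grammar. The grammar, probabilities, and depth cut-off are invented; EvoGFuzz's contribution is evolving such probabilities rather than fixing them by hand.

```python
import random

# A tiny probabilistic grammar for JSON-like values: each production carries a
# probability. EvoGFuzz evolves such probabilities; this grammar is made up.
GRAMMAR = {
    "<value>": [(["<object>"], 0.3), (["<number>"], 0.5), (["true"], 0.1), (["null"], 0.1)],
    "<object>": [(['{"k": ', "<value>", "}"], 1.0)],
    "<number>": [(["0"], 0.4), (["1"], 0.3), (["-", "<number>"], 0.3)],
}

def generate(symbol="<value>", depth=0, max_depth=8):
    """Expand a symbol by sampling productions according to their probabilities."""
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol
    rules = GRAMMAR[symbol]
    if depth >= max_depth:  # prefer non-recursive productions once the input is deep
        terminal_only = [r for r in rules if all(s not in GRAMMAR for s in r[0])]
        rules = terminal_only or rules
    productions, probabilities = zip(*rules)
    chosen = random.choices(productions, weights=probabilities, k=1)[0]
    return "".join(generate(s, depth + 1, max_depth) for s in chosen)

random.seed(3)
print([generate() for _ in range(5)])
```

Shifting probability mass toward productions that previously triggered exceptions is the evolutionary step layered on top of this generator.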
Commonality-Driven Unit Test Generation
Abstract
Various search-based test generation techniques have been proposed to automate the generation of unit tests fulfilling different criteria (e.g., line coverage, branch coverage, mutation score, etc.). Despite several advances made over the years, search-based unit test generation still suffers from a lack of guidance due to the limited amount of information available in the source code that, for instance, hampers the generation of complex objects. Previous studies introduced many strategies to address this issue, e.g., dynamic symbolic execution or seeding, but do not take the internal execution of the methods into account. In this paper, we introduce a novel secondary objective called commonality score, measuring how close the execution path of a test case is to reproducing a common or uncommon execution pattern observed during the operation of the software. To assess the commonality score, we implemented it in EvoSuite and evaluated its application on 150 classes from JabRef, an open-source software for managing bibliography references. Our results are mixed. Our approach leads to test cases that indeed follow common or uncommon execution patterns. However, while the commonality score can have a positive impact on the structural coverage and mutation score of the generated test suites, it can also be detrimental in some cases.
Björn Evers, Pouria Derakhshanfar, Xavier Devroey, Andy Zaidman
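
One simplified reading of such a score is the weighted overlap between the code elements a test executes and an operational profile of how often those elements are exercised in practice. The sketch below uses made-up JabRef-flavoured method names and is not EvoSuite's actual formula.

```python
def commonality_score(executed, weights):
    """Commonality of one execution: weighted fraction of covered code elements,
    where each element's weight reflects how often it is exercised in operation.
    (A simplified illustration of the secondary objective, not the paper's formula.)"""
    total = sum(weights.values()) or 1.0
    return sum(weights.get(element, 0.0) for element in set(executed)) / total

# Operational profile: relative execution frequency of each method, e.g. mined from logs.
profile = {"parseEntry": 0.6, "formatAuthor": 0.3, "exportBibtex": 0.1}
common_test = ["parseEntry", "formatAuthor"]
uncommon_test = ["exportBibtex"]
print(commonality_score(common_test, profile))    # 0.9 -> follows a common path
print(commonality_score(uncommon_test, profile))  # 0.1 -> exercises rare behaviour
```

Maximizing or minimizing this value as a secondary objective steers the search toward common or uncommon execution patterns, respectively.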
Using a Genetic Algorithm to Optimize Configurations in a Data-Driven Application
Abstract
Users of highly-configurable software systems often want to optimize a particular objective such as improving a functional outcome or increasing system performance. One approach is to use an evolutionary algorithm. However, many applications today are data-driven, meaning they depend on inputs or data which can be complex and varied. Hence, a search needs to be run (and re-run) for all inputs, making optimization a heavy-weight and potentially impractical process. In this paper, we explore this issue on a data-driven highly-configurable scientific application. We build an exhaustive database containing 3,000 configurations and 10,000 inputs, leading to almost 100 million records as our oracle, and then run a genetic algorithm individually on each of the 10,000 inputs. We ask (1) whether a genetic algorithm can find configurations that improve functional objectives; (2) whether patterns of best configurations emerge over all input data; and (3) whether we can use sampling to approximate the results. We find that the original (default) configuration is best only 34% of the time, while clear patterns emerge of other best configurations. Out of 3,000 possible configurations, only 112 distinct configurations achieve the optimal result at least once across all 10,000 inputs, suggesting the potential for lighter weight optimization approaches. We show that sampling of the input data finds similar patterns at a lower cost.
Urjoshi Sinha, Mikaela Cashman, Myra B. Cohen
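
For illustration, a bare-bones genetic algorithm over a small configuration space might look like the sketch below. The options, the data-driven objective `run_app`, and all parameters are invented and far simpler than the studied application.

```python
import random

# Hypothetical configuration space of a data-driven application.
OPTIONS = {"threshold": [0.1, 0.5, 0.9], "mode": ["fast", "exact"], "window": [1, 5, 10]}

def run_app(config, data):
    """Stand-in for running the application on one input and scoring the outcome."""
    return -abs(config["threshold"] * config["window"] - data) - (config["mode"] == "fast") * 0.1

def crossover(a, b):
    return {key: random.choice([a[key], b[key]]) for key in OPTIONS}

def mutate(config):
    key = random.choice(list(OPTIONS))
    return {**config, key: random.choice(OPTIONS[key])}

def genetic_search(data, pop_size=10, generations=20):
    population = [{k: random.choice(v) for k, v in OPTIONS.items()} for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda c: run_app(c, data), reverse=True)
        parents = population[: pop_size // 2]  # elitist truncation selection
        children = [mutate(crossover(*random.sample(parents, 2))) for _ in parents]
        population = parents + children
    return max(population, key=lambda c: run_app(c, data))

random.seed(7)
print(genetic_search(data=4.2))  # best configuration found for this single input
```

The paper's observation is that running such a search separately for every one of 10,000 inputs is expensive, which motivates looking for recurring best configurations and sampling instead.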
Measuring and Maintaining Population Diversity in Search-Based Unit Test Generation
Abstract
Genetic algorithms (GAs) have been demonstrated to be effective at generating unit tests. However, GAs often suffer from a loss of population diversity, which causes the search to prematurely converge, thus negatively affecting the resulting code coverage. One way to prevent premature convergence is to maintain and increase population diversity. Although the impact of population diversity on the performance of GAs is well-studied in the literature, little attention has been given to population diversity in unit test generation. We study how maintaining population diversity influences the Many-Objective Sorting Algorithm (MOSA), a state-of-the-art evolutionary search algorithm for generating unit tests. We define three diversity measures based on fitness entropy, test executions (phenotypic diversity), and Java statements (genotypic diversity). To improve diversity, we apply common methods that fall into two groups: niching (such as fitness sharing and clearing) and non-niching (such as diverse initial populations). Our results suggest that increasing diversity does not have a beneficial effect on coverage in general, but it may improve coverage once the search stagnates.
Nasser Albunian, Gordon Fraser, Dirk Sudholt
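
Of the three measures, fitness entropy is the easiest to sketch: bucket the population's fitness values and compute the entropy of the resulting distribution. The binning scheme below is a generic illustration, not necessarily the paper's exact definition.

```python
import math
from collections import Counter

def fitness_entropy(fitness_values, bins=10):
    """Entropy of the population's fitness distribution (higher = more diverse)."""
    lo, hi = min(fitness_values), max(fitness_values)
    if hi == lo:
        return 0.0  # every individual has the same fitness: fully converged
    width = (hi - lo) / bins
    counts = Counter(min(int((f - lo) / width), bins - 1) for f in fitness_values)
    n = len(fitness_values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(fitness_entropy([0.2] * 20))                    # 0.0  -> converged population
print(fitness_entropy([i / 20 for i in range(20)]))   # ~3.3 -> diverse population
```

Phenotypic and genotypic diversity are measured analogously, but over test executions and Java statements rather than fitness values.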

New Ideas and Emerging Results

Frontmatter
Search@Home: A Commercial Off-the-Shelf Environment for Investigating Optimization Problems
Abstract
Search heuristics, particularly those that are evaluation-driven (e.g., evolutionary computation), are often performed in simulation, enabling exploration of large solution spaces. Yet simulation may not truly replicate real-world conditions. However, search heuristics have been proven to be successful when executed in real-world constrained environments that limit searching ability even with broad solution spaces. Moreover, searching in situ provides the added benefit of exposing the search heuristic to the exact conditions and uncertainties that the deployed application will face. Software engineering problems can benefit from in situ search via instantiation and analysis in real-world environments. This paper introduces Search@Home, an environment comprising heterogeneous commercial off-the-shelf devices enabling rapid prototyping of optimization strategies for real-world problems.
Erik M. Fredericks, Jared M. Moore

Replications and Negative Results

Frontmatter
Exploring the Use of Genetic Algorithm Clustering for Mobile App Categorisation
Abstract
Search-based approaches have been successfully used as clustering algorithms in several domains. However, little research has looked into their effectiveness for clustering tasks commonly faced in Software Engineering (SE). This short replication paper presents a preliminary investigation into the use of a Genetic Algorithm (GA) for the problem of mobile application categorisation. Our results show the feasibility of GA-based clustering for this task, which we hope will foster new avenues for Search-Based Software Engineering (SBSE) research in this area.
Afnan A. Al-Subaihin, Federica Sarro
Impact of Test Suite Coverage on Overfitting in Genetic Improvement of Software
Abstract
Genetic Improvement (GI) uses automated search to improve existing software. It can be used to improve runtime or energy consumption, to fix bugs, and to improve any other software property, provided that the property can be encoded into a fitness function. GI usually relies on testing to check whether the changes disrupt the intended functionality of the software, which makes test suites important artefacts for the overall success of GI. The objective of this work is to establish which characteristics of the test suites correlate with the effectiveness of GI. We hypothesise that different test suite properties may have different levels of correlation to the ratio between overfitting and non-overfitting patches generated by the GI algorithm. In order to test our hypothesis, we perform a set of experiments with automatically generated test suites using EvoSuite and 4 popular coverage criteria. We used these test suites as input to a GI process and collected the patches generated throughout such a process. We find that while test suite coverage has an impact on the ability of GI to produce correct patches, with branch coverage leading to the least overfitting, the overfitting rate was still significant. We also compared automatically generated tests with manual, developer-written ones and found that while manual tests had lower coverage, the GI runs with manual tests led to less overfitting than in the case of automatically generated tests. Finally, we did not observe enough statistically significant correlations between the coverage metrics and overfitting ratios of patches, i.e., the coverage of test suites cannot be used as a linear predictor for the level of overfitting of the generated patches.
Mingyi Lim, Giovani Guizzo, Justyna Petke
Bet and Run for Test Case Generation
Abstract
Anyone working in the technology sector is probably familiar with the question: “Have you tried turning it off and on again?”, as this is usually the default question asked by tech support. Similarly, it is known in search-based testing that metaheuristics might get trapped in a plateau during a search. As a human, one can look at the gradient of the fitness curve and decide to restart the search, so as to hopefully improve the results of the optimization with the next run. Trying to automate such a restart, it has to be programmatically decided whether the metaheuristic has encountered a plateau yet, which is an inherently difficult problem. To mitigate this problem in the context of theoretical search problems, the Bet and Run strategy was developed, where multiple algorithm instances are started concurrently, and after some time all but the single most promising instance in terms of fitness values are killed. In this paper, we adopt and evaluate the Bet and Run strategy for the problem of test case generation. Our work indicates that use of this restart strategy does not generally lead to gains in the quality metrics, when instantiated with the best parameters found in the literature.
Sebastian Müller, Thomas Vogel, Lars Grunske
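
A minimal sketch of the strategy is shown below, with a toy hill-climbing search standing in for the actual test-case generator and an invented budget split; the real evaluation uses search-based test generation tools and parameters from the literature.

```python
import random

def bet_and_run(new_instance, step, n_instances=4, bet_steps=50, total_steps=1000):
    """Bet-and-run: run several search instances for a short betting phase, then
    spend the remaining budget only on the single most promising instance."""
    instances = [new_instance() for _ in range(n_instances)]
    for _ in range(bet_steps):                                 # betting phase
        instances = [step(state) for state in instances]
    best = max(instances, key=lambda state: state["fitness"])  # keep the best bet
    for _ in range(total_steps - n_instances * bet_steps):     # remaining budget
        best = step(best)
    return best

# Toy search instance: a hill climber on a random-walk fitness landscape.
def new_instance():
    return {"fitness": random.random()}

def step(state):
    candidate = state["fitness"] + random.uniform(-0.01, 0.05)
    return {"fitness": max(state["fitness"], candidate)}  # accept only improvements

random.seed(11)
print(bet_and_run(new_instance, step))
```

The open question the paper investigates is whether discarding all but one instance after the betting phase actually pays off for test case generation.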
Bytecode-Based Multiple Condition Coverage: An Initial Investigation
Abstract
Masking occurs when one condition prevents another from influencing the output of a Boolean expression. Adequacy criteria such as Multiple Condition Coverage (MCC) overcome masking within one expression, but offer no guarantees about subsequent expressions. As a result, a Boolean expression written as a single complex statement will yield more effective test cases than when written as a series of simple expressions. Many approaches to automated test case generation for Java operate not on the source code, but on bytecode. The transformation to bytecode simplifies complex expressions into multiple expressions, introducing masking. We propose Bytecode-MCC, a new adequacy criterion designed to group bytecode expressions and reformulate them into complex expressions. Bytecode-MCC should produce test obligations that are more likely to reveal faults in program logic than tests covering the simplified bytecode.
A preliminary study shows potential improvements from attaining Bytecode-MCC coverage. However, Bytecode-MCC is difficult to optimize, and means of increasing coverage are needed before the technique can make a difference in practice. We propose potential methods to improve coverage.
Srujana Bollina, Gregory Gay
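
As a reminder of what MCC demands, the sketch below enumerates one test obligation per combination of condition truth values for a decision. Grouping bytecode-level jumps back into such decisions is the idea behind Bytecode-MCC; this simple enumeration only illustrates the criterion itself.

```python
from itertools import product

def mcc_obligations(conditions):
    """Multiple Condition Coverage: one obligation per combination of condition values."""
    return [dict(zip(conditions, values))
            for values in product([True, False], repeat=len(conditions))]

# A decision such as `a && (b || c)` has three conditions, hence 2**3 = 8 obligations.
# Compiled to bytecode, it becomes a chain of single-condition jumps in which
# short-circuit evaluation masks some of these combinations.
for obligation in mcc_obligations(["a", "b", "c"]):
    print(obligation)
```

Satisfying all eight obligations rules out the masking effects that per-branch coverage of the simplified bytecode cannot detect.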

Challenge Solutions

Frontmatter
An Application of Model Seeding to Search-Based Unit Test Generation for Gson
Abstract
Model seeding is a strategy for injecting additional information in a search-based test generation process in the form of models, representing usages of the classes of the software under test. These models are used during the search process to generate logical sequences of calls whenever an instance of a specific class is required. Model seeding was originally proposed for search-based crash reproduction. We adapted it to unit test generation using EvoSuite and applied it to Gson, a Java library to convert Java objects from and to JSON. Although our study shows mixed results, it identifies potential future research directions.
Mitchell Olsthoorn, Pouria Derakhshanfar, Xavier Devroey
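
A drastically simplified view of such a usage model is a map from one call to the calls that typically follow it, sampled whenever the search needs a fresh instance. The method names below are Gson-flavoured, but the model and the sampling are invented for illustration and are not the paper's implementation.

```python
import random

# Toy usage model mined from hypothetical client code: call -> likely next calls.
USAGE_MODEL = {
    "<init>": ["toJson", "fromJson"],
    "toJson": ["toJson", "fromJson"],
    "fromJson": ["toJson", "fromJson"],
}

def seeded_call_sequence(length=4):
    """Sample a plausible call sequence from the usage model (a simplification of
    model seeding, not the mechanism used in the paper)."""
    sequence, current = [], "<init>"
    for _ in range(length):
        current = random.choice(USAGE_MODEL[current])
        sequence.append(current)
    return sequence

random.seed(5)
print(seeded_call_sequence())  # e.g. a constructor followed by (de)serialization calls
```

Seeding such sequences gives the search realistic object states to start from instead of purely random call orders.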
Generating Diverse Test Suites for Gson Through Adaptive Fitness Function Selection
Abstract
Many fitness functions—such as those targeting test suite diversity—do not yield sufficient feedback to drive test generation. We propose that diversity can instead be improved through adaptive fitness function selection (AFFS), an approach that varies the fitness functions used throughout the generation process in order to strategically increase diversity. We have evaluated our AFFS framework, EvoSuiteFIT, on a set of 18 real faults from Gson, a JSON (de)serialization library. Ultimately, we find that AFFS creates test suites that are more diverse than those created using static fitness functions. We also observe that increased diversity may lead to small improvements in the likelihood of fault detection.
Hussein Almulla, Gregory Gay
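
One generic way to realize adaptive selection is a bandit-style chooser that keeps returning the fitness function which has recently produced the best reward (e.g., diversity gain) while still exploring the others. The epsilon-greedy sketch below is only an illustration; EvoSuiteFIT's actual AFFS strategies and fitness function names differ.

```python
import random

def make_epsilon_greedy_selector(fitness_functions, epsilon=0.2):
    """Return (select, feedback) closures implementing an epsilon-greedy chooser."""
    rewards = {name: [] for name in fitness_functions}

    def select():
        untried = [name for name, obtained in rewards.items() if not obtained]
        if untried or random.random() < epsilon:
            return random.choice(untried or list(fitness_functions))
        return max(rewards, key=lambda name: sum(rewards[name]) / len(rewards[name]))

    def feedback(name, reward):
        rewards[name].append(reward)  # e.g. diversity gained in the last generation

    return select, feedback

random.seed(13)
select, feedback = make_epsilon_greedy_selector(["branch", "output_diversity", "exception"])
for generation in range(5):
    chosen = select()
    feedback(chosen, reward=random.random())  # reward would come from the search itself
    print(generation, chosen)
```

The key point is that the fitness function in use can change between generations, driven by feedback from the search itself.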

Challenge Cases

Frontmatter
Defects4J as a Challenge Case for the Search-Based Software Engineering Community
Abstract
Defects4J is a collection of reproducible bugs, extracted from real-world Java software systems, together with a supporting infrastructure for using these bugs. Defects4J has been widely used to evaluate software engineering research, including research on automated test generation, program repair, and fault localization. Defects4J has recently grown substantially, both in number of software systems and number of bugs. This report proposes that Defects4J can serve as a benchmark for Search-Based Software Engineering (SBSE) research as well as a catalyst for new innovations. Specifically, it outlines the current Defects4J dataset and infrastructure, and details how it can serve as a challenge case to support SBSE research and to expand Defects4J itself.
Gregory Gay, René Just
Backmatter
Metadata

Title
Search-Based Software Engineering
Edited by
Aldeida Aleti
Annibale Panichella
Copyright Year
2020
Electronic ISBN
978-3-030-59762-7
Print ISBN
978-3-030-59761-0
DOI
https://doi.org/10.1007/978-3-030-59762-7