Abstract
The 7th edition of the Competition on Software Testing (Test-Comp 2025) provides an overview and comparative evaluation of automatic test-suite generators for C programs. The experimental evaluation was performed on a benchmark set of 11 226 test-generation tasks for C programs. Each test-generation task consisted of a program and a test specification. The test specifications included error coverage (generate a test suite that exhibits a bug) and branch coverage (generate a test suite that executes as many program branches as possible). Test-Comp 2025 evaluated 20 software systems for test generation that are all freely available. This included 13 test-suite generators that participated with active support from teams led by 12 different representatives from 8 countries (actively maintained software systems, participation in competition jury). Test-Comp 2025 had 1 new participant ( ) and 2 re-entries (ESBMC-incr, ESBMC-kind). The evaluation also included 7 test-generation tools from previous years.
Notes
This report extends previous reports on Test-Comp [10‐14, 16, 17] by providing new results, while the procedures and setup of the competition stay mainly unchanged.
Reproduction packages are available on Zenodo (see Table 3).
1 Introduction
In its 7th edition, the International Competition on Software Testing (Test-Comp, https://test-comp.sosy-lab.org, [10‐14, 16, 17]) again compares automatic test-suite generators for C programs, in order to showcase the state of the art in the area of automatic software testing. This competition report is an update of the previous reports: it refers to the rules and definitions, presents the competition results, and gives some interesting data about the execution of the competition experiments. We use BenchExec [31] to execute the benchmark runs, BenchCloud [24] to distribute the execution to a large and elastic set of computers, FM-Weck [33] to execute tools from previous years in containers with all their requirements fulfilled, and the FM-Tools [18] collection to look up all the information we need about the tools for test-case generation, including their versions, parameters, and jury representatives. The results are presented in tables and graphs, also on the competition web site (https://test-comp.sosy-lab.org/2025/results), and are available in the accompanying archives (see Table 3).
Competition Goals. In summary, the goals of Test-Comp are the following [11]:
Establish standards for software test generation. This means, most prominently, to develop a standard for marking input values in programs, define an exchange format for test suites, agree on a specification language for test-coverage criteria, and define how to validate the resulting test suites.
Establish a set of benchmarks for software testing in the community. This means to create and maintain a set of programs together with coverage criteria, and to make those publicly available for researchers to be used in performance comparisons when evaluating a new technique.
Provide an overview of available tools for test-case generation and a snapshot of the state-of-the-art in software testing to the community. This means to compare, independently from particular paper projects and specific techniques, different test generators in terms of effectiveness and performance.
Increase the visibility and credits that tool developers receive. This means to provide a forum for presentation of tools and discussion of the latest technologies, and to give the participants the opportunity to publish about the development work that they have done.
Educate PhD students and other participants on how to set up performance experiments, package tools in a way that supports reproduction, and how to perform robust and accurate research experiments.
Provide resources to development teams that do not have sufficient computing resources and give them the opportunity to obtain results from experiments on large benchmark sets.
Related Competitions. In the field of formal methods, competitions are respected as an important evaluation method and there are many competitions [8, 26]. We refer to the report from Test-Comp 2020 [11] for a more detailed discussion and give here only the references to the most related competitions: Competition on Software Verification (SV-COMP) [19], Competition on Search-Based Software Testing (SBST) [54], and the DARPA Cyber Grand Challenge [56]. For the techniques used for automatic software testing, we refer to the literature [5, 41].
2 Definitions, Formats, and Rules
Organizational aspects such as the classification (automatic, off-site, reproducible, jury, training) and the competition schedule are given in the initial competition definition [10]. In the following, we repeat some important definitions that are necessary to understand the results.
Test-Generation Task. A test-generation task is a pair of an input program (program under test) and a test specification. A test-generation run is a non-interactive execution of a test generator on a single test-generation task, in order to generate a test suite according to the test specification. A test suite is a sequence of test cases, given as a directory of files according to the format for exchangeable test-suites.1
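As an illustration (a minimal sketch, not a task from the benchmark set), a program under test typically marks its input values via special external functions and marks the bug location via a call to a designated function; the names __VERIFIER_nondet_int and reach_error below follow the conventions commonly used in the benchmark collection, and their definitions are supplied later by a test harness during validation.

extern int __VERIFIER_nondet_int(void);  // marks a nondeterministic input value
extern void reach_error(void);           // designated function that marks the bug location

int main(void) {
  int x = __VERIFIER_nondet_int();  // value chosen by a test case
  int y = __VERIFIER_nondet_int();
  if (x > 0 && y == x + 1) {
    reach_error();  // Cover-Error: a test suite should reach this call
  }
  return 0;         // Cover-Branches: both branches of the if should be executed
}

A test case for such a program then simply provides one concrete value for each call of __VERIFIER_nondet_int, in the order in which the calls occur during execution.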
Fig. 1.
Flow of the Test-Comp execution for one test generator (taken from [11])
Execution of a Test Generator. Figure 1 illustrates the process of executing one test-suite generator on the benchmark suite. One test run for a test-suite generator gets as input (i) a program from the benchmark suite and (ii) a test specification (cover bug, or cover branches), and returns as output a test suite (i.e., a set of test cases). The test generator is contributed by a competition participant as a software archive in ZIP format on Zenodo, via a DOI entry of a version in the FM-Tools record of the test generator. All test runs are executed centrally by the competition organizer.
Execution of the Test Validator. The test-suite validator takes as input the test suite from the test generator and validates it by executing the program on all test cases: for bug finding it checks if the bug is exposed and for coverage it reports the coverage. We use the tool TestCov [30]2 as test-suite validator.
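Conceptually (a rough sketch under the assumptions of the program shown above, not TestCov's actual implementation), replaying one test case amounts to linking the program under test against a small harness that returns the recorded input values in order and reports when the error location is reached; for branch coverage, the program is additionally compiled with coverage instrumentation.

#include <stdio.h>
#include <stdlib.h>

// Hypothetical test case providing two input values (x = 5, y = 6).
static const int test_inputs[] = {5, 6};
static size_t next_input = 0;

// Returns the recorded input values one by one, in the order of the calls.
int __VERIFIER_nondet_int(void) {
  if (next_input < sizeof(test_inputs) / sizeof(test_inputs[0])) {
    return test_inputs[next_input++];
  }
  return 0;  // default if the test case contains fewer values than requested
}

// Reaching this function means the test case exposes the bug.
void reach_error(void) {
  fprintf(stderr, "error location reached\n");
  exit(1);
}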
In Test-Comp 2025, we used TestCov in four configurations: (a) We use separate validations based on the compiler, with GCC and with Clang. The motivation for this is that the two different compilers use different choices for unspecified behavior, where the C standard leaves certain choices up to the compiler (for example, the unspecified order of evaluation of function arguments). (b) We use separate validations based on the formatting after instrumentation, with and without formatting. The motivation for this is that due to incompatibilities of the tools for formatting and coverage measurement, we would like to make sure to obtain the best possible coverage measurement by using those variants. For each test-validation run, the best of the four results is used to determine the score.
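The following small program (a constructed example, not from the benchmark set) shows why the compiler choice can matter: the C standard does not fix the order in which the arguments of a call are evaluated, so GCC and Clang may legitimately produce different outputs, and consequently a test suite may reach different branches or coverage values depending on the compiler used for validation.

#include <stdio.h>

static int counter = 0;

static int f(void) { return ++counter; }     // has a side effect on counter
static int g(void) { return counter * 10; }  // result depends on that side effect

int main(void) {
  // Unspecified behavior: the compiler may evaluate f() or g() first,
  // so the output is either "1 10" or "1 0".
  printf("%d %d\n", f(), g());
  return 0;
}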
Test Specification. The specification for testing a program is given to the test generator as an input file (one of the two specification files used for Test-Comp 2025).
The definition init(main()) is used to define the initial states of the program under test by a call of function main (with no parameters). The definition FQL(f) specifies that coverage definition f should be achieved. The FQL (FShell query language [45]) coverage definition COVER EDGES(@DECISIONEDGE) means that all branches should be covered (typically used to obtain a standard test suite for quality assurance) and COVER EDGES(@CALL(foo)) means that a call (at least one) to function foo should be covered (typically used for bug finding). A complete specification looks like: COVER(init(main()), FQL(COVER EDGES(@DECISIONEDGE))).
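For illustration (a sketch, not taken from the benchmark set), the FQL formula COVER EDGES(@DECISIONEDGE) requires that both outgoing edges of every decision are executed by some test case:

// Every decision (the while condition and the if condition) has two
// outgoing edges; a branch-covering test suite must execute all four.
int classify(int n) {
  int steps = 0;
  while (n > 1) {        // edge into the loop body and edge leaving the loop
    if (n % 2 == 0) {    // then-edge and else-edge
      n = n / 2;
    } else {
      n = 3 * n + 1;
    }
    steps++;
  }
  return steps;
}

int main(void) {
  return classify(6) == 8 ? 0 : 1;  // this single input already covers all four decision edges
}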
Table 1 lists the two FQL formulas that are used in test specifications of Test-Comp 2025; there was no change from 2020 (except that a special function used in earlier editions does not exist anymore).
Table 1.
Coverage specifications used in Test-Comp 2025 (similar to 2019–2024)
License and Qualification. The license of each participating test generator must allow its free use for reproduction of the competition results. The license for each tool is available in the FM-Tools entry for the tool, as well as in Table 4. Details on qualification criteria can be found in the competition report of Test-Comp 2019 [12].
3 Categories and Scoring Schema
Benchmark Programs. The input programs were taken from the largest and most diverse open-source repository of software-verification and test-generation tasks3, which is also used by SV-COMP [19]. As in 2020 and 2021, we selected all programs for which the following properties were satisfied (see issue on GitLab4 and report [12]):
1. compiles with gcc, if a harness for the special methods5 is provided,
2. should contain at least one call to a nondeterministic function,
3. does not rely on nondeterministic pointers,
4. does not have expected result ‘false’ for property ‘termination’, and
5. has expected result ‘false’ for property ‘unreach-call’ (only for category Cover-Error).
This selection yielded a total of 11 226 test-generation tasks, namely 1 215 tasks for category Cover-Error and 10 011 tasks for category Cover-Branches. The test-generation tasks are partitioned into categories, which are listed in Tables 6 and 7 and described in detail on the competition web site.6 Figure 2 illustrates the category composition.
Fig. 2.
Category structure for Test-Comp 2025
Category Cover-Error. The first category is to show the abilities to discover bugs. The benchmark set consists of programs that contain a bug. We produce for every tool and every test-generation task one of the following scores: 1 point, if the validator succeeds in executing the program under test on a generated test case that exposes the bug (i.e., the specified function was called), and 0 points, otherwise.
Category Cover-Branches. The second category is to cover as many branches of the program as possible. The coverage criterion was chosen because many test generators support this standard criterion by default. Other coverage criteria can be reduced to branch coverage by transformation [44]. We produce for every tool and every test-generation task the coverage of branches of the program (as reported by TestCov [30]; a value between 0 and 1) that are executed for the generated test cases. The score is the returned coverage.
Max Over All Validators. As mentioned before, TestCov is executed four times on each test suite, using four different configurations. The score of a test suite is the maximum of the four computed scores.
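In summary, the score that a test suite S receives for a task can be written as follows, with cov_c(S) denoting the result that TestCov reports under validator configuration c (a compact restatement of the rules above, not an additional rule):

\[
  \mathit{score}(S) \;=\; \max_{c \,\in\, \{\mathrm{GCC},\,\mathrm{Clang}\} \times \{\text{formatted},\,\text{unformatted}\}} \mathit{cov}_c(S)
\]

where cov_c(S) is in {0, 1} for Cover-Error (bug exposed or not) and in [0, 1] for Cover-Branches (fraction of executed branches).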
Ranking. The ranking was decided based on the sum of points (normalized for meta categories). In case of a tie, the ranking was decided based on the run time, which is the total CPU time over all test-generation tasks. Opt-out from categories was possible and scores for categories were normalized based on the number of tasks per category (see competition report of SV-COMP 2013 [9], page 597).
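The normalization itself is defined in the SV-COMP 2013 report [9]; as a rough sketch of the underlying idea (a paraphrase for orientation, not the normative definition), a meta category with subcategories C_1, ..., C_k weights each subcategory by its average per-task score, so that large and small subcategories contribute equally:

\[
  \mathit{score}_{\mathrm{meta}} \;=\; \Big(\tfrac{1}{k}\textstyle\sum_{i=1}^{k} |C_i|\Big) \cdot \sum_{i=1}^{k} \frac{\mathit{score}(C_i)}{|C_i|}
\]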
Fig. 3.
Benchmarking components of Test-Comp and competition’s execution flow (same as for Test-Comp 2020)
Table 2.
Publicly available components for reproducing Test-Comp 2025
Table 3.
Artifacts published for Test-Comp 2025
Table 4.
Competition candidates with tool references and representing jury members; symbols in the table mark first-time participants and inactive (hors concours) participation; licenses are abbreviated, see the hyperlink or tool page at FM-Tools for the specific version of the license; TestCov is the validator that computes the score for each test suite
4 Reproducibility
We followed the same competition workflow that was described in detail in the previous competition report (see Sect. 4, [13]). All major components that were used for the competition were made available in public version-control repositories. An overview of the components that contribute to the reproducible setup of Test-Comp is provided in Fig. 3, and the details are given in Table 2. We refer to the report of Test-Comp 2019 [12] for a thorough description of all components of the Test-Comp organization and how we ensure that all parts are publicly available for maximal reproducibility.
In order to guarantee long-term availability and immutability of the test-generation tasks, the produced competition results, and the produced test suites, we also packaged the material and published it at Zenodo (see Table 3).
The competition used CoVeriTeam [28]7 again to provide participants access to execution machines that are similar to actual competition machines. The competition report of SV-COMP 2022 provides a description on reproducing individual results and on trouble-shooting (see Sect. 3, [15]). A new component in Test-Comp 2025 was the use of the container solution FM-Weck [33], which makes it possible to include also older archives in the comparative evaluation, even if the tools were made for an older distribution of Ubuntu or use packages that are not available anymore. The tools can specify in their FM-Tools [18] entry a container in which they can run.
Table 5.
Technologies and features that the test generators used
5 Results and Discussion
This section presents the results of the competition experiments. The report shall help to understand the state of the art and the advances in fully automatic test generation for whole C programs, in terms of effectiveness (test coverage, as accumulated in the score) and efficiency (resource consumption in terms of CPU time). All results mentioned in this article were inspected and approved by the participants.
Participating Test-Suite Generators. Table 4 provides an overview of the participating test generators and references to publications, as well as the team representatives of the jury of Test-Comp 2025. (The competition jury consists of the chair and one member of each participating team.) An online table with information about all participating systems is provided on the competition web site.8 Table 5 lists the features and technologies that are used in the test generators.
There are test generators that did not actively participate (tester archives taken from last year) and that are not included in rankings. Those are called inactive participation, and the tools are labeled with a dedicated symbol in the tables. In the past, we named those inactive tools ‘hors concours’, but since there could be other reasons for hors-concours participation (for example, meta tools that consist of other participating tools), we now use the more specific term ‘inactive’.
Computing Resources. The computing environment and the resource limits were the same as for Test-Comp 2024 [17], except for the upgraded operating system: Each test run was limited to 4 processing units (cores), 15 GB of memory, and 15 min of CPU time. The test-suite validation was limited to 2 processing units, 7 GB of memory, and 5 min of CPU time. The machines for running the experiments are part of a compute cluster that consists of 168 machines. Each machine had one Intel Xeon E3-1230 v5 CPU, with 8 processing units each, a frequency of 3.4 GHz, 33 GB of RAM, and a GNU/Linux operating system (x86_64-linux, Ubuntu 24.04 with Linux kernel 6.8). We used BenchExec [31] to measure and control computing resources (CPU time, memory, CPU energy), BenchCloud [24] to distribute, install, run, and clean-up test-case generation runs, and to collect the results, and FM-Weck [33] to prepare the correct container according to the tools’ FM-Tools [18] entry. The values for CPU time are accumulated over all cores of the CPU. Further technical parameters of the competition machines are available in the repository which also contains the benchmark definitions.9
Table 6.
Quantitative overview over all results; empty cells mark opt-outs; symbols in the table mark first-time participants and hors-concours participation
Table 7.
Overview of the top-three test generators for each category (measurement values for CPU time rounded to two significant digits, in hours)
One complete test-generation execution of the competition consisted of 235 746 single test-generation run executions. The total CPU time was 3.7 years for one complete competition run for test generation (without validation). Test-suite validation consisted of 987 888 single test-suite validation runs. The total consumed CPU time was 0.95 years. Each tool was executed several times, in order to make sure no installation issues occur during the execution. Including preruns, the infrastructure managed a total of 968 364 test-generation runs (consuming 4.9 years of CPU time). The prerun test-suite validation consisted of 4 212 084 single test-suite validation runs (consuming 3.8 years of CPU time).
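As a rough plausibility check of these figures (simple arithmetic on the numbers above, not additional measurement data), the 3.7 CPU-years spent on 235 746 test-generation runs correspond to an average of about

\[
  \frac{3.7 \cdot 365 \cdot 24\ \mathrm{h}}{235\,746} \approx 0.14\ \mathrm{h} \approx 8\ \mathrm{min}
\]

of CPU time per run, which is consistent with the limit of 15 min of CPU time per test-generation run.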
Quantitative Results. The quantitative results are presented in the same way as last year: Table 6 presents the quantitative overview of all tools and all categories. The head row mentions the category and the number of test-generation tasks in that category. The tools are listed in alphabetical order; every table row lists the scores of one test generator. We indicate the top three candidates by formatting their scores in bold face and in larger font size. An empty table cell means that the test generator opted-out from the respective main category (perhaps participating in subcategories only, restricting the evaluation to a specific topic). More information (including interactive tables, quantile plots for every category, and also the raw data in XML format) is available on the competition web site10 and in the results artifact (see Table 3). Table 7 reports the top three test generators for each category. The consumed run time (column ‘CPU Time’) is given in hours and the consumed energy (column ‘Energy’) is given in kWh.
Fig. 4.
Quantile functions for category Overall. Each quantile function illustrates the quantile (x-coordinate) of the scores obtained by test-generation runs below a certain number of test-generation tasks (y-coordinate). More details were given previously [12]. The graphs are decorated with symbols to make them better distinguishable without color.
Score-Based Quantile Functions for Quality Assessment. We use score-based quantile functions [31] because these visualizations make it easier to understand the results of the comparative evaluation. The web site (See Footnote 10) and the results artifact (Table 3) include such a plot for each category; as example, we show the plot for category Overall (all test-generation tasks) in Fig. 4. We had 18 test generators participating in category Overall, for which the quantile plot shows the overall performance over all categories (scores for meta categories are normalized [9]). A more detailed discussion of score-based quantile plots for testing is provided in the Test-Comp 2019 competition report [12].
Fig. 5.
Number of evaluated test generators for each year (blue/bottom: active participants from previous years, green/middle: number of first-time participants, gray/top: inactive participants from previous years)
6 Conclusion
The 7th Competition on Software Testing continues to provide an overview of fully-automatic test-generation tools for C programs. A total of 20 test-suite generators was compared (see Fig. 5 for the participation numbers and Table 4 for the details). This off-site competition uses a benchmark infrastructure that makes the execution of the experiments fully automatic and reproducible. Transparency is ensured by making all components available in public repositories and by having a jury (consisting of members from each team) that oversees the process. All test suites were validated by the test-suite validator TestCov [30] to measure the coverage. For the first time, the competition used several different validation runs for each test suite, in order to obtain the best possible coverage result, using different compiler backends and different formatting choices after instrumentation for coverage measurement. The results of the competition were presented at the 28th International Conference on Fundamental Approaches to Software Engineering (FASE) at ETAPS 2025 in Hamilton, Canada.
Data-Availability Statement
The test-generation tasks and results of the competition are published at Zenodo, as described in Table 3. All components and data that are necessary for reproducing the competition are available in public version repositories, as specified in Table 2. For easy access, the results are presented also online on the competition web site https://test-comp.sosy-lab.org/2025/results.
Funding Statement
This project was funded in part by the Deutsche Forschungsgemeinschaft (DFG) — 418257054 (Coop).
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.