
Open Access 2025 | Original Paper | Book Chapter

SV-COMP’25 Reproduction Report (Competition Contribution)

Authors: Levente Bajczi, Zsófia Ádám, Zoltán Micskei

Published in: Tools and Algorithms for the Construction and Analysis of Systems

Publisher: Springer Nature Switzerland


Abstract

The chapter examines the intricate process of reproducing the International Competition on Software Verification (SV-COMP), a central event in the software verification community. The study highlights the technological complexity involved, as multiple tools are executed on thousands of programs and validated by numerous validators, requiring large-scale computing infrastructure. The authors conducted their experiments in parallel with SV-COMP’25, using a high-performance computing cluster and a carefully designed benchmarking workflow. Key findings include discrepancies in task verdicts between the official results and the reproduction environment, highlighting issues such as missing packages, assumed read/write access, and mismatches in result-file patterns. The chapter also addresses the challenges of reusing the tools outside of SV-COMP and offers recommendations to improve the reliability of the competition and the accessibility of its tools. By addressing these issues, the study aims to strengthen trust in SV-COMP’s results and to facilitate the adoption of its technological stack for broader applications in software verification and benchmarking.

1 Introduction and Goals

The International Competition on Software Verification (SV-COMP) [2] is a long-running driving force in the software verification community, with a continuously expanding portfolio of benchmarks and tools.
Reproducibility, reusability, and availability have always been key priorities. However, the size and complexity of SV-COMP entail a complex technological stack for benchmarking. Instead of running a single verifier on a single program, several tens of tools are each run on a subset of thousands of programs, multiplied by the number of validators that must be executed on the verifiers’ results. This necessitates scaling up to a large cluster of machines, and with each step, the technological complexity grows.
There has been a partial reproduction of SV-COMP’23 [4], but we believe that a full reproducibility study of SV-COMP is due, to:
  • evaluate reproducibility across different environments,
  • provide an overview of the tasks and solutions involved, offering insights valuable to other competition organizers,
  • facilitate the adoption of this technological stack for applications beyond SV-COMP, e.g., benchmarking by researchers and tool developers,
  • identify hidden issues, e.g., hidden dependencies or missing documentation,
  • ensure that the competition can endure changes in organization,
  • enhance trust in SV-COMP’s results by reaffirming their reliability.

2 Reproduction Workflow

We did not use the official reproduction packages for SV-COMP’25, as they were not yet finalized at the time of writing. Instead, we executed our experiments in parallel with the competition. This should not influence the validity of our findings, but some of the highlighted problems might already have been solved.
We relied on the official execution/reproduction guide for SV-COMP1, which provides scripts to run the verification tasks, run the validation tasks, and prepare the tables for the results. It is referenced by the benchmark definition repository2, which contains the .xml files used for running BenchExec [3] for both the verifiers and the validators; and the category-structure.yml file, which defines the participation and role of each tool in the competition. This repository also references the Formal-Methods Tools repository3, which contains further metadata (such as necessary packages) as well as a DOI referencing each version of each tool. The execution scripts use this DOI to download the tool binaries and execute them against the benchmarks4. The reproduction workflow can be seen (summarized) in Figure 1.
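To illustrate the shape of this workflow, the following Python sketch mirrors its main steps: reading category-structure.yml, looking up each tool’s archive DOI in the Formal-Methods Tools metadata, downloading the archive, and invoking BenchExec on the corresponding benchmark definition. The file layout and YAML field names ("verifiers", "versions", "doi") are assumptions for illustration; the official scripts implement these steps in full detail.

  import json
  import subprocess
  import urllib.request
  import yaml  # pip install pyyaml

  def load_yaml(path):
      with open(path) as f:
          return yaml.safe_load(f)

  def download_zenodo_archive(doi, dest):
      # fm-tools DOIs point at Zenodo records; resolve the record through the
      # Zenodo REST API and fetch the first attached file (the tool archive).
      record_id = doi.rsplit("zenodo.", 1)[-1]
      with urllib.request.urlopen(f"https://zenodo.org/api/records/{record_id}") as resp:
          record = json.load(resp)
      urllib.request.urlretrieve(record["files"][0]["links"]["self"], dest)

  structure = load_yaml("benchmark-defs/category-structure.yml")
  for tool in structure["verifiers"]:                    # assumed field name
      meta = load_yaml(f"fm-tools/data/{tool}.yml")      # assumed repository layout
      doi = meta["versions"][0]["doi"]                   # assumed field name
      download_zenodo_archive(doi, f"archives/{tool}.zip")
      subprocess.run(["unzip", "-q", f"archives/{tool}.zip", "-d", f"tools/{tool}"], check=True)
      # BenchExec runs the verifier on the sv-benchmarks tasks selected by its
      # benchmark definition (.xml); validators are run the same way afterwards.
      subprocess.run(["benchexec", f"benchmark-defs/{tool}.xml",
                      "--outputpath", f"results/{tool}"], check=True)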
Fig. 1.
Reproduction Workflow
Fig. 2.
Benchmarking Workflow
We used the Hungarian Komondor cluster5, which consists of computation nodes with two 64-core AMD EPYC 7763 CPUs and 256 GB of system memory each, running Red Hat Enterprise Linux 8.6. We requested array jobs via the SLURM Workload Manager with resource allocations outlined in SV-COMP’s rules (i.e., 15 GB memory, 4 CPU cores, and 15 CPU-minutes for each verification task), with multiple tasks aggregated in the same job both sequentially and in parallel, in a Singularity6 container based on Ubuntu 24.04. Runexec was used to sequester the tasks to non-overlapping CPU cores.
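As an illustration of how tasks were packed onto a node, the sketch below launches several runs in parallel, each confined by runexec to SV-COMP’s per-task limits and to a disjoint set of four cores. The task list, commands, and paths are placeholders; the actual job scripts are generated per category and tool.

  import subprocess

  MEMLIMIT_BYTES = 15 * 1000**3   # 15 GB of memory per task (SV-COMP rules)
  CPUTIME_LIMIT_S = 900           # 15 CPU-minutes per task
  CORES_PER_TASK = 4

  # Placeholder task list; in the real workflow this is generated from the
  # benchmark definitions and split across SLURM array jobs.
  tasks = [
      ("tools/tool-a", ["./run-verifier", "task-0001.yml"]),
      ("tools/tool-b", ["./run-verifier", "task-0002.yml"]),
  ]

  procs = []
  for slot, (workdir, cmd) in enumerate(tasks):
      first_core = slot * CORES_PER_TASK
      core_range = f"{first_core}-{first_core + CORES_PER_TASK - 1}"   # "0-3", "4-7", ...
      procs.append(subprocess.Popen(
          ["runexec",
           "--memlimit", str(MEMLIMIT_BYTES),
           "--timelimit", str(CPUTIME_LIMIT_S),
           "--cores", core_range,
           "--output", f"logs/slot-{slot}.log",
           "--"] + cmd,
          cwd=workdir))   # run from the tool directory, as in SV-COMP
  for proc in procs:
      proc.wait()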
We used the fuse-overlayfs package to create an overlay file system in the containers. For performance and stability, we found that mapping only the current working directory as read-write is the best option, as we expect the witness to be produced there. Mapping the entire directory hierarchy (including sv-benchmarks with almost 200 000 files) as read-write caused instability problems when run on multiple hundreds of nodes from the same network filesystem, often causing fuse to become stuck in uninterruptible sleep. Therefore, all directories above the tool directory (the working directory in SV-COMP) are mapped read-only. The benchmarking workflow can be seen summarized in Figure 2.
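A minimal sketch of this mapping, with placeholder paths and names: the tool directory is merged with a node-local writable layer via fuse-overlayfs, while the rest of the checkout is bound read-only into the Singularity container.

  import subprocess

  TOOL_DIR = "/work/svcomp/tools/tool-a"   # writable: the witness is expected here
  UPPER = "/tmp/overlay/upper"             # node-local scratch for modified files
  WORK = "/tmp/overlay/work"               # internal fuse-overlayfs state
  MERGED = "/tmp/overlay/merged"           # mount point seen by the tool

  subprocess.run(["mkdir", "-p", UPPER, WORK, MERGED], check=True)
  # Writable overlay over the tool directory only.
  subprocess.run(["fuse-overlayfs",
                  "-o", f"lowerdir={TOOL_DIR},upperdir={UPPER},workdir={WORK}",
                  MERGED], check=True)
  # Everything above the tool directory, including sv-benchmarks, is mapped
  # read-only into the container.
  subprocess.run(["singularity", "exec",
                  "--bind", "/work/svcomp/sv-benchmarks:/svcomp/sv-benchmarks:ro",
                  "--bind", f"{MERGED}:/svcomp/tools/tool-a",
                  "ubuntu-24.04.sif",
                  "bash", "-lc", "cd /svcomp/tools/tool-a && ./run-verifier task.yml"],
                 check=True)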
We only ran tools participating in SV-COMP’25 that are neither hors concours nor inactive. This resulted in running 33 verifiers (for a list, see Figure 3) and 33 validators. The results are published as a Zenodo artifact [1]. The tables are hosted at home.mit.bme.hu/~bajczi/sv-comp-repro-25. The ranking summary is available at results-verified/scoretable.html. The rankings correspond to the official rankings of verifiers (excluding tools with problems, see Sect. 3). In total, 38 358 tasks out of more than 495 828 have mismatching verdict categories (19 222 due to errors detailed in Sect. 3, and 19 136 due to other factors, mostly reaching resource limits at different times). 1 officially correct result flipped to wrong (mlb), and 24 officially wrong results flipped to correct (mlb, swat). Compared to the official results, tasks correctly solved in both environments used 24% less walltime, 32% less CPU time, and 11% more memory in the reproduction environment. Detailed statistics and differences are published on Zenodo [1].
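As a sketch of the kind of comparison behind these numbers, the snippet below parses two BenchExec result files (official vs. reproduction) and counts the tasks whose verdict category differs. The file names are placeholders, and it is assumed here that the verdict category is stored as a "category" column in the result XML.

  import bz2
  from xml.etree import ElementTree

  def verdict_categories(path):
      # One <run> element per task; its <column> children carry status,
      # category, cputime, etc.
      with bz2.open(path) as f:
          root = ElementTree.parse(f).getroot()
      categories = {}
      for run in root.iter("run"):
          columns = {c.get("title"): c.get("value") for c in run.iter("column")}
          categories[run.get("name")] = columns.get("category")
      return categories

  official = verdict_categories("official/tool-a.results.xml.bz2")          # placeholder
  reproduction = verdict_categories("reproduction/tool-a.results.xml.bz2")  # placeholder

  mismatching = [task for task, cat in official.items()
                 if task in reproduction and reproduction[task] != cat]
  print(f"{len(mismatching)} of {len(official)} tasks have mismatching verdict categories")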
Fig. 3.
Possible causes of failure. “?” denotes uninterpretable logs. Tools with “!” or “?” failed to run in the category; tools with a success mark were at least partly successful.

3 Conclusion and Recommendations

Besides some minor issues that we submitted solutions to via merge requests (benchmarking/competition-scripts!160, !161), we did not find any problems in the competition scripts that would impede the reproduction workflow. Our only recommendation is to extend the documentation with details on how a competition can be run, e.g., how to run ranking calculations or how to perform pre-runs with restricted resource allocation.
However, in our reproduction environment, multiple tools produced no correct results, misrepresenting their actual SV-COMP performance. This stems from differences between SV-COMP’s execution environment and our reproduction attempt. Solving these issues is key to enhancing SV-COMP’s reliability and simplifying independent tool execution.
Missing Packages. Tool developers declare the packages they need alongside the tool archives, but many tools omit necessary packages or declare non-existent ones. In the reproduction environment, we installed all the packages explicitly declared by the tools. Consequently, some tools had access to dependencies they had not declared themselves, simply because other tools had declared them. To uncover how widespread this issue is, we ran a separate experiment, executing each tool in each major category (opt-ins were left out) with only the respective tool’s declared packages installed. The results can be seen in Figure 3 and in the Zenodo archive [1].
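A sketch of the per-tool package setup used in this experiment follows, assuming the declared packages are listed under a "required_ubuntu_packages" field in the tool’s fm-tools metadata (the field name and layout are assumptions):

  import subprocess
  import yaml  # pip install pyyaml

  def declared_packages(tool):
      # "required_ubuntu_packages" is an assumption about the metadata schema.
      with open(f"fm-tools/data/{tool}.yml") as f:
          meta = yaml.safe_load(f)
      return meta.get("required_ubuntu_packages", [])

  def install_only_declared(packages):
      # 'apt-cache show' fails for packages that do not exist in the archive,
      # i.e., the "non-existent package" case described above.
      missing = [pkg for pkg in packages
                 if subprocess.run(["apt-cache", "show", pkg],
                                   capture_output=True).returncode != 0]
      if missing:
          print("declared but not installable:", ", ".join(missing))
      existing = [pkg for pkg in packages if pkg not in missing]
      if existing:
          subprocess.run(["apt-get", "install", "-y"] + existing, check=True)

  install_only_declared(declared_packages("tool-a"))   # placeholder tool name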
Assumed Read-Write Access. Some tools tried to overwrite input files. While this is not against the rules of SV-COMP, we feel this goes against the purpose of making verification tools accessible to people outside of SV-COMP: if a verification tool modifies the input files it verifies, users may (justifiably) avoid using it. Such problems are also denoted in the table in Figure 3.
Result Files Pattern Mismatch. The benchmark definition files specify which files to keep after verification. Some tools do not include their own witness filenames (e.g., Deagle produces an error-witness.graphml file, but the definition only keeps **/witness.* files). The competition environment presumably disregards this directive and keeps every file (otherwise, the tools in question could not get their witnesses validated). However, this means that the benchmark definition files cannot be reused outside of SV-COMP as-is.
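The mismatch is easy to demonstrate; the following snippet uses Python’s glob semantics for illustration (BenchExec’s own pattern matching may differ in details, but the witness file is missed either way):

  import tempfile
  from pathlib import Path

  with tempfile.TemporaryDirectory() as tmp:
      outdir = Path(tmp)
      # Deagle's witness name from the example above.
      (outdir / "error-witness.graphml").touch()
      kept = list(outdir.glob("**/witness.*"))   # the pattern from the benchmark definition
      print(kept)   # [] -- the witness would not be kept as a result file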
Working Directory Must Be Tool Directory. Some tool developers presume that BenchExec will always be executed from the tool directory. Offending tools mostly rely on relative paths being resolved from the working directory rather than prepending the directory of the tool archive to these paths. This makes it hard to use the tool binaries outside of SV-COMP and benchmarking.
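The usual fix is to resolve bundled resources relative to the tool’s own location rather than the current working directory, as in the following hypothetical wrapper script (all names are illustrative):

  import subprocess
  import sys
  from pathlib import Path

  # Location of the tool archive, independent of the caller's working directory.
  TOOL_DIR = Path(__file__).resolve().parent
  SOLVER = TOOL_DIR / "lib" / "solver"   # hypothetical bundled binary

  def main():
      # Forward the task arguments; works when invoked from any directory.
      subprocess.run([str(SOLVER)] + sys.argv[1:], check=True)

  if __name__ == "__main__":
      main()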

3.1 Recommendations

In order to address the uncovered problems, we recommend making the following modifications to SV-COMP:
  • Enforce package declaration by a CI-check for the Formal-Methods Tools repository (preferred) or as part of the tool qualification review (otherwise);
  • Encode in the rules which files the verification tools may modify/create (e.g., all files within the current working directory);
  • Enforce the result files pattern in the competition environment;
  • Encode in the rules that tools must be able to execute when called from outside their respective directories, and enforce this rule by introducing a CI-check (preferred) or a manual review (otherwise).
While we understand that rules will never cover all corner cases that would make tools SV-COMP-specific, these modifications cover a large set of problems that make the reuse of participating tools difficult. We believe that making participating verifiers accessible to anyone (and not just as benchmarking baselines, but as actual verification tools) is one of the most significant purposes of SV-COMP, and these modifications would enhance its impact.
Funding. This research was partially funded by the EKÖP-24-3 New National Excellence Program under project numbers EKÖP-24-3-BME-288 and EKÖP-24-3-BME-213, and the Doctoral Excellence Fellowship Programme under project numbers 400434/2023 and 400433/2023; funded by the NRDI Fund of Hungary. We also acknowledge KIFÜ (Governmental Agency for IT Development, Hungary, ror.org/01s0v4q65) for awarding us access to the Komondor HPC facility based in Hungary.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Metadata
Title: SV-COMP’25 Reproduction Report (Competition Contribution)
Authors: Levente Bajczi, Zsófia Ádám, Zoltán Micskei
Copyright year: 2025
DOI: https://doi.org/10.1007/978-3-031-90660-2_10