Abstract
PROTON 2.1 presents (1) a new termination checking technique that uses a fine-tuned local LLM to synthesize ranking functions, and (2) support for multiple SAT solvers for non-termination checking.
Notes
R. Metta—Jury member.
1 PROTON 2.1: Ranking function synthesis via LLMs
The termination check in PROTON 1.1.2 [17] is based on loop unwinding assertions via CBMC [13]. In SV-COMP 2024 [11], this check failed on \(123\) benchmarks (e.g. Fig. 1), mainly because a large number of unwindings (\(\ge 10^5\)) are needed to satisfy the unwinding assertions in these benchmarks. To effectively check termination of such programs, we synthesize ranking functions by integrating a small, locally hosted LLM based on the Llama-3.2-1B-Instruct model (henceforth referred to as Llama-1B) [1] into the latest version of PROTON (2.1). This approach, henceforth referred to as LLM-T-Check, is similar to [15, 16]. As the out-of-the-box Llama-1B failed to synthesize valid ranking functions even for trivial programs, we fine-tuned it to unlock its ability to guess candidate ranking functions (henceforth referred to as CRFs) and designed a check to validate the correctness of these guessed CRFs, as described below.
Generating Fine-Tuning Data: For each program \(P_i\) of the 1536 terminating benchmark programs in SV-COMP, and for each loop \(L_j\) in \(P_i\), we assembled a prompt \(\pi _{ij}\) containing the source code of \(P_i\) and the body of loop \(L_j\), following a template similar to the one in [16]. Then, an instance of the Llama-3-70B-Instruct model [3] (henceforth referred to as Llama-70B) hosted on Hugging Face was queried once with each of these \(\pi _{ij}\). The response \(r_{ij}\) was parsed to extract the loop variant \(v_{ij}\) and supporting invariant \(u_{ij}\) guessed by Llama-70B. Each \(P_i\) was then annotated with ACSL [9] specifications corresponding to \(u_{ij}, v_{ij}\) and was passed to the wp [8] plugin of Frama-C [14]. If Frama-C was able to validate \(u_{ij}, v_{ij}\), a data point \(d_{ij}\) consisting of \(P_i\), \(L_j\), and \(v_{ij}\) was added to the dataset D. This gave us a high-quality fine-tuning dataset D with 321 programs.
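For illustration, the following is a minimal Python sketch of this data-generation loop. The helper names passed in as callables (build_prompt, query_llama70b, parse_variant_invariant, annotate_with_acsl) are placeholders for steps the text describes but whose implementations are not shown here, and the success criterion used when parsing the Frama-C wp output is an assumption.

```python
import re
import subprocess
from typing import Callable, Iterable, List, Tuple

def framac_validates(annotated_file: str) -> bool:
    """Run the Frama-C wp plugin on the ACSL-annotated program and accept
    it if every generated goal is proved (parsing the 'Proved goals: X / Y'
    summary line is an assumption about the output format)."""
    out = subprocess.run(["frama-c", "-wp", annotated_file],
                         capture_output=True, text=True)
    m = re.search(r"Proved goals:\s*(\d+)\s*/\s*(\d+)", out.stdout)
    return m is not None and m.group(1) == m.group(2)

def build_dataset(programs: Iterable,                       # the 1536 terminating benchmarks
                  build_prompt: Callable[[str, str], str],  # template as in [16]
                  query_llama70b: Callable[[str], str],     # one query per prompt
                  parse_variant_invariant: Callable[[str], Tuple[str, str]],
                  annotate_with_acsl: Callable) -> List[dict]:
    dataset = []
    for p in programs:
        for loop in p.loops:
            prompt = build_prompt(p.source, loop.body)
            variant, invariant = parse_variant_invariant(query_llama70b(prompt))
            annotated = annotate_with_acsl(p, loop, invariant, variant)
            if framac_validates(annotated):                 # keep only validated pairs
                dataset.append({"program": p.name, "loop": loop.id, "variant": variant})
    return dataset
```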
Fine-Tuning Llama-1B: The 321 benchmarks in D include only 22 of the \(123\) benchmarks that are hard for PROTON 1.1.2. To offset this under-representation, while simultaneously ensuring that the model is able to solve programs from outside the class of programs that are hard for PROTON 1.1.2, we separated these 22 programs into a set F. Then, during training, we ensured that the model sees data points from F with a higher probability (0.7) than those in \(D \setminus F\) (0.3). Using D and F, Llama-1B was fine-tuned with a loss that, given a prompt with \(P_i\) and \(L_j\), rewards responses containing \(u_{ij}, v_{ij}\) and penalizes other responses. The fine-tuning dataset and the scripts are available in [6].
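A minimal sketch of the biased sampling scheme is shown below; the sampler function is illustrative and not PROTON's actual training code, which only the probabilities (0.7 for F, 0.3 for \(D \setminus F\)) follow from the text.

```python
import random
from typing import Iterator, List

def biased_sampler(f_points: List[dict], rest_points: List[dict],
                   p_f: float = 0.7) -> Iterator[dict]:
    """Yield an endless stream of training examples, drawing from F with
    probability p_f (0.7) and from D minus F with probability 1 - p_f (0.3)."""
    while True:
        pool = f_points if random.random() < p_f else rest_points
        yield random.choice(pool)
```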
Guessing a CRF: The llama.cpp [2] model inferencing library, accessed via the llama-cpp-python [4] bindings, provides an API to efficiently run a GGUF model locally and is optimized to consume minimal CPU and memory resources. We use a custom prompt template based on the ones used by [16] to run our fine-tuned LLM via llama.cpp and generate a text response comprising the CRF and a supporting invariant. We parse this response using a Python script.
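The following sketch shows how the fine-tuned GGUF model can be run locally with llama-cpp-python. The model file name, the prompt template, and the assumed "variant:/invariant:" response format are illustrative assumptions; only the generation limits (64 new tokens, context length 2048, see Sect. 2) follow the paper.

```python
import re
from llama_cpp import Llama

# Hypothetical inputs: the real prompt template follows [16].
program_source = open("benchmark.c").read()        # program under analysis
loop_body = "while (i < n) { i = i + 2; }"         # body of the loop of interest

llm = Llama(model_path="proton-llama-1b.Q8_0.gguf",  # fine-tuned model (file name assumed)
            n_ctx=2048, verbose=False)

prompt = ("### Program:\n" + program_source +
          "\n### Loop:\n" + loop_body +
          "\n### Suggest a ranking function (variant) and a supporting invariant:\n")
out = llm(prompt, max_tokens=64, temperature=0.0)
text = out["choices"][0]["text"]

# Assumed response format: "variant: <expr>\ninvariant: <expr>"
variant = re.search(r"variant:\s*(.+)", text)
invariant = re.search(r"invariant:\s*(.+)", text)
```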
Fig. 1. count-by-nondet.c
Validating the CRF: We then instrument the source program with assertions that characterize the validity of the guessed CRF as follows. First, we add a new variable, initialized to the maximum value of the type of the ranking function (at line 2 in Fig. 1), that tracks the value of the ranking function in the previous iteration (line 6). We then add two assertions at the head of the loop (lines 4 and 5) to check that the value of the ranking function is less than this tracked previous value and remains non-negative. We run randomized tests on this instrumented program to quickly check whether the CRF can be violated (restricted to at most 100 tests in SV-COMP 2025). If the CRF is not violated by randomized testing, we run CBMC with a small number of unwinds to check for any violations missed by the random tests. If CBMC does not report any violations, we assume that the CRF represents a valid ranking function and report termination. This validation check is therefore unsound, and hence not ideal. However, the strategy has been surprisingly effective in catching invalid ranking functions and has not produced any false positives on the SV-COMP benchmarks.
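A minimal sketch of this validation step is given below. The CBMC command line and its "VERIFICATION FAILED" output are standard, while the compiled test harness and the way random inputs are supplied (via argv) are assumptions about the implementation.

```python
import random
import subprocess

def random_tests_pass(instrumented_exe: str, n_tests: int = 100) -> bool:
    """Run the compiled, instrumented program on random inputs; a non-zero
    exit code (a failed CRF assertion) means the candidate is violated.
    Passing the random seed via argv is an assumption about the harness."""
    for _ in range(n_tests):
        seed = str(random.randrange(2 ** 31))
        if subprocess.run([instrumented_exe, seed]).returncode != 0:
            return False
    return True

def cbmc_finds_violation(instrumented_c: str, unwind: int = 3) -> bool:
    """Bounded check of the CRF assertions with CBMC."""
    out = subprocess.run(["cbmc", instrumented_c, "--unwind", str(unwind)],
                         capture_output=True, text=True)
    return "VERIFICATION FAILED" in out.stdout

def crf_accepted(instrumented_c: str, instrumented_exe: str) -> bool:
    """Accept the CRF only if neither random testing nor bounded model
    checking finds a violation (the unsound check described above)."""
    return random_tests_pass(instrumented_exe) and not cbmc_finds_violation(instrumented_c)
```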
Table 1. Evaluation Results
Evaluation: The remaining 1215 terminating benchmarks in SV-COMP outside D were never seen during fine-tuning, and may therefore safely be considered a test set. Let us call these T. With this training-testing split, we evaluated the fine-tuned LLM, and the results are given in the first three columns of Table 1. Also, during the SV-COMP competition run, LLM-T-Check was run on 81 terminating benchmarks, since PROTON 1.1.2 returned UNKNOWN or ERROR on these (see Sect. 2). Let R denote the set of these 81 benchmarks. The last three columns of Table 1 show how many of these programs were in F, D, and T, and the performance of the LLM on each of these sets.
Does the Model Generalize: We found that the fine-tuned LLM used in LLM-T-Check is able to guess good CRFs for 21 out of the 1215 programs that were not in D. This is supported by the data in Table 1, which shows several instances where the fine-tuned LLM was able to guess potentially correct CRFs for programs outside D, both in our own evaluation and in the SV-COMP runs. For instance, one such benchmark was not in D because Llama-70B guessed an incorrect CRF for it, whereas the fine-tuned LLM was able to guess a correct one. Further, we found that our model was able to make good guesses for CRFs even on modified versions of the SV-COMP benchmarks that were not even present in the test set. Thus, the fine-tuned model displays evidence of generalization.
1.1 Support for multiple SAT solvers in non-termination checking
PROTON 1.1.2 encodes an assertion in each loop to check if a program state visited during some arbitrary iteration of the loop recurs in a subsequent iteration, using Z3 as the backend solver. In PROTON 2.1, we additionally added support for the Glucose [7] and Kissat [12] SAT solvers. We architected the PROTON 1.1.2 scripts such that adding support for a new SAT or SMT solver requires only a four-line wrapper that invokes the solver. This enables PROTON to be invoked with a solver of the user's choice, or even with multiple solvers one after another.
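An illustrative sketch of such a per-solver wrapper is shown below. The function name is hypothetical and not PROTON's actual interface; interpreting the standard SAT-competition exit codes (10 for SAT, 20 for UNSAT) is an assumption about how the solvers are invoked.

```python
import subprocess
from typing import Optional

def run_sat_solver(solver: str, dimacs_file: str) -> Optional[bool]:
    """Invoke a SAT solver binary (e.g. 'kissat' or 'glucose') on a DIMACS
    file and interpret the competition exit codes: 10 = SAT, 20 = UNSAT."""
    rc = subprocess.run([solver, dimacs_file],
                        capture_output=True, text=True).returncode
    return {10: True, 20: False}.get(rc)   # None if the solver gave no verdict
```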
Fig. 2. Tool flow of LLM-T-Check
2 Software Architecture
Figure 2 shows the flow of LLM-T-Check implemented in PROTON 2.1. It starts with calling PROTON 1.1.2 [17]. LLM-T-Check is called only if PROTON 1.1.2 returns unknown or error. LLM-T-Check starts by performing LLM inference using our fine-tuned LLM via llama.cpp and obtains a text response, with a limit of 64 new tokens and a context length of 2048 tokens. If the token limit is exceeded by the prompt itself, or if the response does not contain any grammatically valid CRF, we declare the generation as failed and return unknown. If the generation succeeds, we first perform randomized testing to check the validity of the CRF. If the tests fail, we again return unknown; otherwise we run CBMC (version 5.95.0) with a small number of unwinds (pre-configured to a maximum of 3 unwinds for SV-COMP 2025) to further validate the CRF. If CBMC does not invalidate the CRF, we report the program as terminating. The ranking function generation uses llama.cpp via the llama-cpp-python bindings. Other components of PROTON 2.1 are the same as in PROTON 1.1.2; we refer interested readers to [17] for further details.
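The control flow of Fig. 2 can be summarized by the following sketch. The step functions are passed in as callables and stand for the components described in Sect. 1; only the decision structure and the limits (64 new tokens, context length 2048, at most 3 unwinds) follow the text.

```python
from typing import Callable, Optional

def llm_t_check(program: str,
                base_proton: Callable[[str], str],           # PROTON 1.1.2
                llm_inference: Callable[[str], str],         # fine-tuned Llama-1B via llama.cpp
                parse_crf: Callable[[str], Optional[str]],
                instrument: Callable[[str, str], str],       # returns instrumented C file
                random_tests_pass: Callable[[str], bool],
                cbmc_finds_violation: Callable[[str, int], bool]) -> str:
    verdict = base_proton(program)
    if verdict not in ("unknown", "error"):
        return verdict                       # LLM-T-Check is not invoked
    response = llm_inference(program)        # limited to 64 new tokens, n_ctx = 2048
    crf = parse_crf(response)
    if crf is None:                          # prompt too long or no grammatically valid CRF
        return "unknown"
    instrumented = instrument(program, crf)
    if not random_tests_pass(instrumented):
        return "unknown"
    if cbmc_finds_violation(instrumented, 3):    # CBMC 5.95.0, max 3 unwinds
        return "unknown"
    return "terminating"
```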
3 Strengths and Weaknesses
In SV-COMP 2025, LLM-T-Check ran on 206 benchmarks that PROTON 1.1.2 could not solve. Of these, 23 are recursive, which our instrumenter does not support. Of the remaining 183, 102 were non-terminating and 81 were terminating.
Strengths: For the non-terminating benchmarks, incorrect ranking functions guessed by the LLM were rejected by the validation phase. Thus, even though our current validation check is unsound in theory, it is unlikely to produce false positives in practice. Of the 81 terminating benchmarks, many require upwards of \(10^5\) loop unwinds for termination, and hence the termination check of PROTON 1.1.2 did not scale on these. LLM-T-Check is able to solve 14 of these 81. It is interesting to note that one of these benchmarks is solved only by LLM-T-Check, with \((\mathtt {10^5-i})\) as the ranking function, while none of the other tools in SV-COMP 2025 are able to solve it. This is a promising first step in integrating locally hosted LLMs, which, in addition to their privacy and cost benefits, allow us to build offline modules that can be bundled into software verification tools.
Weaknesses: LLM-T-Check was unable to find ranking functions for 67 of the 81 programs. Of these, 45 are complex programs with multiple loops from certain benchmark categories; these programs exceed the token limit of the online hosted teacher Llama-70B, and hence will require more sophisticated solutions to generate CRFs. For the remaining 22, the guessed CRFs were invalidated during the checking phase.
Future work: Currently, we are limited by the token limit of the online LLM when generating D. To overcome this, we plan to use a fine-tuning technique that is independent of online hosted models, such as fine-tuning with feedback from solvers, thereby circumventing the token limit.
4 Tool Configuration, Setup, and Contributors
PROTON 2.1 comes with an MIT license. Instructions to install and run the tool, including a sample run command, are provided in PROTON 2.1's Zenodo Archive [5]. PROTON 2.1 opted to participate only in the Termination category in SV-COMP 2025 [10].
Software Project and Contributors: PROTON 2.1 is developed and maintained by the authors at TCS Research, IIT Delhi and IIT Bombay. We thank everyone who has contributed to the development of PROTON 2.1, Clang and LLVM Infrastructure, CBMC, Glucose Syrup, Kissat, Z3, HuggingFace Transformers, PyTorch, Meta-Llama-3-70B-Instruct, Llama-3.2-1B-Instruct, llama.cpp, and llama-cpp-python.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.