Dieses Kapitel untersucht das Potenzial von Graphics Processing Units (GPUs) zur Beschleunigung der Modellüberprüfung, wobei der Schwerpunkt auf der probabilistischen Modellüberprüfung (PMC) und dem Tool GPUexplore liegt. Anfänglich wurden GPUs verwendet, um Teile der Modellüberprüfung zu beschleunigen, aber Werkzeuge wie GPUexplore führen den gesamten Prozess nun auf einer GPU aus. GPUexplore konstruiert die Zustandsräume von Markov-Ketten und verifiziert Diskrete-Zeit-Markov-Ketten (DTMCs) gegen Probabilistic Computation-Tree Logic (PCTL) -Formeln. Das Tool akzeptiert DTMC-Modelle im JANI-Format, wandelt sie in CUDA-Code um und führt das Modell aus, um neue Zustände zu erzeugen. Der Aufbau des staatlichen Weltraums wird massiv parallel von GPU-Threads durchgeführt, wobei gemeinsamer Speicher für das Caching und eine globale Baumdatenbank für die Speicherung genutzt werden. GPUexplore nutzt NVIDIAs Unified Memory, um die Speichernutzung der Geräte zu maximieren, und unterstützt komprimierten State Storage, um den Speicherbedarf zu reduzieren. Der Verifikationsprozess beruht auf Matrix-Vektor-Multiplikationen und nutzt die cuSPARSE-Bibliothek von NVIDIA. Experimentelle Ergebnisse zeigen die Skalierbarkeit und Beschleunigung von GPUexplore im Vergleich zu herkömmlichen Werkzeugen wie Storm, insbesondere für große Zustandsräume und komplexe PCTL-Formeln. Das Kapitel schließt mit Plänen für zukünftige Optimierungen und Unterstützung für weitere Modelltypen, wobei das Potenzial der GPU-beschleunigten PMC hervorgehoben wird.
KI-Generiert
Diese Zusammenfassung des Fachinhalts wurde mit Hilfe von KI generiert.
Abstract
GPUexplore\(^{\textsc {prob}}\) is an extension of GPUexplore that constructs state spaces of Markov Chains and performs probabilistic model checking entirely on a GPU. It can construct the state space of a Discrete-Time Markov Chain and verify that it satisfies a given Probabilistic Computation-Tree Logic formula. We present the tool, and experimentally compare with Storm, demonstrating its effectiveness.
Hinweise
This work is supported by NWO grant OCENW.M.21.061 for the GAP project.
1 Introduction
Graphics Processing Units (GPUs) have great potential to accelerate model checking, both explicit-state [8, 10, 24, 26, 29, 30, 35] and symbolic, based on SAT solving [25]. Initially, GPUs were used to accelerate parts of the model checking procedure, such as successor generation [12], property checking using a given in-memory state space [3], and counter-example construction [34], but soon, tools were developed to perform model checking entirely on a GPU [4, 8, 10, 27, 31‐33, 35]. One of these tools, GPUexplore, has been actively developed since 2013, and was recently equipped with an algorithm to verify Linear-Time Temporal Logic formulae [24]. In this paper, we present GPUexplore\(^{\textsc {prob}}\), an extension of GPUexplore aimed at probabilistic model checking (PMC) [2]. The tool is available on Zenodo[14].
In the earliest work on GPU accelerated model checking, PMC was already targeted, since the matrix-vector multiplications required to verify probabilistic properties can be ideally performed by GPUs [5, 6, 19, 28]. However, state space exploration, i.e., the identification of all states reachable from a defined initial state, is much harder to parallelise, and formed the bottleneck of the computation, as it was still performed by a single CPU thread. Once the state space was explored, the resulting transition matrix, storing all the transitions of the state space, was copied to the GPU device memory, after which the multiplications could be performed. Using the experience gained in parallelising state space exploration for GPUexplore [29‐31, 33], we have developed the first GPU tool that constructs the state spaces of Markov Chains and performs explicit-state PMC entirely on a GPU. To verify probabilistic systems, GPUexplore\(^{\textsc {prob}}\) currently supports checking whether a Discrete-Time Markov Chain [22] (DTMC) satisfies a Probabilistic Computation-Tree Logic (PCTL) formula [2]. For each state in a DTMC, a single probability distribution is defined over its successor states, i.e., the probabilities of its outgoing transitions add up to 1. Support for other model types, such as Continuous-Time Markov Chains (CTMCs) and Markov Decision Processes (MDPs), is planned for future work. These types are more expressive, but offer similar parallelisation challenges, hence GPUexplore\(^{\textsc {prob}}\) currently already demonstrates the effectiveness of GPU accelerated PMC in general.
Anzeige
2 Construction and Verification of DTMCs
From model to transition matrix.GPUexplore\(^{\textsc {prob}}\) has been developed to run on NVIDIA GPUs with at least Compute Capability 7.0, and is implemented in CUDA
. It currently accepts DTMC models in the JANI format [7]. Prism [20] models, together with a PCTL formula, can first be converted to JANI, using the Storm-conv tool of the Storm model checker [11, 15] (see step 1 in Fig. 1). A JANI model is given to the GPUexplore\(^{\textsc {prob}}\) exporter tool, written in Python, that produces CUDA
code necessary to identify all the system states reachable from the specified initial state of the model (step 2). In particular, the exporter generates CUDA functions that accept a state as input, and, when executed, produce successor states of the given state, in accordance with the model, as output. Instead of interpreting the model, GPUexplore\(^{\textsc {prob}}\) therefore executes the model to create new states. The code produced by the exporter can be combined with GPUexplore\(^{\textsc {prob}}\) generic code, which implements the main data structures, such as hash table, vector and matrix, and CUDA functions that are common for all input models, such as finding and storing states in the hash table. The combined code can be compiled using NVIDIA’s nvcc compiler (step 3) resulting in an executable, specific for the input model.
Fig. 1.
Workflow from model to executable.
Fig. 2.
Workflow from state space construction to property checking.
Once an executable has been obtained, state space construction and verification can commence. Fig. 2 shows the workflow of this, and Fig. 3 visualises how GPUexplore\(^{\textsc {prob}}\) runs on a GPU, in particular which data is stored in which type of memory. First, an exploration kernel, i.e., CUDA function, is executed repeatedly to identify, construct and store all states reachable from the initial state (step 1 in Fig. 2). We assume that there is a finite number of reachable states. The explore operation is performed in a massively parallel way by the GPU threads. These threads run in blocks of 512 threads on Streaming Multiprocessors (SM), see Fig. 3. By default, 3,240 of these blocks are running, with the SMs interleaving between the blocks. Each block has some fast, on-chip memory, called shared memory, which is used as a state cache to store the states that will be explored by the block, and the states constructed during the exploration. Once all the states in this cache scheduled for exploration have been processed, the newly constructed states are stored in a large global tree database, which is a hash table residing in the GPU device memory, unless they are already there. A newly stored state can immediately be claimed for exploration by the thread block storing it, but, as there is a predefined limit to the number of states a block can explore in parallel, in practice, often many states will be left to be picked up by other blocks. This ensures that in many cases, depending on the state space structure, many blocks will perform work simultaneously. An important feature of GPUexplore\(^{\textsc {prob}}\), new w.r.t. GPUexplore, is that it uses NVIDIA’s unified memory. This allows combining CPU memory with device memory into one address space. Because of this feature, GPUexplore\(^{\textsc {prob}}\) can use more of the available device memory for its tree database, and hence explore larger state spaces. After the exploration stage, the tree database can be moved to CPU memory to make room for the transition matrix while still remaining accessible.
Fig. 3.
GPUexplore\(^{\textsc {prob}}\) running on a GPU.
To make better use of the relatively small amount of device memory, states are stored as binary trees, with the leaves containing the actual state information, i.e., the values of the variables in the model and the current state of each process in the model. By doing so, common subtrees can be shared by states, which in practice means that 2–6\(\times \) less memory is required [29]. In addition, the tree roots are stored in the tree database in a compressed form by means of Cleary compression [9, 29]. The result is that roots can be stored as 32-bit integers, while non-roots are stored as 64-bit integers.
Once all the reachable states have been stored, the threads perform several combined parallel prefix sums [16] (step 2 of Fig. 2). This is the first step of the procedure to derive a transition matrix from the filled tree database. In such a matrix, a non-zero probability \(p \in \langle 0, 1]\) at position (i, j) indicates that from state i, a transition with probability p exists to state j. Deriving this matrix from the tree database in a massively parallel way is non-trivial, and our procedure has been specifically designed for GPUexplore\(^{\textsc {prob}}\). As transition matrices tend to be very sparse, the matrix is stored in Compressed Sparse Row (CSR) format, so only the non-zero entries are actually stored. To achieve this, every stored state needs to be assigned a unique ID in the range \([0,n\rangle \), with n the number of reachable states, and for each state, the number of outgoing transitions, and therefore successor states, needs to be recorded in device memory. During exploration, when a state has been explored, we store its number of successors at its location in the tree database. After exploration, these numbers are gathered together into an array using stream compaction [18], and they are added up with a parallel prefix sum, as illustrated in Fig. 4. Initially, the successor numbers are gathered, see the top of the figure. The algorithm works in rounds. In each round i, the first one being 0, the GPU threads scan this array, and the value in each cell j is updated to be the sum of its old value plus the value in cell \(j - 2^i\), if it exists. After \(\log n\) rounds, the algorithm terminates. After all resulting values have been shifted one position to the right, and the first cell has been assigned 0, the offsets have been obtained. A very similar operation can be performed to assign the states consecutive IDs. Essentially, for every state, the number of states preceding it must be counted. Our procedure performs both prefix sums in a single procedure, i.e., their rounds are combined.
Fig. 4.
Computing offsets.
Anzeige
Once this step has finished, the non-zero probabilities can be stored in an array called values. For each state i, the probabilities of its outgoing transitions need to be stored in values from position offsets[i] to position offsets[\(i+1\)]. Likewise, the IDs of the states reached via these transitions, i.e., the columns of the matrix, are stored in an array succs. Storage of the probabilities and the successor IDs is again performed in a massively parallel way, by running the explore kernel again (step 3). In the kernel, the successors of each state are explored twice: once to construct the successor and retrieve its global id from global memory into shared memory, and once more to relate the retrieved global id to its predecessor. This is needed because the state storage and retrieval mechanism to and from global memory is agnostic of the source state.
Verifying a PCTL property.GPUexplore\(^{\textsc {prob}}\) has full support for the temporal logic PCTL. A PCTL formula is a state formula, in which atomic propositions are combined with logical operators plus the probabilistic operator \(\mathop {\mathbb {P}_{\mathop {\bowtie } b}} (\varphi )\), with \(\bowtie \ \in \{ \le , <, >, \ge \}\) and \(b \in [0.0,1.0]\). Here, \(\varphi \) is a path formula, in which state formulae are combined using the operators \(\mathop {\textsf{X}}\) (next), \(\mathop {\textsf{U}\,} \)(until) and \(\mathop {\textsf{U}^{\le k}}\) (bounded until). A state s satisfies \(\mathop {\mathbb {P}_{\mathop {\bowtie } b}} (\varphi )\) iff the probability of following a path from s that satisfies \(\varphi \) is in the interval represented by \(\mathop {\bowtie } b\). A derived operator \(\mathop {\mathbb {P}_{=?}} (\varphi )\) can be used to request the probability of following a path satisfying \(\varphi \).
We refrain from explaining how exactly PCTL formulae can be verified, the interested reader is referred to [2]. What is important to note here is that the verification procedure (step 4 in Fig. 2) relies on matrix-vector multiplications. For instance, for a formula of the form \(\varphi = \mathop {\mathbb {P}_{\mathop {\bowtie } b}} (\Phi _1 \mathop {\textsf{U}\,} \Phi _2)\), a system of linear equations must be solved, which can be done by using an iterative method, such as Jacobi or Gauss-Seidel. The used matrix is the transition matrix, and the initial vector v is derived from the PCTL formula. For instance, to verify the formula \(\varphi \) given above, \(v[i] = 1.0\) iff state i satisfies \(\Phi _2\). To perform matrix-vector multiplications, GPUexplore\(^{\textsc {prob}}\) uses NVIDIA’s cuSPARSE library [23].
To store intermediate verification results, GPUexplore\(^{\textsc {prob}}\) associates with each state a 64-bits bit-vector; each bit can be used to store the result of verifying a particular subformula of the given PCTL formula. Initially, the states in the tree database are inspected to determine the result for each atomic proposition in the given PCTL formula, and these results are stored in the bit-vectors, such that from that point on, the tree database does not need to be inspected anymore.
3 Experiments
Experimental setup. We compared GPUexplore\(^{\textsc {prob}}\) with Storm. Storm 1.9.0, built from commit 5d5ebe4 of the master branch of the Storm GitHub repository, was run on an AMD Ryzen™ 7 5800X from 2020, installed on an MSI MPG x570 Gaming Pro Carbon Wi-Fi motherboard, with 32GB of DDR4 RAM. This machine is equipped with an RTX 3090 GPU from 2020, with 10,496 CUDA cores and 24GB of GDDR6X VRAM. GPUexplore\(^{\textsc {prob}}\) was run on this GPU.
Fig. 5.
Scalability w.r.t. Crowds.
Benchmarks. From the Quantitative Verification Benchmark Set [13] (QComp), all DTMC models were selected with 1) at least one instance having a state space with the number of states between 5,000 and 3,000,000 states (to exclude very small state spaces, for which GPU acceleration is not useful, and to exclude state spaces too large to fit in GPUexplore\(^{\textsc {prob}}\) ’s 20GB tree database), and 2) at least one formula in PCTL, not involving a reward extension [1], as this is not yet supported by GPUexplore\(^{\textsc {prob}}\).1 In addition, benchmarks created for the Rubicon tool [17] and for [21] have been used, taken from their accompanying artifacts. Table 1 presents a list of benchmarks, referring to how our naming scheme for instances (see Table 2) provides the values of the involved constants. In addition, the verified PCTL formula is given.
Table 1.
Benchmark descriptions.
Dummy
Table 2.
Runtimes (secs.) of GPUexplore\(^{\textsc {prob}}\) and Storm. OOM: out of mem.
Dummy
Fig. 5 shows how GPUexplore\(^{\textsc {prob}}\) and Storm (with its sparse engine) scale when verifying Crowds instances. TotalRuns ranges from 3 to 15, CrowdSize is 10. For TotalRuns=15, Storm ran out of memory (32GB). GPUexplore\(^{\textsc {prob}}\) could verify this instance, thanks to unified memory, demonstrating its usefulness. For the 14-instance, GPUexplore\(^{\textsc {prob}}\) achieved an \(856\times \) speedup.
Table 2 presents the runtime results for both GPUexplore\(^{\textsc {prob}}\) and Storm (sparse engine). For each case, the number of reachable states and transitions is given. The runtimes are broken down into exploration / construction time of the state spaces and checking the PCTL formula. The last three columns compare the results of the two tools in the relevant columns, i.e., exploration time, checking time, and total time. The BRP case shows that for models with very small state spaces, both in terms of number of states and number of transitions, using a GPU is not promising, which is to be expected; to keep many GPU threads busy, large state spaces are required. The results for the other models show examples where exploration starts to pay off, with the best result being \(47\times \) faster than Storm (Crowds.15.10). Regarding formula checking, GPUexplore\(^{\textsc {prob}}\) is particularly fast when many matrix-vector multiplications are required, which is always the case when the formula contains the bounded until operator with a large bound. The best recorded result is \(968\times \) faster than Storm (Crowds.14.10). For unbounded until, Storm is often able to apply preprocessing steps to avoid the multiplications (almost) altogether. As these steps have not yet been implemented in GPUexplore\(^{\textsc {prob}}\), it tends to be slower in those cases. However, there is no fundamental reason preventing us from adding support for those steps in the tool, and we plan to add them in the near future.
Concluding, the results show the potential of GPU accelerated PMC, both for state space construction and verification. For future work, we will further optimise verification, and add support for CTMCs and MDPs. Finally, we will conduct research on using Multi-Terminal Binary Decision Diagrams.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
An exception is the EGL model, which is excluded due to a current technical limitation of our tool, related to the number of outgoing transitions a state may have.