Abstract
Z3-Noodler is a fork of the Z3 SMT solver replacing its string theory implementation with a portfolio of decision procedures and a selection mechanism for choosing among them based on the features of the input formula. In this paper, we give an overview of the used decision procedures, including a novel length-based procedure, and their integration into a robust solver with a good overall performance, as witnessed by Z3-Noodler winning the string division of SMT-COMP’24 by a large margin. We also extended the solver with support for model generation, which is essential for the use of the solver in practice.
1 Introduction
In recent years, research in string solving has gained significant traction, motivated by problems such as finding security vulnerabilities in web applications [9, 37, 45, 55] or analyzing user policies controlling access to cloud resources [5, 31, 44].
Currently, there are many string solvers utilizing various string solving approaches, usually integrated into general SMT solvers, such as cvc5 [6, 8, 32‐34, 40, 42, 43] and Z3 [10‐14, 38, 48‐51, 56, 57], or standalone string solvers, such as OSTRICH [20‐23, 36]. In order to achieve high performance on various real-world problems, a single decision procedure applied to all formulae is usually not sufficient. In practice, formulae with different characteristics often occur that are not coverable by a single procedure. One way to overcome this problem is to use a combination of dedicated procedures, each tailored to a particular fragment of string constraints. For example, given a formula containing only quadratic word equations, the Nielsen transformation [39] usually performs better than any other approach.
One of the currently best string solvers is Z3-Noodler [24, 25, 29], the winner of the string division of SMT-COMP’24 [47]. In [24], version 1.0 of Z3-Noodler was introduced, implementing the stabilization-based procedure described in previous papers [15, 25, 29]. Here we introduce version 1.3, which implements a framework for selecting and running decision procedures together with two novel decision procedures: (i) one implementing multiple heuristics for handling pure regular constraints, based on the finite automata library Mata [26], and (ii) a procedure transforming equational block string constraints into linear-integer arithmetic (LIA) constraints based on the lengths and alignments of string literals, which are then solved by Z3's internal LIA solver. Furthermore, we show how Z3-Noodler implements and optimizes the Nielsen transformation, whose preliminary implementation was already present in version 1.0. On top of that, version 1.3 extends Z3-Noodler with model generation. In particular, we explain how a model can be constructed not only for each of the aforementioned procedures, but also for the stabilization-based procedure, for which model generation was previously missing.
We evaluate the impact of the decision procedures and the model generation, and compare Z3-Noodler with other string solvers on standard SMT-LIB benchmarks. The results show that the implemented decision procedures have a large impact on the number of solved instances and the solving time, while the model generation has only a minimal impact. The comparison with other tools demonstrates that Z3-Noodler outperforms other state-of-the-art tools on the SMT-LIB benchmarks.
2 Preliminaries
Sets, functions, and graphs. We use \(\mathbb {N}\) to denote the set of natural numbers (including 0), \(\mathbb {Z}\) to denote the set of integers, \(\mathbb {B}= \{ \top , \bot \}\) to denote the Boolean values, and \(\mathbb {B}_3= \mathbb {B}\cup \{ \textsf{undef}\}\) to denote \(\mathbb {B}\) extended with an undefined value. We use \(\uplus \) to denote the disjoint union. For a function \(f:X \rightarrow Y\), we use \(f \mathop {\vartriangleleft }\{ x \mapsto y \}\) to denote the function \((f \setminus (\{ x \}\times Y)) \cup \{ x \mapsto y \}\). We use boldface \(\boldsymbol{x}\) to denote vectors and \(\boldsymbol{x}_i\) to denote the i-th item of \(\boldsymbol{x}\). A (directed) graph is a pair \(G = (V,E)\) where V is a set of nodes and \(E \subseteq V\times V\) is a set of (directed) edges.
Strings and languages. We fix a finite alphabet \(\Sigma \) of symbols for the rest of the paper and we use letters from the start of the alphabet (\(a,b,c,\dots \)) to denote symbols from \(\Sigma \). A string (or word) over \(\Sigma \) is a finite sequence \(u = a_1\cdots a_n\) of symbols from \(\Sigma \). We say that \(|u| = n\) is the length of u. The length of the empty string \(\epsilon \) is \(|\epsilon | = 0\). The set of all strings over \(\Sigma \) is denoted by \(\Sigma ^*\). The concatenation of strings u and v is denoted \(u\cdot v\) or uv for short (\(\epsilon \) is the neutral element). Moreover, iteration of a word w is inductively defined as \(w^0 = \epsilon \) and \(w^{k+1} = w^{k}\cdot w\) for \(k\in \mathbb {N}\). A language is a subset of \(\Sigma ^*\).
Automata. A (nondeterministic) finite automaton (NFA) over \(\Sigma \) is a tuple \(A=(Q,\delta ,I,F)\) where Q is a finite set of states, \(\delta \) is a set of transitions of the form \(q \xrightarrow {a} r\) with \(q,r\in Q\) and \(a\in \Sigma \), \(I\subseteq Q\) is the set of initial states, and \(F\subseteq Q\) is the set of final states. A run of \(A\) over a word \(w = w_1 \ldots w_n \in \Sigma ^*\) is a sequence of states \(q_0 \ldots q_n \in Q^{n+1}\) such that for all \(1 \le i \le n\) it holds that \(q_{i-1} \xrightarrow {w_i} q_i \in \delta \). The run is accepting if \(q_0 \in I\) and \(q_n \in F\), and the language \(L(A)\) of \(A\) is the set of all words for which \(A\) has an accepting run. To intersect the languages of two automata, we construct their product \(A\cap A' = (Q\times Q',\delta ^\times ,I\times I',F\times F')\) where \((q,q') \xrightarrow {a} (r,r') \in \delta ^\times \) iff \(q \xrightarrow {a} r \in \delta \) and \(q' \xrightarrow {a} r' \in \delta '\). The union of two NFAs, denoted \(A\cup A'\), is given as the piece-wise disjoint union of their components. The complement of \(A\) is given as \(A^\complement = (2^Q, \delta ^D, \{ I \}, F^D)\) where \(\delta ^D(S, a) = \bigcup _{q\in S}\delta (q,a)\) and \(F^D = \{ S \subseteq Q \mid S \cap F = \emptyset \}\).
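The product and complement constructions above can be illustrated with a small self-contained sketch; the representation (transitions as triples) and all function names are ours, not Mata's:

```python
from itertools import product as cartesian

# An NFA is a tuple (Q, delta, I, F) with delta a set of (q, a, r) triples.
def accepts(A, word):
    """Membership check by tracking the set of currently reachable states."""
    _, d, I, F = A
    cur = set(I)
    for a in word:
        cur = {r for (q, b, r) in d if q in cur and b == a}
    return bool(cur & F)

def nfa_product(A, B):
    """Product automaton accepting the intersection of L(A) and L(B)."""
    QA, dA, IA, FA = A
    QB, dB, IB, FB = B
    d = {((q, p), a, (r, s))
         for (q, a, r) in dA for (p, b, s) in dB if a == b}
    return (set(cartesian(QA, QB)), d,
            set(cartesian(IA, IB)), set(cartesian(FA, FB)))

def nfa_complement(A, alphabet):
    """Subset construction, then complement the set of final states."""
    _, d, I, F = A
    init = frozenset(I)
    states, work, dd = {init}, [init], set()
    while work:
        S = work.pop()
        for a in alphabet:
            T = frozenset(r for (q, b, r) in d if q in S and b == a)
            dd.add((S, a, T))
            if T not in states:
                states.add(T)
                work.append(T)
    # final iff the subset avoids every final state of A
    Fc = {S for S in states if not (S & F)}
    return (states, dd, {init}, Fc)
```

The subset construction only materializes reachable subsets, which hints at why the blow-up mentioned above depends so strongly on the input automaton.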
Basic string constraints. In this paper, we consider basic string constraints over alphabet \(\Sigma \), string variables \(\mathbb {X}\), and integer variables \(\mathbb {I}\). The string variables range over \(\Sigma ^*\) and integer variables over \(\mathbb {Z}\). In the paper, we use the letters x, y, z to denote variables. The syntax of a string constraint \(\varphi \) is given as follows:
$$ \begin{aligned} \varphi &{:}{:}{=}\; t_s = t_s \mid t_s \in \mathcal {R}\mid t_i \le t_i \mid \varphi \wedge \varphi \mid \lnot \varphi \\ t_s &{:}{:}{=}\; x_s \mid a \mid t_s \cdot t_s \\ t_i &{:}{:}{=}\; k \mid x_i \mid \texttt {len} (t_s) \mid t_i + t_i \mid k \cdot t_i \end{aligned} $$
where \(t_s\) is a string term, \(t_i\) is a linear-integer arithmetic (LIA) term, \(x_s \in \mathbb {X}\), \(a \in \Sigma \), \(x_i \in \mathbb {I}\), \(k \in \mathbb {Z}\), and \(\mathcal {R}\) is an (extended) regular expression (regex) as defined by the SMT-LIB standard [7] (classical regular expressions extended with operations such as re.compl, re.inter, ...). The semantics and satisfiability of string constraints are defined in the usual way (\(\texttt {len} (t_s)\) represents the length of the string term \(t_s\)). A formula without string terms is called a LIA formula and the set of all LIA formulae is denoted as \(\Phi _{\textsf{LIA}}\). We usually use the letters u, v, w to denote concatenations of string terms, i.e., words from \((\mathbb {X}\cup \Sigma )^*\). A string literal (or just literal) is a string term containing only symbols from \(\Sigma \). We use \(\texttt{Var}(\varphi )\) to denote the set of variables occurring in the string constraint \(\varphi \).
Other than the basic string constraints, Z3-Noodler can also handle extended string constraints (e.g., prefixof, indexof, from_int, ...) with the semantics specified by the SMT-LIB standard. We do not include these extended constraints in the definition as their solving is not the subject of this paper (we briefly discuss their handling in Section 3.2).
3 Shepherding Decision Procedures
Since handling string constraints is complex and there is no universal procedure efficient on every possible constraint, Z3-Noodler implements several decision procedures, each suitable for a different class of constraints. In this section, we describe a framework for handling the decision procedures and its position within Z3-Noodler's architecture.
3.1 Integration to Z3
Z3-Noodler replaces the string theory plugin of the DPLL(T)-based SMT solver Z3 [38] with the string theory handler that takes care of choosing, running, and processing the results of decision procedures. From a high-level point of view, the main solver, using the internal SAT solver, iteratively provides a conjunction of string atoms corresponding to a SAT solution of the input Boolean skeleton. The core of the string theory handler then works on a conjunction of string (dis)equations, regular constraints, and extended string predicates/functions that could not be axiomatized in preprocessing. The string theory plugin communicates with the main solver using theory lemmas, which steer the generation of further satisfiable assignments of the Boolean skeleton. See [24] for further details of Z3-Noodler's architecture.
3.2 Handling of Extended Constraints
Handling of extended constraints is performed in Z3-Noodler using axiomatization. Extended string functions and predicates (such as indexof, substr, contains, ...) are saturated with axioms consisting of string (dis)equations, regular constraints, and length constraints. In some special cases, axiomatization is not possible (e.g., for the general \(\lnot \texttt {contains} \) predicate or string-integer conversions). In such cases, we receive these complex constraints as part of the input conjunction and they are handled by the given decision procedure (if it supports the particular extended constraint). See [24] for more details.
3.3 Handling of Decision Procedures
General interface. In order to maintain the extensibility of Z3-Noodler, we propose a plugin architecture for string decision procedures, which all need to implement the following simple interface:
$$ \textsf{isSuitable}(\psi ) : \mathbb {B}\qquad \textsf{init}(\psi ) \qquad \textsf{preprocess}() \qquad \textsf{nextSolution}() : \mathbb {B}_3\qquad \textsf{getLIA}() : \Phi _{\textsf{LIA}}\times \{ \textsf{precise}, \textsf{underapprox}\} \qquad \textsf{getModel}(\theta , x) : \Sigma ^* $$
where \(\psi \) is a string constraint, \(\theta \) is a LIA model, i.e., an assignment of integers to LIA terms (especially the lengths of string terms), and x is a string variable. The meaning of each part of this interface is explained in the rest of this section.
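The interface can be pictured as an abstract class; the following Python sketch uses our own snake_case naming, models \(\mathbb {B}_3\) by True/False/None, and represents formulae and models by plain strings and dicts (all assumptions of ours, not Z3-Noodler's actual C++ API):

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple

class DecisionProcedure(ABC):
    """Sketch of the plugin interface for string decision procedures."""

    @abstractmethod
    def is_suitable(self, psi) -> bool: ...

    @abstractmethod
    def init(self, psi) -> None: ...

    def preprocess(self) -> None:
        """Optional per-procedure simplification; no-op by default."""

    @abstractmethod
    def next_solution(self) -> Optional[bool]:
        """True: candidate found; False: exhausted; None: undef (gave up)."""

    @abstractmethod
    def get_lia(self) -> Tuple[str, str]:
        """Returns (LIA formula, 'precise' | 'underapprox')."""

    @abstractmethod
    def get_model(self, theta: Dict[str, int], x: str) -> str:
        """Returns a word for variable x consistent with LIA model theta."""
```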
Procedure selection. The handler of decision procedures in Z3-Noodler selects a proper procedure using the suitability check \(\textsf{isSuitable}(\psi )\). This check takes a string constraint \(\psi \) and decides whether the given decision procedure is suitable for it. The first suitable procedure is chosen, with procedures ordered from the most specific to the most general: starting with the procedure for pure regular constraints (Section 4), followed by the Nielsen transformation (Section 5) and the length-based procedure (Section 6). If none of these procedures is suitable, the stabilization-based procedure (Section 7) is chosen. Therefore, this procedure must always return \(\top \) in \(\textsf{isSuitable}\). Furthermore, some of the decision procedures can be incomplete, i.e., they may lead to an inconclusive state. If this happens, the handler invokes the next suitable procedure.
Algorithm 1: Decision procedure handler
Procedure execution. A simplified schema of the execution of the selected decision procedure \(\mathcal {D}\) on a string constraint \(\psi \) is shown in Algorithm 1. It starts by initializing the decision procedure with the string constraint using \(\textsf{init}(\psi )\), followed by the application of preprocessing steps tailored for the given decision procedure in \(\textsf{preprocess}\), which may (significantly) simplify the formula and make the solving easier. Note that preprocessing of input formulae is performed at two levels: (i) simplifications during formula rewriting done in the core Z3 solver, and (ii) preprocessing of conjunctions of atomic string constraints done here in \(\textsf{preprocess}\). For the latter, the preprocessing rules are independent of Z3's rules, as they are tightly integrated into the string theory decision procedure.
The algorithm then iteratively computes solutions using \(\textsf{nextSolution}\), which moves the internal state of the decision procedure to the point before the next LIA check. This is the point where the decision procedure has found a possible solution, i.e., a solution of the non-LIA part of the input formula, and we need to check whether it is compatible with the LIA part of the formula. The return value of \(\textsf{nextSolution}\) indicates whether the generation of possible solutions is finished and, if it is, whether it was exhaustive.
The value \(\top \) means that the generation is not finished, and we can continue with a LIA satisfiability check. This check is accomplished using \(\textsf{getLIA}\), which returns the LIA formula \(\beta \) describing the LIA part of the currently found solution, together with an indication p of its precision. We check the satisfiability of \(\beta \) using Z3's LIA solver (which also contains the input LIA formula and formulae generated during the solver run) and if it is satisfiable, the string theory handler returns \(\textsf{sat}\) to the main solver with the theory lemma \(\beta \). The LIA formula \(\beta \) can sometimes be underapproximating (\(p = \textsf{underapprox}\)), which is useful for the stabilization and length-based procedures. If \(\beta \) is underapproximating, Z3-Noodler utilizes an approximation module interacting with decision procedures via the solver variable \(\mathsf {solver.precision}\). At the very end, before the final result of Z3-Noodler is given, the approximation module checks whether the value of \(\mathsf {solver.precision}\) is compatible with the final answer. More precisely, if we ever underapproximated any LIA formula from \(\textsf{getLIA}\), then a final \(\textsf{unsat}\) becomes \(\textsf{unknown}\) instead.
On the other hand, the return values \(\bot \) and \(\textsf{undef}\) of \(\textsf{nextSolution}\) indicate that the generation of possible solutions is finished. For \(r = \bot \), the generation was exhaustive; therefore, we return \(\textsf{unsat}\) with the length formula \(\varphi \) describing the LIA part of all string solutions provided by \(\textsf{nextSolution}\). The string theory handler then adds a new theory lemma of the form \(\psi \rightarrow \varphi \), which is used to force the internal SAT solver to find another satisfiable assignment. For \(r = \textsf{undef}\), the generation was not exhaustive, which means the decision procedure \(\mathcal {D}\) is not complete, and the string theory handler repeats this step with the next suitable decision procedure. Note that for the stabilization-based procedure, \(\textsf{nextSolution}\) never returns \(\textsf{undef}\), as it is the last procedure.
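The execution loop described above can be sketched as follows. This is a simplification of Algorithm 1: the unsat case here drops the length formula \(\varphi \), the undef case simply reports back to the caller, and `lia_sat` is a hypothetical stand-in for Z3's internal LIA solver:

```python
def run_procedure(D, psi, lia_sat):
    """Drive one decision procedure D on string constraint psi.

    D follows the interface above (init/preprocess/next_solution/get_lia);
    lia_sat(beta) -> bool stands in for the LIA satisfiability check.
    """
    D.init(psi)
    D.preprocess()
    underapproximated = False
    while True:
        r = D.next_solution()
        if r is True:                       # a candidate solution was found
            beta, precision = D.get_lia()
            if precision == 'underapprox':
                underapproximated = True
            if lia_sat(beta):
                return ('sat', beta)        # beta becomes a theory lemma
        elif r is False:                    # exhaustive: no solution exists
            # any underapproximated check downgrades unsat to unknown
            return ('unknown', None) if underapproximated else ('unsat', None)
        else:                               # r is None (undef): incomplete
            return ('undef', None)          # caller tries the next procedure
```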
Model generation. After a decision procedure \(\mathcal {D}\) returns \((\textsf{sat}, \varphi )\), the string theory handler pushes the LIA formula \(\varphi \) as a new theory lemma, which forces Z3 to generate the correct LIA model \(\theta \) which maps LIA terms into integers (especially integer variables and length and string-integer conversion terms). Following this, Z3 iteratively asks the string theory handler for a model of some string term, which is translated into its corresponding string variable x (based on axiomatization). The handler then calls \(\mathcal {D}.\textsf{getModel}(\theta , x)\) and returns the computed model for x based on the LIA model \(\theta \).
4 Efficient Handling of Regular Constraints
In this section, we propose a procedure for handling pure regular constraints, i.e., regular constraints without (dis)equations or length constraints. Solving these constraints can be done by basic automata/regex-based reasoning. Here, the most difficult operation is automata complementation, corresponding to negation in the constraint, since it may cause a state blow-up during determinization of the automaton (this happens especially for automata obtained from regexes containing loop bounds). Therefore, our procedure tries to avoid explicit complementation and handles such constraints differently.
Automata construction. For a regular expression \(\mathcal {R}\), we use a procedure \(\textsf{aut}\) for the inductive construction of a (nondeterministic) automaton corresponding to \(\mathcal {R}\). The procedure \(\textsf{aut}\) uses eager simulation-based reduction [19], applied after each inductive step. If some sub-expression requires later complementation (because it is under the re.compl regex operator), we eagerly determinize and minimize the automaton for the sub-expression (using Brzozowski-based minimization [18]). We use the automata library Mata [26] to handle finite automata and operations over them.
4.1 General Regular Constraint
$$ \bigwedge _{1\le i \le n} x \in \mathcal {S}_i \wedge \bigwedge _{1\le i \le m} x \not \in \mathcal {R}_i $$
We want to decide the satisfiability of the string constraint above. To do this, we first construct the product \(P =\bigcap _{1 \le i \le n} \textsf{aut}(\mathcal {S}_i)\) of the automata for the positive constraints (if \(n = 0\), we set P to be the universal automaton with the language \(\Sigma ^*\)). We do this iteratively, with the regexes \(\mathcal {S}_i\) sorted according to an estimated size of the corresponding NFA (smallest first). The estimation is based particularly on regex loop bounds, as they affect the resulting size the most. The product P may become empty during this construction, in which case we can immediately decide unsatisfiability without having to construct the product for the larger regexes. For the negated constraints, we would need to compute the product of complemented automata, which we want to avoid as it might be expensive. Instead, we construct the union \(U = \bigcup _{1 \le i \le m} \textsf{aut}(\mathcal {R}_i)\) (if \(m = 0\), we set U to be the empty automaton with the language \(\emptyset \)). We then want to check whether the difference of P and U (i.e., \(P \cap U^\complement \)) is non-empty, which is the same as checking whether the inclusion \(L(P) \subseteq L(U)\) does not hold. This can be accomplished using antichain-based algorithms [4, 54], which perform well on real-world instances.
Model generation. A model of x is any word w belonging to the language of \(P \cap U^\complement \). We construct the product \(P \cap U^\complement \) (including the complement \(U^\complement \)) lazily, until some word (the returned model) belonging to the language is found. This seems to work reasonably well, as the found models are usually quite short. Note that if there are no negated regular constraints, the model is any word from P, which can easily be found by applying a depth-first search until some final state is reached.
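Searching for a model in an automaton can be sketched as a graph search. The paper mentions a depth-first search; the sketch below is a breadth-first variant (our choice) that additionally returns a shortest witness, matching the observation that found models are usually quite short:

```python
from collections import deque

def shortest_word(nfa):
    """Return a shortest word accepted by the NFA, or None if L is empty.

    The NFA is (delta, initial, final) with delta a dict
    state -> {symbol: set of successor states} (an encoding of ours).
    """
    delta, initial, final = nfa
    queue = deque((q, "") for q in initial)
    seen = set(initial)
    while queue:
        q, w = queue.popleft()          # BFS: words are popped shortest-first
        if q in final:
            return w
        for a, succs in delta.get(q, {}).items():
            for r in succs:
                if r not in seen:       # each state needs only one witness
                    seen.add(r)
                    queue.append((r, w + a))
    return None
```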
4.2 Single Regular Constraint
For a single regular constraint, either in the positive (\(x \in \mathcal {R}\)) or the negative (\(x \not \in \mathcal {R}\)) form, we try to postpone the automaton construction and instead gather information about the regex based on its structure. In particular, we propagate triples \((e, u, \ell )\) where \(e \in \mathbb {B}_3\) is a flag denoting whether the regex includes the empty word, \(u \in \mathbb {B}_3\) is a flag denoting whether the regex is universal, and \(\ell \in \mathbb {N}\cup \{\textsf{undef}\}\) denotes the minimum length of a word recognized by the regex. The value \(\textsf{undef}\) represents that it is not possible to compute the value of the flag or the length from the given information. We then use the flag e for the positive constraint (or u for the negative one) to decide if it is satisfiable, completely avoiding automaton construction. If the flag is undefined, we continue as with the general case.
Example 1
Consider the regex \(\texttt {re.++}(R_1, R_2)\), where the propagated value for \(R_1\) is \((e_1, u_1, \ell _1)\) and for \(R_2\) it is \((e_2, u_2, \ell _2)\). The resulting propagated value corresponding to the concatenation is given as \((e_1 \wedge e_2, u, \ell _1 + \ell _2)\) where \(u = \bot \) if \( \ell _1 + \ell _2 > 0\), otherwise \(u = \textsf{undef}\). Note that \(\textsf{undef}\) behaves as an annihilating element in operations (i.e., if \(\textsf{undef}\) occurs in the expression, the result is \(\textsf{undef}\)).
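The propagation rule of Example 1 can be written down directly; a minimal sketch where \(\textsf{undef}\) is modeled by None and triples are plain tuples (our encoding):

```python
UNDEF = None  # stands for the undef value of B3 / an undefined length

def concat_triple(t1, t2):
    """Combine (e, u, l) triples of two regexes under re.++ (concatenation).

    Following Example 1: undef annihilates, the minimum lengths add up,
    and the concatenation is surely non-universal when l1 + l2 > 0.
    """
    (e1, u1, l1), (e2, u2, l2) = t1, t2
    e = UNDEF if UNDEF in (e1, e2) else (e1 and e2)
    l = UNDEF if UNDEF in (l1, l2) else l1 + l2
    u = False if (l is not UNDEF and l > 0) else UNDEF
    return (e, u, l)
```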
Model generation. If the regex is in the positive form and does not contain more complex operations (intersection, complement, or difference), then we construct the model directly from the regex. Otherwise, we need to construct the automaton from the regex and get the model similarly as for the general case.
4.3 Implementation
The function \(\textsf{isSuitable}(\psi )\) returns \(\top \) if \(\psi \) contains only regular constraints. There is no preprocessing and because we do not work with LIA constraints, \(\textsf{getLIA}\) always returns \((\top , \textsf{precise})\). The functions \(\textsf{nextSolution}\) and \(\textsf{getModel}\) implement the procedure and model generation as explained in this section (the LIA model \(\theta \) is ignored in \(\textsf{getModel}\)).
5 Nielsen Transformation
Another decision procedure used in Z3-Noodler is the Nielsen transformation [39]. We use the Nielsen transformation for satisfiability checking of conjunctions of string equations \(\mathcal {E}\) that are not suitable for the stabilization-based procedure. After a brief description of the Nielsen transformation, we propose an approach for (partial) handling of length constraints within the Nielsen transformation as currently used in Z3-Noodler. We also discuss particular implementation details, including preprocessing, optimizations, and the suitability conditions under which the Nielsen transformation is applied.
Let e be an equation. By \(\textsf{trim}(e)\) we denote the equation obtained by removing the longest common prefix and suffix from both sides of e. For instance, the result of \(\textsf{trim}(abxzwb = abxnvb)\) is the equation \(zw = nv\). We lift \(\textsf{trim}\) to a set of equations as usual. In this section, we represent a conjunction of string equations by a set of equations \(\mathcal {E}\). We say that a set of equations \(\mathcal {E}\) is quadratic if each variable has at most two occurrences in \(\mathcal {E}\).
Nielsen rules. The transformation uses two meta-rules to generate a (Nielsen) graph. Nodes of the graph are sets of equations and the directed edges capture the effects of the applied rules. The rules are based on the following observation: if an equation \(xu = yv\) is satisfiable, then one of the following (not necessarily disjoint) cases occurs: (i) \(x = y\), meaning that \(u = v\); (ii) the variable x or y is \(\epsilon \); or (iii) \(\texttt {len} (x) \le \texttt {len} (y)\) (the other case is analogous); in that case we have \(y = xy'\) where \(y'\) is a fresh variable, and we can apply the substitution \(y/xy'\) in the equation, followed by the substitution \(y/y'\) to avoid generating isomorphic equations. The Nielsen rules then mimic cases (ii) and (iii), combined with an implicit handling of case (i). Formally, the two rules are given as
$$ (x \hookrightarrow \alpha x): \begin{array}{c} \mathcal {E} \\ \hline \textsf{trim}(\mathcal {E}[x/\alpha x]) \end{array} \qquad (x \hookrightarrow \epsilon ): \begin{array}{c} \mathcal {E} \\ \hline \textsf{trim}(\mathcal {E}[x/\epsilon ]) \end{array} $$
The rule \((x \hookrightarrow \alpha x)\), where \(\alpha \in \Sigma \cup \mathbb {X}\), rewrites all occurrences of x in \(\mathcal {E}\) by \(\alpha x\). Since this rule is applied when the left-hand side of an equation starts with x while the right-hand side starts with \(\alpha \), after trimming, the first occurrence of \(\alpha \) on the right-hand side is removed. The second rule, \(x \hookrightarrow \epsilon \), removes all occurrences of x from the system.
Nielsen graph. The Nielsen graph \(\mathcal {G}_\mathcal {E}\) of a set of equations \(\mathcal {E}\) is a (possibly infinite) graph induced by the Nielsen rules: its vertices are sets of equations and its edges are labeled by the applied Nielsen rules. The initial vertex is \(\mathcal {E}\). The system \(\mathcal {E}\) is satisfiable iff the vertex \(\{ \epsilon = \epsilon \}\) is reachable in \(\mathcal {G}_\mathcal {E}\). If \(\mathcal {E}\) is quadratic, \(\mathcal {G}_\mathcal {E}\) is finite [39].
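A toy satisfiability check based on the Nielsen graph can be sketched as a worklist search. The encoding is ours: sides of equations are tuples of symbols, upper-case letters stand for variables, nodes are frozensets of trimmed equations, and the empty node plays the role of the vertex \(\{ \epsilon = \epsilon \}\):

```python
def trim(eq):
    """Drop the longest common prefix and suffix of both sides."""
    l, r = eq
    while l and r and l[0] == r[0]:
        l, r = l[1:], r[1:]
    while l and r and l[-1] == r[-1]:
        l, r = l[:-1], r[:-1]
    return (l, r)

def subst(word, x, repl):
    """Replace every occurrence of variable x in word by repl."""
    out = []
    for s in word:
        out.extend(repl if s == x else (s,))
    return tuple(out)

def successors(eqs):
    """Apply the two Nielsen rules, driven by one equation of the system."""
    for (l, r) in eqs:
        for (u, v) in ((l, r), (r, l)):
            if u and u[0].isupper():              # the head is a variable x
                x = u[0]
                # rule x -> eps: remove all occurrences of x, then trim
                yield frozenset(trim((subst(a, x, ()), subst(b, x, ())))
                                for (a, b) in eqs)
                if v:                             # rule x -> alpha x
                    alpha = v[0]
                    yield frozenset(
                        trim((subst(a, x, (alpha, x)), subst(b, x, (alpha, x))))
                        for (a, b) in eqs)
        break  # trimmed heads differ, so one equation suffices to branch on

def nielsen_sat(eqs, limit=10_000):
    """Worklist search: satisfiable iff the empty system is reachable."""
    clean = lambda node: frozenset(e for e in node if e != ((), ()))
    start = clean(frozenset(trim(e) for e in eqs))
    seen, work = {start}, [start]
    while work and len(seen) < limit:
        node = work.pop()
        if not node:                              # the {eps = eps} vertex
            return True
        for succ in map(clean, successors(node)):
            if succ not in seen:
                seen.add(succ)
                work.append(succ)
    return False
```

For quadratic systems the explored graph is finite, so the search terminates; the `limit` guard is only a safety net for non-quadratic inputs.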
5.1 Preprocessing
The number of variables and literals of an equation directly affects the size of the corresponding Nielsen graph. To reduce the size, we use the LenSplit rule to split an equation into several ones according to prefixes of the same length:
$$ \textsc {LenSplit}: \begin{array}{c} \mathcal {E}\uplus \{ u_1u_2 = v_1v_2 \} \\ \hline \mathcal {E}\cup \{ u_1 = v_1, u_2 = v_2 \} \end{array}\ \texttt {len} (u_1) = \texttt {len} (v_1) $$
This preprocessing rule allows not only to generate smaller Nielsen graphs, but if the new equations do not share variables with the other ones, it is possible to divide \(\mathcal {E}\) into several independent sets (cf. Section 5.2). In Z3-Noodler, we approximate the length-equality check \(\texttt {len} (u)=\texttt {len} (v)\) by comparing the number of occurrences of each variable and comparing the total lengths of all literals occurring in u and v.
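The approximate length-equality check can be sketched as follows. Sides are sequences mixing upper-case variable names and literal strings; this convention, and the function name, are ours:

```python
from collections import Counter

def maybe_equal_length(u, v):
    """Approximate len(u) = len(v): lengths are surely equal when every
    variable occurs equally often on both sides and the literal parts
    have the same total length."""
    def profile(side):
        variables = Counter(s for s in side if s.isupper())
        literal_len = sum(len(s) for s in side if not s.isupper())
        return variables, literal_len
    return profile(u) == profile(v)
```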
5.2 Optimizations
As optimizations, we propose two rules pruning the generated state space of the Nielsen graph, focusing on cutting off nodes that would not lead to the satisfiable node \(\{ \epsilon = \epsilon \}\):
$$ \textsc {SymUnsat}: \begin{array}{c} \mathcal {E}\uplus \{ au = bv \} \\ \hline \emptyset \end{array}\ a \ne b,\quad \textsc {LenUnsat}: \begin{array}{c} \mathcal {E}\uplus \{ u = v \} \\ \hline \emptyset \end{array}\ \texttt {len} (u) \ne \texttt {len} (v) $$
The rule \(\textsc {SymUnsat}\) skips vertices containing trivially unsatisfiable equations that differ in the first symbol of each side, while \(\textsc {LenUnsat}\) is used to avoid vertices containing length-unsatisfiable equations. The check \(\texttt {len} (u) \ne \texttt {len} (v)\) is approximated in a similar way as in the LenSplit rule.
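Both pruning checks admit a direct sketch. For LenUnsat we use the sufficient condition illustrated by Example 2: one side has at most as many occurrences of every variable and a strictly shorter literal part, so its length is always strictly smaller. The encoding (tuples of symbols, upper-case variables) is ours:

```python
from collections import Counter

def sym_unsat(eq):
    """SymUnsat: after trimming, both sides start with distinct letters."""
    l, r = eq
    return bool(l and r and l[0].islower() and r[0].islower() and l[0] != r[0])

def len_unsat(eq):
    """LenUnsat approximation: len(u) != len(v) under every assignment."""
    u, v = eq
    def profile(side):
        counts = Counter(s for s in side if s.isupper())
        literal_len = sum(1 for s in side if not s.isupper())
        return counts, literal_len
    (cu, lu), (cv, lv) = profile(u), profile(v)
    def dominated(c1, l1, c2, l2):
        # every variable at most as frequent, literals strictly shorter
        keys = set(c1) | set(c2)
        return all(c1[x] <= c2[x] for x in keys) and l1 < l2
    return dominated(cu, lu, cv, lv) or dominated(cv, lv, cu, lu)
```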
In order to further reduce the state space, we split \(\mathcal {E}\) into several sets that do not share variables and we construct Nielsen graphs for them separately. For instance, we split \(\mathcal {E}= \{ x = yy, z = wa \}\) to \(\mathcal {E}_1 = \{ x = yy \}\), \(\mathcal {E}_2 = \{ z = wa \}\) and then check the satisfiability of \(\mathcal {E}_1\) and \(\mathcal {E}_2\) separately.
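Splitting \(\mathcal {E}\) into variable-disjoint subsystems is a connected-components computation; a union-find sketch (upper-case symbols as variables, an encoding of ours):

```python
def split_independent(eqs):
    """Group equations into subsystems sharing no variables."""
    parent = {}

    def find(x):
        while parent.setdefault(x, x) != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    def variables(eq):
        return {s for side in eq for s in side if s.isupper()}

    # merge the variables occurring together in one equation
    for eq in eqs:
        vs = sorted(variables(eq))
        for a, b in zip(vs, vs[1:]):
            union(a, b)
    # key each equation by the representative of its variables
    groups = {}
    for eq in eqs:
        vs = variables(eq)
        key = find(min(vs)) if vs else ('const', eq)
        groups.setdefault(key, set()).add(eq)
    return list(groups.values())
```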
Example 2
Let \(\{xaby = yxxbca\}\) be a vertex of a Nielsen graph. Since \(\texttt {len} (xaby) = \texttt {len} (x)+\texttt {len} (y)+2 < 2\cdot \texttt {len} (x)+\texttt {len} (y)+3 = \texttt {len} (yxxbca)\), we can skip the generation of successors of this vertex as it is length-unsatisfiable.
5.3 Length Constraints
If we have a length formula \(\psi \), we need to check whether \(\psi \) is satisfiable for a string solution generated by a constructed satisfiable Nielsen graph \(\mathcal {G}_\mathcal {E}\). In order to fit into Z3-Noodler's decision procedure handling, we need to infer a LIA formula describing the possible lengths of string solutions induced by \(\mathcal {G}_\mathcal {E}\). In Z3-Noodler, we utilize the approach of [35], converting the Nielsen graph into a counter abstraction. We then saturate the counter system with self-loops and enumerate particular flat paths that can be directly converted to a LIA formula. We consider the counter system to be an NFA with counter updates on edges, modifying the counter values during a run (we assume no guards).
Fig. 1. A run of the counter system corresponding to two Nielsen rules. The counter values obtained during this run are \(x=0\) and \(y=0\). The sequence of corresponding Nielsen rules, however, describes all string solutions where \(x = y\).
Counter system construction. For the Nielsen graph \(\mathcal {G}_\mathcal {E}\), we construct a counter system \(\mathcal {C}\) such that states of \(\mathcal {C}\) are vertices of \(\mathcal {G}_\mathcal {E}\) and transitions are reversed edges of \(\mathcal {G}_\mathcal {E}\), where the update transition action is obtained as follows: (i) from \(x \hookrightarrow a x\) where \(a \in \Sigma \) we get \(x := x + 1\), (ii) from \(x \hookrightarrow y x\) we get \(x := x + y\), and (iii) from \(x \hookrightarrow \epsilon \) we get \(x := 0\). Note that contrary to [35], where the counter system has the same direction of edges with subtracting semantics, we use \(\mathcal {C}\) with reversed edges and additive semantics to better fit the usual definition of counter systems. The initial state of \(\mathcal {C}\) is \(\{ \epsilon = \epsilon \}\) and the accepting state is \(\mathcal {E}\). Each run of \(\mathcal {C}\) corresponds to a satisfiable length assignment. Counters of a run of \(\mathcal {C}\), however, do not represent all string solutions, as shown in Fig. 1.
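The mapping from Nielsen rules to counter updates can be sketched directly. Rules are encoded as pairs `(x, replacement)` with `''` for \(\epsilon \) and upper-case variables, and updates act on an environment of counter values; this encoding is ours:

```python
def rule_to_update(rule):
    """Map a (reversed) Nielsen edge label to a counter update:
    x -> a x gives x := x + 1, x -> y x gives x := x + y, x -> eps gives x := 0.
    """
    x, repl = rule
    if repl == '':                       # case (iii): x -> eps
        return lambda env: {**env, x: 0}
    head = repl[0]
    if head.isupper():                   # case (ii): x -> y x, y a variable
        return lambda env: {**env, x: env[x] + env[head]}
    return lambda env: {**env, x: env[x] + 1}   # case (i): x -> a x
```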
Generating a LIA formula. In order to check whether there exists an accepting run of \(\mathcal {C}\) that satisfies the length formula \(\psi \), we need to construct a LIA formula \(\phi _{\mathcal {C}}\) describing all possible valuations of each counter on all accepting paths of \(\mathcal {C}\). In general, such a formula cannot be constructed [35]; therefore, we use an under-approximation enumerating extended runs of \(\mathcal {C}\). An extended run is a sequence of states occurring on a run of \(\mathcal {C}\), extended with the possibility of taking simple self-loops on states. Simple self-loops allow only update actions of the form \(x := x + \ell \) where \(\ell \in \mathbb {N}\) and x is a counter. For extended runs, we are able to construct a LIA formula precisely describing the counter values. In particular, for the LIA formula, we create a vector of fresh variables \(\boldsymbol{x}\) expressing the counter values after each step of the extended run. Then, we connect the variables using a conjunction of formulae describing the counter actions on each transition: (i) for a non-self-loop transition with the counter update \(x:=x+y\), the corresponding formula is \(\phi (\boldsymbol{x}', \boldsymbol{x}) \mathrel {\Leftrightarrow }\boldsymbol{x}'_i = \boldsymbol{x}_i + \boldsymbol{x}_j \wedge id _{\{i\}}(\boldsymbol{x}', \boldsymbol{x})\) where \(\boldsymbol{x}_i = x\), \(\boldsymbol{x}_j = y\), and \( id _I(\boldsymbol{x}', \boldsymbol{x}) \mathrel {\Leftrightarrow }\bigwedge _{i \notin I} \boldsymbol{x}_i = \boldsymbol{x}'_i\) (other updates are given analogously), and (ii) for a simple self-loop transition with the counter update \(x := x + \ell \), the formula is given as \(\phi (\boldsymbol{x}',\boldsymbol{x}) \mathrel {\Leftrightarrow } id _{\{i\}}(\boldsymbol{x}',\boldsymbol{x}) \wedge \boldsymbol{x}'_i = \boldsymbol{x}_i + k\cdot \ell \) where \(\boldsymbol{x}_i = x\) and k is a fresh LIA variable counting the number of times the self-loop was taken (we do not use existential quantification, as the value of k is important for model generation).
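The per-transition conjuncts can be illustrated by a small Python sketch that emits SMT-LIB-style strings. The step-indexed naming convention `v_step` and the tuple encoding of updates are our own illustrative assumptions, not the solver's internal representation; note that the self-loop multiplier k is emitted as a free variable, matching the remark that it is not existentially quantified.

```python
def step_formula(vars_, update, step):
    """Return an SMT-LIB-style conjunction linking the counter vector
    at `step` to the vector at `step + 1`.

    update is (illustratively) one of:
      ("add_var", x, y)        -- x := x + y
      ("self_loop", x, ell, k) -- x := x + k * ell, k a fresh LIA var
    """
    def at(v, s):
        return f"{v}_{s}"  # counter v after step s
    pre, post = step, step + 1
    if update[0] == "add_var":
        _, x, y = update
        conj = [f"(= {at(x, post)} (+ {at(x, pre)} {at(y, pre)}))"]
    else:  # simple self-loop taken k times
        _, x, ell, k = update
        conj = [f"(= {at(x, post)} (+ {at(x, pre)} (* {k} {ell})))"]
    # id_I: every counter other than x keeps its value
    conj += [f"(= {at(v, post)} {at(v, pre)})" for v in vars_ if v != x]
    return "(and " + " ".join(conj) + ")"
```

For example, `step_formula(["x", "y"], ("add_var", "x", "y"), 0)` yields `(and (= x_1 (+ x_0 y_0)) (= y_1 y_0))`.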
Enumeration of extended runs. Since there might be infinitely many extended runs of \(\mathcal {C}\), we use a heuristic enumeration algorithm preferring runs with self-loops, as these describe more behaviours. We mark states containing simple self-loops and enumerate extended runs that contain these states. For each such run, we construct the corresponding LIA formula and check whether it is satisfiable together with the length constraint \(\psi \).
Self-loop saturation. Since extended runs with self-loops yield weaker LIA formulae, we apply saturation of self-loops on the original counter system in order to generate new simple self-loops. In particular, we select cycles starting and ending in a state q whose counter updates all concern the same variable x and are of the form \(x := x + \ell _1 ,\dots , x := x + \ell _n\) where \(\ell _i \in \mathbb {N}\). For each such cycle, we introduce a new self-loop on q labeled by the counter update \(x := x + \sum _{i = 1}^n\ell _i\).
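The saturation step can be sketched as follows; we assume the candidate cycles have already been enumerated, and the data encoding (state name plus a list of per-transition updates) is illustrative.

```python
def saturate_self_loops(cycles):
    """Given cycles as (state q, [(var, ell_1), ..., (var, ell_n)]),
    keep only those whose updates all increment the SAME variable by a
    constant, and emit the saturated self-loop (q, var, sum of ells),
    i.e. the update x := x + (ell_1 + ... + ell_n)."""
    loops = []
    for q, updates in cycles:
        updated_vars = {v for v, _ in updates}
        if len(updated_vars) == 1 and all(isinstance(l, int) for _, l in updates):
            (x,) = updated_vars
            loops.append((q, x, sum(l for _, l in updates)))
    return loops
```

A cycle through `q1` with updates `x := x + 1` and `x := x + 2` thus yields the new self-loop `x := x + 3` on `q1`, while a cycle mixing updates of `x` and `y` is discarded.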
Example 3
Consider the string constraint \(xaby = yabx \wedge \texttt {len} (x)\ge 50\). An example of an enumerated extended run in the counter system is shown on the right. The red self-loop was added during the self-loop saturation. The LIA formula corresponding to this extended run, obtained by the procedure above, is then given as follows:
Since the formula \(\varphi (\texttt {len} (x), \texttt {len} (y)) \wedge \texttt {len} (x)\ge 50\) is satisfiable, so is the string constraint.
5.4 Implementation
During the first call of \(\textsf{nextSolution}\), the Nielsen graph together with the counter abstraction with saturated self-loops is constructed. Then, during each call of \(\textsf{nextSolution}\), another extended run containing self-loops is generated. If there are no more suitable extended runs left, \(\textsf{nextSolution}\) returns \(\textsf{undef}\) (the procedure is incomplete). The method \(\textsf{getLIA}\) then returns the length constraint corresponding to the current extended run. The \(\textsf{preprocess}\) function implements the rule LenSplit. The suitability-checking function \(\textsf{isSuitable}\) checks that the system contains only equations and length constraints and that it is quadratic. If there are no length constraints and the system is not chain-free [3], we also use this procedure (the Nielsen graph is in such cases usually smaller than the proof graph generated by the stabilization-based procedure).
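The quadratic-system check performed by \(\textsf{isSuitable}\) (every variable occurs at most twice in the whole system of equations) can be sketched as follows; the tagged-item encoding of equation sides is an illustrative assumption.

```python
def is_quadratic(equations):
    """equations: list of (lhs, rhs); each side is a list of
    ("var", name) or ("sym", letter) items. A system is quadratic
    iff every variable has at most two occurrences overall."""
    occurrences = {}
    for lhs, rhs in equations:
        for kind, name in lhs + rhs:
            if kind == "var":
                occurrences[name] = occurrences.get(name, 0) + 1
    return all(count <= 2 for count in occurrences.values())
```

The equation \(xaby = yabx\) from Example 3 is quadratic (x and y occur twice each), but adding any further equation mentioning x or y makes the system non-quadratic.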
5.5 Model Generation
The method \(\textsf{getLIA}\) generates the LIA formula describing the values of counters on the current extended run. For each transition of the extended run, we remember the Nielsen rule corresponding to the counter update. For self-loops added during self-loop saturation, we remember an extended rule of the form \(x \hookrightarrow wx\), where \(w\in \Sigma ^+\) (obtained by concatenating the symbols of the Nielsen rules used for the saturation). Moreover, for each simple self-loop in the extended run, we also remember the fresh LIA variable counting the number of times the self-loop was taken. For simplicity, for a state q of the extended run, we define \( sl (q) = q^{ sl }\) if q has a self-loop and \( sl (q) = q\) otherwise. The method \(\textsf{getModel}(\theta , x)\) then builds, during its first call, the model for all variables and afterwards just returns the computed value for the particular variable x. The model is constructed by following the current extended run, starting from the initial state \(q_0\) and the initial model \(\nu _{q_0} = \{ x\mapsto \epsilon \mid x\in \mathbb {X}\}\). For a transition \(q \rightarrow q'\), where \(q \ne q'\), there are three possibilities for the label. If it is labeled by the counter update \(x := 0\), we construct the next model as \(\nu _{q'} = \nu _{ sl (q)} \mathop {\vartriangleleft }\{ x \mapsto \epsilon \}\). For a transition labeled by the counter update \(x := x + y\), we construct the next model as \(\nu _{q'} = \nu _{ sl (q)} \mathop {\vartriangleleft }\{ x \mapsto \nu _{ sl (q)}(x)\cdot \nu _{ sl (q)}(y) \}\). Finally, for the update \(x := x + 1\), we construct the model as \(\nu _{q'} = \nu _{ sl (q)} \mathop {\vartriangleleft }\{ x \mapsto \alpha \cdot \nu _{ sl (q)}(x) \}\), where \(x\hookrightarrow \alpha x\) is the Nielsen rule corresponding to the transition.
For a self-loop \(q \rightarrow q\) that is labeled by the counter update \(x := x + \ell \), for \(\ell \in \mathbb {N}\), we construct the next model as \(\nu _{q^{ sl }} = \nu _{q} \mathop {\vartriangleleft }\{ x \mapsto w^{\theta (k)}\cdot \nu _{q}(x) \}\) where k is the LIA variable of the self-loop and \(x\hookrightarrow wx\) is the corresponding extended rule.
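The whole replay can be sketched in Python; the step encoding below is our own illustrative convention for the four kinds of updates, and `theta` stands for the LIA model restricted to the fresh self-loop variables.

```python
def replay_model(extended_run, variables, theta):
    """Rebuild a string model by replaying an extended run.

    Steps (illustrative encoding):
      ("erase", x)      -- x := 0       : x -> eps
      ("concat", x, y)  -- x := x + y   : x -> nu(x) . nu(y)
      ("prepend", x, a) -- x := x + 1   : x -> a . nu(x)
      ("loop", x, w, k) -- self-loop    : x -> w^{theta(k)} . nu(x)
    theta maps each fresh loop variable k to its LIA-model value.
    """
    nu = {x: "" for x in variables}  # initial model: everything -> eps
    for step in extended_run:
        kind, x = step[0], step[1]
        if kind == "erase":
            nu = {**nu, x: ""}
        elif kind == "concat":
            nu = {**nu, x: nu[x] + nu[step[2]]}
        elif kind == "prepend":
            nu = {**nu, x: step[2] + nu[x]}
        elif kind == "loop":
            w, k = step[2], step[3]
            nu = {**nu, x: w * theta[k] + nu[x]}
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return nu
```

For instance, prepending `a` to x, taking a saturated self-loop \(x \hookrightarrow abx\) twice, and then concatenating x onto y yields the model `{"x": "ababa", "y": "ababa"}`.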
6 Length-based Decision Procedure
Even though the stabilization-based procedure can be fast, it may suffer from an explosion in the number of alignments, especially for large systems of equations with many unrestricted variables and literals. To deal with such formulae, we propose a length-based decision procedure, which symbolically encodes all possible alignments using LIA formulae. Solving the string formula is hence reduced to solving a LIA formula, which might be easier.
Block-acyclic string constraints.
$$ \bigwedge _{1\le i\le n} x = R_i $$
An equational block (or just a block for short) of a variable x is a conjunction of string equations of the form shown on the right, where \(R_i \in (\mathbb {X}\cup \Sigma )^*\), each variable of \(R_i\) has at most one occurrence in \(R_i\), \(x\notin \texttt{Var}(R_i)\), and \(\texttt{Var}(R_i) \cap \texttt{Var}(R_j) = \emptyset \) for each \(i\ne j\).
Fig. 2.
A block-graph
A conjunction of equational blocks is called a block string constraint. We abuse the notation and treat \(\mathcal {E}_x\) as a set of atoms of the conjunction. A directed graph \(G = (\{ \mathcal {E}_x \mid x \in \mathbb {X}\}, \{ (\mathcal {E}_x, \mathcal {E}_y) \mid x \ne y \wedge (y \in \texttt{Var}(\mathcal {E}_x) \vee (\texttt{Var}(\mathcal {E}_x) \cap \texttt{Var}(\mathcal {E}_y)) \setminus \{ x,y \} \ne \emptyset ) \})\) is called the block-graph of the block string constraint. The block-graph connects blocks such that the values of variables of adjacent blocks affect the value of the block variable of the predecessor. A string constraint is called block-acyclic if the corresponding block-graph is acyclic. As an example, Fig. 2 shows the block-graph of a block-acyclic constraint.
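The block-graph construction and the acyclicity check can be sketched in Python. Representing each block \(\mathcal {E}_x\) by the set of variables on its right-hand sides is an illustrative simplification of \(\texttt{Var}(\mathcal {E}_x)\).

```python
def block_graph(blocks):
    """blocks: dict mapping each block variable x to the set of
    variables occurring on the right-hand sides of its block E_x.
    Edge (E_x, E_y) when y occurs in E_x, or when E_x and E_y share
    a variable other than x and y (this case is symmetric, so shared
    variables induce a cycle, cf. Section 6.2)."""
    edges = set()
    for x, vars_x in blocks.items():
        for y, vars_y in blocks.items():
            if x != y and (y in vars_x or (vars_x & vars_y) - {x, y}):
                edges.add((x, y))
    return edges

def is_block_acyclic(blocks):
    """DFS-based cycle detection on the block-graph."""
    edges = block_graph(blocks)
    succ = {x: {b for a, b in edges if a == x} for x in blocks}
    WHITE, GREY, BLACK = 0, 1, 2
    color = {x: WHITE for x in blocks}
    def has_cycle(u):
        color[u] = GREY
        for v in succ[u]:
            if color[v] == GREY or (color[v] == WHITE and has_cycle(v)):
                return True
        color[u] = BLACK
        return False
    return not any(color[x] == WHITE and has_cycle(x) for x in blocks)
```

For example, a block for x mentioning y is acyclic, whereas two blocks sharing a non-block variable z form a cycle and are rejected.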
6.1 Decision Procedure for Block-acyclic Constraints
In this section, we propose a decision procedure for block-acyclic string constraints extended with length constraints based on a translation of the string system to a LIA formula. The LIA formula symbolically encodes alignments of literals occurring in the system. Since block-acyclic constraints do not contain repetitions of variables (except the block variables connecting the blocks), every compatible alignment of literals forms a solution as the unrestricted variables adapt to the alignment of variables.
Let \(\mathcal {E}_x\) be an equational block. We need to find the positions of the different occurrences of its literals within the word represented by x, so that they do not clash with each other, nor with the words represented by the variables y, z, u, and w occurring in \(\mathcal {E}_x\). To encode this, we use fresh integer variables \(B^{x}_{\ell }\), etc., that represent the starting position of each literal/variable within the word x. Then we just need to encode three things: (i) the literals/variables follow each other in the equation, (ii) the literals are not mismatched (an alignment forcing two different symbols, say a and d, to the same position is not valid), and (iii) literals occurring inside the variables y, z, u, and w follow the same rules. For the last one, we need to also define the starting positions of literals that occur within those variables. For this reason, we use \(\textsf{lit}(\mathcal {E}_x)\) to denote the occurrences of literals in the block \(\mathcal {E}_x\) and, for a block graph \(G = (V,E)\), we define \(\textsf{litall}_G(\mathcal {E}_x)\) as the union of all \(\textsf{lit}(\mathcal {E}_y)\) such that there is a path from \(\mathcal {E}_x\) to \(\mathcal {E}_y\) in G.
Algorithm 2:
Encoding alignments of \(\mathcal {E}\)
The construction of a LIA formula for a block graph G is given in Algorithm 2. It describes the construction of the LIA formula \(\varphi \) compactly encoding all alignments of literals occurring in the string constraint. The formula \(\psi _{ pos }\) introduces the equational length constraint for each equation of the block \(\mathcal {E}_x\), and with an auxiliary subformula it sets the beginnings of each literal/variable in the correct order for each equation.
Fig. 3.
Schematic example of encoding between \(\mathcal {E}_x\) and \(\mathcal {E}_y\).
If the equations of \(\mathcal {E}_x\) contain a variable y of the block \(\mathcal {E}_y\) (hence there is an edge \((\mathcal {E}_x,\mathcal {E}_y)\) in E), it is necessary to propagate the literals of \(\mathcal {E}_y\) also to the block \(\mathcal {E}_x\): literals occurring in \(\mathcal {E}_y\) (meaning that they occur in a possible model of y) transitively appear in a possible model of x through the equation. Since the literals of \(\mathcal {E}_y\) may occur in the model of x only at positions determined by the occurrence of y in the block \(\mathcal {E}_x\), we generate the formula \(\psi _{ beg }\) expressing that the literals occurring in y are shifted by \(B^{x}_{y}\). See Fig. 3 for a schematic example.
The last subformula \(\psi _{ match }\) expresses that two literals \(\ell _1\) and \(\ell _2\) are completely misaligned (\(\textsf{mis}_x\)), or they are aligned only in a compatible way (\(\textsf{comp}_x\)). Formally, these predicates are defined as
We abuse the notation and use \(\textsf{mis}_x(\ell _1, \ell _2)\) to denote \(\textsf{mis}_x(\ell _1, B^{x}_{\ell _2}, B^{x}_{\ell _2} + \texttt {len} (\ell _2))\). The formula \(\textsf{mis}_x(\ell _1, s, e)\) expresses that the literal \(\ell _1\) does not intersect the interval (s, e). The set \(\textsf{align}(\ell _1, \ell _2)\) contains the matching shifts of \(\ell _1\) relative to \(\ell _2\). Formally, \(\textsf{align}(\ell _1, \ell _2) = \{ i \mid \exists u:\ell _2 = \ell _1^{i{:}}u \} \cup \{ -i \mid \exists u:\ell _1 = \ell _2^{i{:}}u \}\) where for a literal \(\ell = a_0\cdots a_n\), we use \(\ell ^{i{:}}\) to denote the string \(a_i\cdots a_{n}\).
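The set \(\textsf{align}\) is a direct prefix check and can be computed as follows (a straightforward sketch of the definition above, using Python slicing for \(\ell ^{i{:}}\)):

```python
def align(l1, l2):
    """Matching shifts of l1 relative to l2:
    { i | l1[i:] is a prefix of l2 } union { -i | l2[i:] is a prefix of l1 }.
    Note that i = len(l) always qualifies (empty suffix)."""
    shifts = {i for i in range(len(l1) + 1) if l2.startswith(l1[i:])}
    shifts |= {-i for i in range(len(l2) + 1) if l1.startswith(l2[i:])}
    return shifts
```

For instance, `align("abab", "ab")` is `{0, 2, 4, -2}`: shifting `abab` by 2 or 0 overlaps compatibly with `ab`, while 4 and -2 place the literals with no overlap at all.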
Because satisfiability checking of quantifier-free LIA formulae (the occurrences of \(\texttt {len} (x)\) can be replaced with a pure integer variable) is in \(\textbf{NP}\), it is easy to see that the following lemma holds.
Lemma 1
Satisfiability checking of block-acyclic string constraints is in \(\textbf{NP}\).
6.2 Underapproximation of a Shared Variable
In this section, we generalize the length-based decision procedure to a block string fragment containing two blocks sharing a single variable that is different from the block variables. In particular, in the following text, we assume two blocks \(\mathcal {E}_x\) and \(\mathcal {E}_y\) s.t. \(y \in \texttt{Var}(\mathcal {E}_x)\) and \(\texttt{Var}(\mathcal {E}_x) \cap \texttt{Var}(\mathcal {E}_y) \supseteq \{ z \}\), meaning that z is a variable shared among the blocks \(\mathcal {E}_x\) and \(\mathcal {E}_y\) (the block-graph has a cycle between \(\mathcal {E}_x\) and \(\mathcal {E}_y\)). We further assume that z does not have its own block (in general, there might be more shared variables, but only between two blocks).
Fig. 4.
A schematic example of a shared variable underapproximation of the string constraint \(x = \ell _1zu \wedge x = w\ell _2 \wedge y = \ell _3 z \wedge y = \ell _5 v\). Positions of z covered by parts of literals \(\ell _2\) and \(\ell _5\) are marked by hatching. These positions are excluded for possible alignments of other literals.
For instance, consider the string constraint \( x = a y z~\wedge ~x = ab~\wedge ~y = b z\) with the shared variable z. For this system, we underapproximate the solution by ensuring that literals occurring inside a potential model of z are all misaligned with all other possible literals (since z has the same value among occurrences, parts of literals placed inside z are propagated among different occurrences of z). This yields an underapproximation since some of these completely excluded literals might be aligned with literals occurring inside z. See Fig. 4 for a schematic example. Formally, for blocks \(\mathcal {E}_x\) and \(\mathcal {E}_y\) sharing the variable z, the formula excluding alignments of literals inside occurrences of z is given as
where \(\overline{G}\) is the block graph obtained from G by removing edges induced by the shared variables, \(\ell \in _y z \mathrel {\Leftrightarrow }\lnot \textsf{mis}_y(\ell , B^{z}_{y}, B^{z}_{y} + \texttt {len} (y))\) and \(\textsf{out}_x^y(\ell ', z, \ell )\) is the formula expressing that \(\ell '\) in the block x is placed on a different position than the propagated \(\ell \) occurring inside z in y. Formally, \( \textsf{out}_x^y(\ell ', z, \ell ) \mathrel {\Leftrightarrow }\textsf{mis}_x\left( \ell ', B^{x}_{z} + j, B^{x}_{z} + k\right) \), where \(j = B^{z}_{y} - \max (B^{z}_{y}, B^{\ell }_{y})\) and \(k = B^{z}_{y} - \min (B^{z}_{y} + \texttt {len} (z), B^{\ell }_{y} + \texttt {len} (\ell ))\). The formula \(\psi _{\textsf{excl}}^{x,y}(z)\) is then conjoined with the formula obtained from Algorithm 2.
6.3 Implementation
Since the length-based procedure generates a LIA formula describing all models of the input string constraint, the method \(\textsf{nextSolution}\) generates the formula together with the precision, which are then returned by \(\textsf{getLIA}\). The precision is set to \(\textsf{underapprox}\) if an underapproximation preprocessing rule was used or the approach for a shared variable was applied. Further calls of \(\textsf{nextSolution}\) then return \(\bot \). On top of that, during the first call of \(\textsf{nextSolution}\), it checks whether the formula obtained by preprocessing is block-acyclic (possibly with a shared variable). If not, \(\textsf{nextSolution}\) returns \(\textsf{unknown}\) (the length-based procedure is skipped). The \(\textsf{preprocess}\) method utilizes the same preprocessing rules as the stabilization-based procedure, except for rules introducing complex regular constraints, which are avoided. The method \(\textsf{isSuitable}\) checks whether the input constraints contain only equations and length constraints.
6.4 Model Generation
For the length-based procedure, the model of each variable (provided that the generated LIA formula is satisfiable) is determined by the positions of literals. For each variable x, we allocate a string skeleton of length \(\theta (\texttt {len} (x))\). Fields of the skeleton are filled with symbols from literals occurring at the corresponding positions in a block. Starting from a possible shared variable z, we take the blocks containing occurrences of z and fill the skeleton fields corresponding to symbols of literals involving the value of z (given by the values of the begin variables for literals of the block). Then, we iteratively process blocks that do not contain block variables of still unprocessed blocks. For the block variable, we fill the fields of the skeleton given by the positions of literals in the block and the already filled skeletons of other variables. Then, we propagate the value of the block variable to the variables occurring inside the block. After each block is processed, we fill the fields of each variable that remained empty with the symbol \(\texttt {a}\).
7 Stabilization-based procedure
The main and most general decision procedure is the stabilization-based procedure, applicable to any constraint. It was first introduced in [15] for handling word equations with regular constraints and was later extended to handle length constraints [25] and string-integer conversions [29]. The procedure follows the framework from Section 3.3, with the preprocessing rules described in [25]; the \(\textsf{nextSolution}\) function follows the procedure from [25], and \(\textsf{getLIA}\) is based on [25, 29]. As the procedure is explained in these papers, we do not give details here; instead, we focus on model generation, which was not done before.
7.1 Model generation
The model generation is implemented recursively (i.e., the model of a variable might be constructed from models of different variables) and a single top-level call of \(\textsf{getModel}\) may result in computing models of more variables. In such a case, we memoize the results and return their values directly if they are required.
After the stabilization-based procedure finds a solution, it ends with: (i) the set of all variables divided into three disjoint sets: \(\mathbb {X}_I\) are variables whose length or string-integer conversion value is important, \(\mathbb {X}_N\) are variables for which these values are not important, and a set of fresh variables \(\mathbb {X}_F\), (ii) the set of inclusions I that contain only variables from \(\mathbb {X}_N \cup \mathbb {X}_F\), (iii) the substitution map \(\sigma :\mathbb {X}_I \rightarrow \mathbb {X}_F^*\) that substitutes variables from \(\mathbb {X}_I\) with a concatenation of fresh variables, and (iv) the language assignment \(\textsf{Lang}:(\mathbb {X}_N \cup \mathbb {X}_F) \rightarrow 2^{\Sigma ^*}\) such that the languages of fresh variables are precise, i.e., for each combination of words from the languages of fresh variables, there is a selection of words for variables from \(\mathbb {X}_N\) such that each inclusion from I holds.
At the start of the model generation, during the very first call of \(\textsf{getModel}\), we restrict the language assignments of fresh variables so that they follow the lengths/conversion values given by the LIA model \(\theta \). In particular, for \(y \in \mathbb {X}_F\), we restrict the language \(\textsf{Lang}(y)\) only to the words of the length of \(\theta (\texttt {len} (y))\) and for conversion values, we restrict it to the singleton language containing exactly the string converted to that value.
The method \(\textsf{getModel}(\theta , x)\) used to get a model of x then first checks whether \(x \in \mathbb {X}_I\). If so, it recursively calls \(\textsf{getModel}(\theta , x_i)\) on all fresh variables from \(\sigma (x) = x_1 \cdots x_n\) and then constructs the model of x as their concatenation. Because the values of the fresh variables are restricted by the LIA model \(\theta \), the value of x will be correctly restricted too. If \(x \not \in \mathbb {X}_I\) and the variable does not occur on the right-hand side of any inclusion of I, we just return some word from \(\textsf{Lang}(x)\).
In the last case, when x occurs on the right-hand side of some inclusion of I with the left-hand side \(y_1 \cdots y_n\), we first recursively get models of all variables on the left-hand side (by calling \(\textsf{getModel}(\theta , y_i)\) for each \(1\le i \le n\)). The concatenated models of the left-hand side yield the word \(w = \textsf{getModel}(\theta , y_1)\cdots \textsf{getModel}(\theta , y_n)\). Subsequently, we find models of all variables on the right-hand side (including x) in a way that their concatenation matches w. This is implemented using a backtracking algorithm that reads the word w and checks whether w can be split into subwords (each subword corresponding to a variable of the right-hand side) that belong to the languages of the particular variables. Note that this algorithm works only if the variable x occurs at most once on the right-hand side of any inclusion and, furthermore, there is no cycle (e.g., for an inclusion in which x occurs on both sides, we cannot get a model for x as shown above). For such cases we would need to use the algorithm from the proof of [16, Theorem 5], which is currently not implemented in Z3-Noodler. However, as experiments show, this has almost no practical impact, as the stabilization-based procedure usually does not finish for such cases anyway.
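The backtracking split of w can be sketched as follows; representing each right-hand-side variable's (length-restricted) language by a membership predicate is an illustrative simplification of the automata-based languages used in the solver.

```python
def split_word(w, languages):
    """Backtracking search for a split of w into consecutive subwords,
    one per right-hand-side variable, each accepted by that variable's
    language (given as a membership predicate). Returns the list of
    subwords, or None if no split exists."""
    def search(pos, i, acc):
        if i == len(languages):
            # all variables placed; succeed iff w is fully consumed
            return acc if pos == len(w) else None
        for end in range(pos, len(w) + 1):
            piece = w[pos:end]
            if languages[i](piece):
                result = search(end, i + 1, acc + [piece])
                if result is not None:
                    return result
        return None  # backtrack
    return search(0, 0, [])
```

For example, splitting `aabb` between a variable accepting only `a`-words and one accepting non-empty `b`-words yields `["aa", "bb"]`.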
Table 1.
The impact of decision procedures on solving for each benchmark set/category for solved formulae. The second column shows the number of times a string solver was called within Z3's DPLL(T) procedure. The next columns show how many times (relative to the number of calls) each decision procedure was called and how many of these calls it solved.
8 Experiments
We implemented the presented decision procedures and model generation in version 1.3 of Z3-Noodler and evaluated them on all string benchmarks from SMT-LIB [7]. We split the benchmarks into three categories: In Regex we gather the benchmark sets that contain mostly regular and length constraints: AutomatArk [12], Denghang, Redos, StringFuzz [17], and Sygus-qgen. The benchmark sets Kaluza [33, 46], Kepler [30], Norn [1, 2], Omark, Slent [52], Slog [53], Webapp, and Woorpje [28], consisting mostly of word equations and length constraints with a small number of more complex constraints, form the Equations category. The last category, Predicates, contains the benchmark sets FullStrInt, LeetCode, PyEx [43], StrSmallRw [41], and Transducer+ [22], which heavily feature more complex string constraints. The experiments were executed on a workstation with an AMD Ryzen 5 5600G CPU @ 3.8 GHz with 100 GiB of RAM running Ubuntu 22.04.4. The timeout was set to 120 s and the memory limit to 8 GiB.
Table 2.
Numbers of solved instances and the time (in seconds) needed to solve them for each tool and benchmark category. The versions of tools with \(\mathcal {M}\) were run with model generation turned on. The total number of formulae for each benchmark category is given in parentheses.
Procedures comparison. Table 1 shows the impact of the various decision procedures within Z3-Noodler on solving string constraints. We compare the number of times a particular procedure was used for solving. Note that the total number of calls might differ from the number of formulae in the particular benchmark, as some formulae might have been solved directly by the theory rewriter or the initial phase (and therefore no decision procedure was called), while some formulae might have resulted in multiple calls. The table shows that the decision procedure for regex constraints is (unsurprisingly) strong on the Regex benchmark. Furthermore, the length-based procedure is quite strong on Equations (and on regex-heavy benchmarks due to the StringFuzz benchmark set, as it contains parts with pure equations). Even though the impact of the Nielsen transformation seems low, without it, Z3-Noodler cannot solve most of the formulae the procedure solves. Note that for Equations, there are 44 calls that were not solved by any presented procedure. Instead, they were solved by a simple procedure that is used for benchmarks containing length constraints and equations with exactly one symbol. This procedure just asks whether the lengths of both sides of each equation are the same (as there is only one symbol, this is the same as asking whether the equation holds). All in all, Z3-Noodler with all procedures enabled takes 3,583 seconds less to solve 944 formulae more than Z3-Noodler with only the stabilization-based procedure, which is a significant improvement (see the tables in the appendix for more detailed results). From Table 1 it is also evident that the numbers of called and solved are close to each other, i.e., the suitability check can precisely identify a suitable decision procedure.
Fig. 5.
Comparison with cvc 5 and Z3. Times are in seconds, axes are logarithmic. Dashed lines are timeouts. Colours distinguish the groups Regex, Equations, and Predicates.
Comparison with other tools. In Table 2 and Fig. 5, we compare Z3-Noodler with cvc 5 (version 1.2.0) and Z3 (version 4.13.0). The table also shows the impact of model generation. From the results, we can see that Z3-Noodler is significantly better (in both the number of solved instances and the solving time) than the other tools on Regex. Furthermore, it is better than the other tools on Equations, while slightly worse than cvc 5 on Predicates. Comparing the impact of model generation, we can see that for all three tools it is not significant: there is usually a slight slowdown with a few fewer solved instances. All in all, even with model generation, Z3-Noodler solves the largest number of instances in the least time.
Acknowledgments
This work has been supported by the Czech Ministry of Education, Youth and Sports ERC.CZ project LL1908, the Czech Science Foundation project 23-07565S, and the FIT BUT internal project FIT-S-23-8151.
The work of David Chocholatý, Brno Ph.D. Talent Scholarship Holder, is Funded by the Brno City Municipality.
This work has been executed under the project VASSAL: “Verification and Analysis for Safety and Security of Applications in Life” funded by the European Union under Horizon Europe WIDERA Coordination and Support Action/Grant Agreement No. 101160022.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.