This chapter delves into the intricate problem of verifying memory properties of complex programs, in particular those involving unbounded memory regions. It addresses the limitations of current state-of-the-art approaches in handling pointers and memory manipulations and proposes a novel symbolic loop-folding technique. This method introduces loop counters to express more properties, increasing the precision of symbolic reasoning. The chapter outlines a comprehensive framework for abstracting memory block manipulations and describes the syntax and semantics of a toy imperative language as well as the abstract expressions used to represent memory states. It discusses the abstraction of sets of memory states, abstract transformers, and the concretization of universally quantified properties. The chapter also presents case studies, including matrix transposition, the memory bootstrap of an x86 operating system, and an array look-up procedure, which demonstrate the practical applicability and effectiveness of the proposed approach. In addition, it provides implementation details, discusses the current limitations of the framework, and offers a thorough examination of the challenges and potential improvements in verifying memory properties.
AI-Generated
This summary of the chapter content was generated with the assistance of AI.
As increasingly complex programs, such as operating systems, are successfully formally verified, the memory properties we aim to prove become progressively more subtle. The framework of abstract interpretation has been prolific at providing techniques to prove memory properties. However, the state-of-the-art approaches still lack precision when manipulating pointers, in particular when the memory regions are unbounded. As an example, during the bootstrap of the expanded memory, linking the physical memory to the virtual memory requires proving relations between those addresses, which requires extending the expressivity of symbolic reasoning.
In particular, as in the Gauge domain [19], our abstraction introduces the values of loop counters to express more properties. This improves accuracy while scaling to the analysis of large programs. The analysis of memory block manipulations requires aggregating the effect of the assignments and tests of the previous loop iterations. For this, we go further by universally quantifying abstract properties over the preceding values of each loop counter.
We present in this paper a symbolic approach that keeps track of the history of the memory manipulations performed inside loops.
Related Work. The content of arrays has been analyzed using several methods with different degrees of precision, scalability, and automation. A first method consists in smashing all the cells of the array into a single abstract value [3]. In that case, the abstract value may represent more than one element, so updates are weak. To improve precision, new techniques have been introduced, such as dimension summarization [7] and dynamic partitioning of arrays [6, 8, 10], which express relational properties about segments of arrays. More recently, the analysis of matrix manipulations considered in [11] was one of the first to handle multidimensional arrays. However, those techniques do not allow expressing relationships between values and offsets.
This new abstract domain builds on two ideas. The first comes from the Gauge domain [19], which computes, for each variable, bounds that are affine forms in the loop counters. The second is that, by combining such reasoning with symbolic methods [4, 16], it becomes possible to express properties about parts of the memory unknown at analysis time.
Shape analysis is broadly used to reason about memory manipulation [13]. However, it heavily relies on inductive type definitions specified in separation logic [18]. Whereas our abstraction provides several levels of precision for describing memory states, shape analysis tends to lose too much information when the inductive types are not satisfied or cannot be proved.
Outline. The paper is organized as follows. In Sect. 2, we give the syntax and the semantics of the toy imperative and untyped language that we use throughout this paper. Symbolic manipulations of expressions often require additional syntactic constructors to abstract parts of the expressions and to account for the information computed by the other parts of the analysis. In Sect. 3, we introduce the syntax and the semantics of the abstract expressions that are handled by our abstract domain. In Sect. 4, we define the abstraction of sets of memory states, while in Sect. 5, we describe the abstract transformers that lift the computation of the concrete semantics into the abstract. In Sect. 6, we describe the result of our analysis on three case studies (namely, a matrix transposition, the bootstrap of the paging in an x86 operating system, and an array look-up procedure). In Sect. 7, we discuss the implementation details and the current limitations of our framework. We conclude in Sect. 8.
2 Syntax & Semantics
2.1 Syntax of a toy imperative and untyped language
In this section we present a toy imperative and untyped language.
As presented in Fig. 1, an expression is either a constant in \(\mathbb {Z}\), the address of a variable that belongs to the finite set of variables \(\mathcal {V}\), the value located at the address pointed by another expression, or a sum/difference/product of two expressions.
The statements are labelled by distinct program labels that follow a total order \(\le _{\mathcal {L}}\) corresponding to their order of appearance in the program. Statements can be either a no-op, a sequence of statements, an assignment, or a conditional branching/loop whose condition is a comparison of two expressions. In the case of the assignment, the left-value is the dereference of the address whose content will be modified by the assignment, and the right-value is either an expression or the dynamic allocation of a memory block whose size is given by the expression passed as a parameter.
In addition, we introduce the same syntactic sugar as in the C language to ease the understanding of programs: the value of a variable is the value located at its address, and accessing the \(e_2\)th element of \(e_1\) corresponds to dereferencing the address \(e_1 + e_2\).
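For readers more familiar with C, both sugars mirror equivalences that already hold in C itself; the snippet below (illustrative only, not taken from the paper) makes them explicit.

```c
/* Illustrative C analogue (not from the paper) of the two sugars above. */
void sugar_demo(int *a, int i) {
    int x = 7;
    int v = x;        /* reading a variable = reading the value stored at its address, *(&x) */
    int w = a[i];     /* in C as well, a[i] is exactly *(a + i) */
    (void)v;          /* silence unused-variable warnings */
    (void)w;
}
```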
Fig. 2. Representation of a simple C function’s body in the language of the analyzer
To present the language in more detail, we depict on the left of Fig. 2 a simple C function’s body that fills an array so that each cell contains its own index. In the middle of the figure, we show an equivalent program in the language that we consider. Note that the allocation statement allocates an array whose elements are in \(\mathbb {Z}\); because of this idealization, we do not have to multiply the requested size by the number of bytes on which each element is represented. Then, on the right part of the figure, we show the same program with the syntactic sugar described previously.
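Since the content of Fig. 2 is not reproduced here, the following plain-C sketch shows the kind of function body being discussed; the name fill and the use of int cells are assumptions of this illustration, whereas the toy language directly allocates cells holding values in \(\mathbb {Z}\).

```c
#include <stdlib.h>

/* Hypothetical reconstruction, in plain C, of the kind of function body shown
   in Fig. 2: allocate an array and store in each cell its own index. */
void fill(int size) {
    int *a = malloc(size * sizeof(int));  /* the toy malloc takes a number of cells */
    int i = 0;
    while (i < size) {
        a[i] = i;                         /* sugar for *(a + i) = i */
        i = i + 1;
    }
    free(a);
}
```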
2.2 Concrete semantics of the language
We now describe the concrete semantics of our language, that is a mathematical description of its behaviors.
This semantics is non-standard, because we use loop counters as ghost variables (i.e., variables that do not appear in the program) to track how many times each loop body has been evaluated since the most recent evaluation of the loop’s entry point. While these counters do not influence the behaviors described by the concrete semantics, they play a crucial role in the abstract semantics, where they allow expressing properties universally quantified over loop counter values. In particular, loop counters are useful to express inductive invariants and to prove the soundness of the analysis.
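The following annotated C loop (the notation and the name counter_l are mine, not the paper’s) illustrates how such a ghost counter evolves.

```c
/* Illustration: the ghost counter of a loop counts how many times its body has
   been evaluated since the loop was last entered. `counter_l` is a hypothetical
   name for the ghost counter of the loop below. */
void ghost_counter_demo(int *a, int n) {
    int i = 0;
                          /* counter_l is undefined here (loop not yet entered)      */
    while (i < n) {       /* on entry to the loop: counter_l := 0                    */
        a[i] = i;         /* body evaluated with the current value of counter_l      */
        i = i + 1;
    }                     /* at the end of each iteration: counter_l := counter_l + 1 */
                          /* on exit: counter_l is removed from the domain of phi     */
}
```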
A memory state is a triple of functions \((\rho ,\sigma ,\phi ) \in \mathcal {R}\mathop {=}\limits ^{{\scriptscriptstyle \textrm{def}}}\,S_{\rho } \times S_{\sigma } \times S_{\phi }\).
The first component \(\rho \) is a map from addresses in \(\mathbb {N}\) to the corresponding value in \(\mathbb {Z}\). It corresponds to considering that the memory is an infinite array with naturals as addresses, and integers as values. Then, \(S_{\rho } \mathop {=}\limits ^{{\scriptscriptstyle \textrm{def}}}\,\mathbb {N}\rightarrow \mathbb {Z}\). Let us skip the second element for now.
The third component \(\phi \) associates each loop counter with its value, that is, the number of times the corresponding loop has been executed. Then \(S_{\phi } : \Lambda \rightharpoonup \mathbb {N}\). Notice that the function is partial, because before entering a loop and after leaving it, the corresponding loop counter does not exist in the domain of \(\phi \).
Finally, the second component \(\sigma \) maps each variable, and each dynamically allocated block, to its corresponding address in \(\rho \) and its size that represents the number of cells that are reserved to it starting from this address. As dynamically allocated blocks are identified by the program label of the associated \(\textbf{malloc}(\cdot )\) statement and the value of the loop counters at that time, we get \(S_{\sigma } \mathop {=}\limits ^{{\scriptscriptstyle \textrm{def}}}\,(\mathcal {V}\cup (\mathcal {L}\times S_\phi )) \rightharpoonup (\mathbb {N}\times \mathbb {N})\).
The initial set of memory states \(R_0\), given in Def. 1, is made of:
\(\rho _0\) that is any function from \(\mathbb {N}\) to \(\mathbb {Z}\),
\(\sigma _0\) that maps only the variables to their corresponding memory blocks, each consisting of a non-deterministic address in \(\mathbb {N}\) and a constant size of 1, such that the same address is not given to two different variables,
\(\phi _0\) that is the partial function with empty domain. Such functions will be denoted \(\varnothing \) in the rest of the article.
Fig. 3. Semantics of the language as a denotational semantics
We now introduce the concrete semantics in Fig. 3. We first define the concrete semantics of expressions. It maps a memory state to the evaluation of the expression in that memory state, that is, an element of \(\mathbb {Z}\), provided the dereferences are legal. The evaluation of a constant just returns the constant; the evaluation of the address of a variable V returns the address stored as the first element of \(\sigma (V)\); dereferencing an expression corresponds to evaluating the address given by the expression in parameter and retrieving the value at this address in the memory map \(\rho \). Finally, evaluating binary arithmetic operators consists in evaluating both operands and applying the operator on the evaluations.
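Since the equations of Fig. 3 are not reproduced here, the following C sketch (my own, simplified rendering) gives the flavor of this evaluation function: rho is modeled as a flat array of values indexed by addresses, var_addr plays the role of the address component of \(\sigma \), and illegal dereferences make the evaluation fail.

```c
#include <stddef.h>

/* Rough sketch (not the paper's formalization) of the concrete evaluation of
   expressions: constants, addresses of variables, dereferences, and binary
   arithmetic. Evaluation fails, by setting *ok to 0, on illegal dereferences. */

enum expr_kind { CST, ADDR_OF, DEREF, ADD, SUB, MUL };

struct expr {
    enum expr_kind kind;
    long cst;                 /* used by CST                      */
    int var;                  /* used by ADDR_OF (variable index) */
    struct expr *l, *r;       /* DEREF uses l only                */
};

long eval(const struct expr *e, const long *rho, size_t mem_size,
          const long *var_addr, int *ok)
{
    long a, b;
    switch (e->kind) {
    case CST:     return e->cst;
    case ADDR_OF: return var_addr[e->var];
    case DEREF:
        a = eval(e->l, rho, mem_size, var_addr, ok);
        if (!*ok || a < 0 || (size_t)a >= mem_size) { *ok = 0; return 0; }
        return rho[a];
    default:      /* ADD, SUB, MUL */
        a = eval(e->l, rho, mem_size, var_addr, ok);
        b = eval(e->r, rho, mem_size, var_addr, ok);
        if (!*ok) return 0;
        return e->kind == ADD ? a + b : (e->kind == SUB ? a - b : a * b);
    }
}
```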
Then, we define the concrete semantics of statements, which maps each set of memory states to another one on which the transformations corresponding to the statements have been performed.
2.3 Consistency of the semantics
The consistency of the semantics relies on the preservation of two properties. The first one, established by Theorem 1, concerns the absence of aliases between the different allocated regions. The second one, established by Theorem 2, ensures that each allocation identifier is introduced at most once, along a given trace.
Firstly, we introduce a few auxiliary definitions to specify the scope of loop counters, to state the absence of aliasing in allocated memory regions, and to state that the allocation identifiers are consistent with the current execution.
Definition 2 (Alive loop counters at each program label)
For every
, let
be the subset of \(\Lambda \) that represents all the loop counters alive at the program point
,
with
the program label just after the end of the loop statement labelled by
. We also define a total order \(\le _{\Lambda }\) on loop counters that mirrors the order of their corresponding program points, i.e.,
.
Definition 3 (Non-aliasing memory state property)
A set of memory states \(R \in \wp (\mathcal {R})\) satisfies \({{\,\mathrm{non\text {-}aliasing}\,}}(R)\) when none of its allocated memory regions (described by \(\sigma \) in each memory state of R) aliases with another one.
In particular, \({{\,\mathrm{non\text {-}aliasing}\,}}(R_0)\).
Definition 4 (Program label and loop counters identify memory allocations)
with \(\preccurlyeq \) the lexicographic order of the loop counter values scanned in the order described by \(\le _{\Lambda }\) (defined in Def. 2). That is to say, two functions \(\phi \) and \(\phi '\) are considered as equal if they map the same value for each loop counter, otherwise they are ordered as the values of the minimal loop counter (with respect to the total order \(\le _{\Lambda }\)) that maps to different values. Intuitively, \(\phi \preccurlyeq \phi '\) when \(\phi '\) is reachable only after \(\phi \).
In particular,
.
This also implies that, before executing a statement
,
: the identifier of the newly allocated memory block is used for the first time.
Theorem 1 (Concrete semantics does not introduce aliasing)
Let \(R \in \wp (\mathcal {R})\) be a set of memory states and
be a statement, then
This directly implies that, if p is a program, then
.
Theorem 2 (Consistency of the identification of memory allocations)
Let \(R \in \wp (\mathcal {R})\) be a set of memory states and
be a statement, then
This directly implies that if
is a program then
.
Theorems 1 and 2 state that the properties \({{\,\mathrm{non\text {-}aliasing}\,}}\) and \({{\,\mathrm{alloc\text {-}consistent}\,}}\) are preserved by the concrete semantics of statements, which means that, if a property holds initially, it continues to hold at each program label.
3 Abstract expressions
In this section, we introduce abstract expressions. They stand for both left and right values, and they allow expressing and propagating intermediate reasoning about them.
The goal of the abstract domain is to output properties that are universally quantified over the loop counter values. Then, we introduce
, with
that corresponds to a universally quantified value of
. Typically, we will infer properties like
, with p a predicate over the memory.
We introduce the elements of
that are either:
\(\texttt {NULL}\) which is used to represent expressions that do not depend on the address of a memory block, with \(\texttt {NULL}+ 0\) that refers to the value 0. It is used to represent integers,
\( { \& }V\) which corresponds to the address where the value of the variable V is stored,
which corresponds to the first address of the memory block dynamically allocated by the statement at label
when the loop counters had the values \(\phi \). This uniquely identifies the memory block, because each statement can be evaluated only once for fixed loop counter values (as a consequence of Theorem 2).
As the goal is to represent expressions as affine forms of loop counters and dereferences, we introduce \( affine_{ctr} [K]\) and \( affine_{deref} [K]\) that are parametric non-terminals, with K a non-terminal that can be interpreted as an integer.
\( affine_{ctr} [K]\) encodes an affine expression with coefficients in K and variables that are the number of times the corresponding loop has been evaluated. As an example, if
appears in a program, then
represents the integer value: “twice the number of times the loop’s body has been evaluated until now, plus one”.
\( affine_{deref} [K]\) encodes an affine expression with coefficients in K and variables that are dereferences in the memory block starting from a
and an offset that itself is in \( affine_{deref} [K]\). As an example, the expression
can be represented by
that is in \( affine_{deref} [\mathbb {Z}]\).
From those bricks, we construct the key elements of
and
that are \( affine_{deref} [ affine_{ctr} [\mathbb {Z}]]\) and \( affine_{ctr} [ affine_{deref} [\mathbb {Z}]]\). They differ in that only the first one allows loop counters (elements of \(\Lambda \) or \(\Pi \) depending on whether they are generalized) to appear in the dereferences. This difference will be discussed in Sect. 5.
In particular, elements of
represent integers/addresses that are affine forms in symbolic dereferences, with coefficients that are affine forms in loop counters (generalized or not), with coefficients that are in \(\mathbb {Z}\). Then, an abstract right-value in
is just an abstract base plus an abstract offset starting from this base. In Example. 1 we show an element of
.
Example 1
The translation of the right-value
, with
a variable and
the value of a loop counter, can be:
We now introduce the evaluation of abstract expressions into \(\mathbb {Z}\), given a concrete memory state. This is important to prove the soundness of the translation into abstract expressions in Theorem 3. The definitions in Fig. 5 closely follow the intuition, except for \(S_{\varphi } \mathop {=}\limits ^{{\scriptscriptstyle \textrm{def}}}\,(\Lambda \cup \Pi ) \rightharpoonup \mathbb {N}\), which is a valuation of the (possibly generalized) loop counters. It will be used in the following section when defining the concretization of universally quantified properties: they have to hold for every \(\varphi \) that is a possible valuation of the generalized loop counters.
Fig. 5. Evaluation of abstract expressions as integers
3.2 Translation from expressions to abstract right-values
We now define the translation of expressions into abstract right-values. At this point, the translation is purely syntactic and does not depend on the memory state. We denote it
, and we provide a simple definition in Fig. 6. To that end, we use two helper functions which respectively add two abstract offsets and return their difference. Those functions only accumulate the dereferences that appear in the operands and merge the remaining constants. In the case of a difference, we take the opposite of the coefficients at the head of the dereferences of the second operand.
Fig. 6. Simple translation from expressions to abstract right-values
The soundness of the translations is expressed by Theorem 3.
Theorem 3 (Soundness of expression translation)
For every expression
and every memory state \((\rho ,\sigma ,\phi ) \in \mathcal {R}\),
Note that neither the translation of dereferences (the abstract base cannot be different from \(\texttt {NULL}\)) nor the translation of multiplications is precise. Indeed, such translations are purely syntactic. They could gain precision by relying on semantic properties provided by abstract domains.
We provide some hints about how to improve the translation of multiplications of elements of
. There are two cases:
if one operand is a constant in \(\mathbb {Z}\), then we distribute the multiplication by the constant to the coefficients of the other operand (for example, 5 multiplied by
would give
),
otherwise, if one operand is an affine form in the loop counters while the other one has no loop counters that appear in head, then we can distribute the multiplication by the affine form to the coefficients of the other operand (for example,
multiplied by
would give
).
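As a concrete illustration (the numbers are mine, since the paper’s examples are not reproduced here; \(\lambda _{\ell }\) stands for a loop counter and \({*}(b+4)\) is my shorthand for a symbolic dereference at offset 4 from a base b), the two cases behave as follows.

```latex
% Case 1: one operand is a constant in \mathbb{Z}; the constant scales every coefficient.
\[ 5 \times (2\lambda_{\ell} + 1) \;=\; 10\lambda_{\ell} + 5 \]
% Case 2: one operand is affine in the loop counters and the other has no loop
% counter at its head; the affine form is distributed over the coefficients.
\[ (\lambda_{\ell} + 2) \times \bigl(3 \cdot {*}(b + 4) + 1\bigr)
     \;=\; (3\lambda_{\ell} + 6) \cdot {*}(b + 4) + (\lambda_{\ell} + 2) \]
```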
4 Description of the abstract memory states
An abstract memory state is represented by an element of \(\mathcal {R}^\sharp \): a tuple of five elements that describe constraints over the memory state. The first component is the program point at which the memory state is observed. The second one expresses properties about the values of the loop counters. The third and fourth components express properties about the assignments and conditions that were previously evaluated but are still effective. The last component describes information about the size of dynamically allocated memory blocks, in case they were allocated. In the following subsections, we provide a detailed explanation of how these constraints are stored. Throughout this description, we consider the components of a possible abstract memory state of the program presented in Fig. 8, focusing on the program state at label
. This program is given both in C syntax (on the left) and in the language presented in this article (on the right). It makes use of dynamic memory allocation, loops, and conditional assignments to cover most of the properties described by the abstract memory state.
To ensure the coherence of abstract elements, the loop counters that appear in an abstract state are constrained by some side conditions about the order of appearance in the program (the order \(\le _{\Lambda }\)). These side-conditions are provided in the explanations of each component of the tuple \(R^\sharp \in \mathcal {R}^\sharp \).
Fig. 7. Definition of the abstract memory states
Fig. 8. Program with a dynamic allocation, a loop, and a conditional assignment
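Since the program of Fig. 8 is not reproduced here, the following C sketch shows a program of the same shape; all names, the loop bound, and the guard are illustrative assumptions, not the paper’s exact statements.

```c
#include <stdlib.h>

/* Hypothetical program with the shape described for Fig. 8: a dynamic
   allocation, a loop, and an assignment guarded by a condition. */
void fig8_like(int n, int c) {
    int *t = malloc(n * sizeof(int));   /* dynamic allocation: a malloc label  */
    int i = 0;
    while (i < n) {                     /* loop: its counter gets generalized  */
        if (i < c)                      /* conditional branching: a branch-key */
            t[i] = i + 1;               /* conditional assignment              */
        i = i + 1;
    }
    free(t);
}
```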
4.1 Abstraction of loop counter sections
The second component of an abstract memory state \(s^\sharp : sections ^\sharp \) represents an over-approximation of the loop counter values. It is a partial map that associates each live loop counter with its possible values. We recall that a loop counter is alive if it occurs in a loop whose body is being evaluated, that is to say, if
stands for the program label, that this loop counter belongs to
. The possible values are over-approximated by intervals with
as bounds, with possibly no upper-bound. Additionally, the abstraction of sections is constrained by the following side-condition: the image of a loop counter shall only depend on the values of the loop counters that are defined before (that is to say: for every
occurring in the image of
,
). Only the functions satisfying this side-condition will be considered in the paper.
For example, in the program presented in Fig. 8, at label
, the abstract memory state’s sections could be
. This means that
has its value within 0 and
(excluded).
We now introduce the concretization function of the corresponding part of the abstract memory state. We denote it \(\gamma _{ sections ^\sharp }: sections ^\sharp \rightarrow \wp (\mathcal {R})\), and it is defined for each element \(\phi ^\sharp \in sections ^\sharp \) as:
We notice that the side condition of \(\phi ^\sharp \) is necessary to ensure the well-formedness of the definition, when evaluating the abstraction of offset expressions.
4.2 Abstraction of assignments and tests
We now define the abstraction of constraints coming from previous assignments and tests. This abstraction relies on the description of regions for the potential value of generalized loop counters (described by elements of \( gen\text {-}sections ^\sharp \)), and a tracking of the syntactic comparisons that had to hold to reach a program state (described using \( branch\text {-}key \) and \( comp ^\sharp \) elements).
Elements of \( gen\text {-}sections ^\sharp \). They are used to express properties that apply to multiple values of loop counters. The image of
represents the values
can take, with
the generalized version of the loop counter value
. It can be either:
an abstract offset
: it is used to express that a property holds “
”, with as a side condition \( off _1\) which depends only on the live loop counters (not-generalized) defined before
(that are in
),
a couple of abstract offsets
: it is used to express that a property holds “
”, with as a side condition \( off _1\) and \( off _2\) which depend only on the live loop counters (not-generalized) defined strictly before
(that are in
).
If a live loop counter
does not appear in the domain of the element of \( gen\text {-}sections ^\sharp \), it means that the loop counter is not generalized, thus
.
We now introduce a helper function that, given the set of live loop counters, an element of \( gen\text {-}sections ^\sharp \) and a memory state, returns all the valuation functions of the generalized loop counters. We denote it \(\gamma _{ gen\text {-}sections ^\sharp }: \wp (\Lambda ) \times \wp ( gen\text {-}sections ^\sharp ) \times S_\rho \times S_\sigma \times S_\phi \rightarrow \wp (\Pi \rightharpoonup \mathbb {N})\), and it is defined for each set of live loop counters
, each element \(\Phi _g^\sharp \in \wp ( gen\text {-}sections ^\sharp )\) and each memory state \((\rho ,\sigma ,\phi ) \in (S_\rho \times S_\sigma \times S_\phi )\) as:
Elements of \( branch\text {-}key \) and \( comp ^\sharp \). They are used together to represent the validity of comparisons. First, \( branch\text {-}key \) is used to determine where the comparison comes from, that is, from which conditional branching (through its program label in \(\mathcal {L}\)) and with which outcome (\(\texttt {t}\) if the condition was fulfilled, and \(\texttt {f}\) otherwise). Then, the comparison is represented by the comparison operator and the two compared abstract right-values.
Elements of \( assigns ^\sharp \). They describe for each program point
corresponding to an assignment statement that, for each possible value of the generalized loop counters (described by elements of \( gen\text {-}sections ^\sharp \)), if the proof obligations (described by an element of \( branch\text {-}key \rightharpoonup comp ^\sharp \)) hold, then the addresses represented by a given abstract left-value point to the values represented by a given abstract right-value. As side conditions, the only loop counters allowed to appear in the abstract left/right-values are the generalized versions of the live loop counters, that are the elements of
.
For example, in the program presented in Fig. 8, at label
, the abstract memory state’s assignments could be the following:
Let us describe in details this element of \( assigns ^\sharp \):
The first line describes the assignment at program label
. At this point, no loop counters are defined, so none of them can be generalized. This is what the first occurrence of the symbol \(\varnothing \) indicates. Moreover, this assignment is evaluated unconditionally. This is what is embedded in the second occurrence of the symbol \(\varnothing \). The two other elements of the tuple are respectively the left and right values after their translations into abstract forms. The left-value corresponds to the variable
and the right-value to the memory block allocated in the statement labelled by
.
The second line is really similar to the first, except that the left-value is
and the right-value is the integer 0.
The third line differs in the presence of a non-\(\varnothing \) \( gen\text {-}sections ^\sharp \) element. It indicates that the corresponding property holds for all
between 0 and
(excluded, because the considered abstract memory state is in
). Indeed,
is equal to the number of times the loop’s body has been entirely executed.
The rest is a bit more complex because the assignment is conditional, and because it might have already been executed for the current value of the loop counter
. It states that, both for
(indicated by the symbol \(\varnothing \)), but also for all
within 0 and
(indicated by
), if the condition evaluated at program label
was true (that is
), then the
th cell of the memory block allocated at program label
points to the value
.
We now introduce the concretization function of the corresponding part of the abstract memory state. We denote it \(\gamma _{ assigns ^\sharp }: assigns ^\sharp \rightarrow \wp (\mathcal {R})\), and it is defined for each element \(a^\sharp \in assigns ^\sharp \) as
with \((\rho ,\sigma ,\varphi ) \vdash c^\sharp \) that represents that all the proof obligations embodied by \(c^\sharp \) are satisfied in the memory state \((\rho ,\sigma ,\varphi )\). More formally:
Elements of \( guards ^\sharp \). They are used to indicate that a conditional branching statement produced the same result across multiple iterations when evaluated within loops. This is represented by giving, for each comparison and outcome (represented by an element of \( branch\text {-}key \)) and for each range of values the loop counters can take (described by an element of \( gen\text {-}sections ^\sharp \)), the comparison that holds (described by an element of \( comp ^\sharp \)).
For example, in the program presented in Fig. 8, at label
, the abstract memory state’s guards could be the following.
The domain of this function is the singleton
which indicates that the condition labelled by
(that is
) holds for the current value of the loop counter (denoted by the element \(\varnothing \)), but also for every
between 0 and the current value of the loop counter (denoted by the element
).
We now introduce the concretization function of the corresponding part of the abstract memory state. We denote it \(\gamma _{ guards ^\sharp }: guards ^\sharp \rightarrow \wp (\mathcal {R})\), and it is defined for each element \(g^\sharp \in guards ^\sharp \) as
4.3 Abstraction of memory blocks dynamically allocated
The elements of \( mallocs ^\sharp \) describe, for each program point
that is associated to a \(\textbf{malloc}\) statement, that if for a given valuation of the loop counters at that time the memory allocation was performed, then the size of the corresponding memory block is described by a given
. If the size cannot be expressed precisely, then \(\top \) is used. As side conditions, the only loop counters allowed to appear in the representation are the ones of
. This is because, as explained in the concrete semantics, the memory allocation is identified uniquely by the program label and the value of the loop counters at that time.
For example, in the program presented in Fig. 8, at label
, the corresponding abstract memory state’s element could be
. This represents that, if it has been allocated, the memory block allocated at program label
and identified by the loop counters’ values at that time, is of size
.
We now introduce the concretization function of the corresponding part of the abstract memory state. We denote it \(\gamma _{ mallocs ^\sharp }: mallocs ^\sharp \rightarrow \wp (\mathcal {R})\), and it is defined for each element \(m^\sharp \in mallocs ^\sharp \) as
4.4 Combining all the constraints described by abstract memory states
We finally introduce the concretization function of abstract memory states. It is denoted \(\gamma : \mathcal {R}^\sharp \rightarrow \wp (\mathcal {R})\), and it is defined for each abstract memory state
as
The initial abstract memory state is
with
the first program label. It is a sound abstraction of \(R_0\), that is \(R_0 \subseteq \gamma (R_0^\sharp )\), because the only property enforced by \(\gamma (R_0^\sharp )\) is to be a non-aliasing memory state, which is verified for \(R_0\).
5 Operations on abstract memory states
In this section we describe the main abstract transfer functions that compute a sound approximation of their concrete counterparts. The abstract semantics is described by
that is the abstract counterpart of the concrete semantics introduced in Fig. 3. The soundness of the abstract transfer functions is expressed by Theorem 4, and elements of proof are given alongside their definitions.
Theorem 4 (Soundness of the abstract transfer functions)
For every statement
and every abstract memory state \(R^\sharp \in \mathcal {R}^\sharp \),
Definition 5
We introduce
that, given two abstract memory states, sets of universally quantified loop counters, and abstract expressions in
, returns an abstract expression that matches the semantics of the two original ones when evaluated on their respective domains. We provide an overview of the process that is guided by heuristics:
we identify parts of the expressions that trivially match in order to reduce the size of expressions to be merged,
for each loop counter that both maps to a constant value and that is not generalized in only one abstract memory state, we replace the occurrences of that loop counter by the constant,
then, if both expressions are evaluated in the context of abstract memory states in which a loop counter is not generalized and maps to two different constant values, we linearly interpolate the remaining parts of the expression. As an example, if we try to merge the expressions 0 (with
) and
(with
), then we output the expression
that is indeed semantically equal to both expressions on their respective abstract memory states.
Abstract join. We describe the abstract transfer function corresponding to the union after a conditional branching at program point
. Given two abstract memory states
and
, it is defined as follows:
the new section of each loop counter
, if
, is (l, u) defined as follows. First the lower-bound l is the minimum between \(l_1\) and \(l_2\) if it exists, otherwise 0 (that is a lower-bound over the values of the loop-counters). Symmetrically, we define u as the maximum value between \(u_1\) and \(u_2\) if it exists, or \(+\infty \),
the universally quantified predicates expressed by the elements of \( assigns ^\sharp \) and \( guards ^\sharp \) are joined. For all the elements of the domain that appear in the domain of both operands, the expressions involved (either the left/right-values, or the compared right-values) are merged using the \( merge \) operator described above. In case there are proof obligations in the property, the compared expressions are also merged. In the case where the assignment/test only appears in one operand, if the test located at label
still exists in the corresponding element of \( guards ^\sharp \), it means we can keep the property if we add the proof obligation corresponding to the element of \( guards ^\sharp \),
lastly, the new element of \( mallocs ^\sharp \) is just the result of the \( merge \) operator with no loop counters generalized, or \(\top \) if they could not get merged.
Widening. Applying consecutive abstract joins to compute the abstract transfer function corresponding to loops could loop forever. To mitigate this problem, after computing the abstract join, if the lower bound (resp. upper bound) of a loop counter’s section keeps decreasing (resp. increasing), we widen it to 0 (resp. \(+\infty \)). However, to limit the loss of precision, we use widening thresholds. To that end, during the first loop iterations we record the array lookups, and in particular the abstract expressions the loop counters would have to reach for the offset to exceed the bounds of the memory block. Then, we first try to widen the loop counter bounds to those values. To ensure that the sequence of abstract memory states is ultimately stationary, we limit the number of such widenings to a fixed number.
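A minimal sketch of this thresholded widening of an upper bound, under the assumption that the recorded thresholds are kept sorted in increasing order, could look as follows (an illustration, not the actual implementation).

```c
#include <limits.h>

/* Widen the upper bound `old_bound` against the increasing bound `new_bound`,
   trying the `n` recorded thresholds (sorted in increasing order) before
   jumping to +infinity, represented here by INT_MAX. */
int widen_upper(int old_bound, int new_bound, const int *thresholds, int n) {
    if (new_bound <= old_bound)
        return old_bound;                    /* the bound is stable: keep it        */
    for (int k = 0; k < n; k++)
        if (new_bound <= thresholds[k])
            return thresholds[k];            /* smallest threshold above the bound  */
    return INT_MAX;                          /* no threshold left: widen fully      */
}
```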
Abstract assignments. When performing an assignment in the abstract, we first translate the left and right values; then we remove (or replace by its previous value, if it can be expressed) all the occurrences of the possibly assigned cell, both in the abstract memory state and in the translated left and right values.
To avoid losing the properties about previous assignments at the same program point but from past iterations, it is necessary to prove that the assigned cell does not alias with the previously assigned cells. To that end, allowing loop counters to appear only at the head of
is helpful: if we can order the loop counters such that each coefficient is larger than the part of the expression involving the smaller loop counters, then the desired property is proven.
Once the contents of the possibly affected cells have been forgotten or rewritten, we can add the assignment (with no generalized counters) to the element of \( assigns ^\sharp \) and, if one already exists, merge the previous and new abstract left/right values while keeping the previous proof obligations. Finally, if the assignment also corresponds to a memory allocation, the size of the memory block has to be merged with the previously existing one. Note that, for the soundness proof, we use the fact that the memory states in the concretization have to be \({{\,\mathrm{alloc\text {-}consistent}\,}}\). This guarantees that the memory block being allocated has never been allocated before, so the expression can indeed be merged.
Abstract filter. When filtering the abstract memory state by a comparison, we apply the same strategy as in the assignment case: we translate the expressions and merge them with the existing abstract comparison (if one already exists) with no loop counters generalized (that is, adding the element
to \(\Phi _g^\sharp \)).
Loop counter handling. To define the abstract transfer function for the loop statements, we still need to define the abstract counterparts of
,
,
. The first one only adds the loop counter
to the living counters, with default value 0. The second one tries to determine an abstract expression for the value of
. If it can find one, it is used to replace the loop counter everywhere it was defined, otherwise all the properties that involve it are forgotten. Finally, the section corresponding to the loop counter is removed from the abstract memory state. The third one is a bit more complicated: when the loop counter
is incremented, all its occurrences in the abstract memory state have to be replaced by
. Then, the predicates quantified by
that ranges over an interval with
as the upper-bound remain only if the predicate still held for the current value of
. The predicates that weren’t quantified by
that ranges over an interval with
as the upper-bound but that hold for the current value of
that is a constant expression are generalized with
that ranges over the interval with the current expression of
as the lower bound, and
as the upper-bound. If the current value of
couldn’t be proven a constant expression, the predicate is generalized with
that ranges over the interval with
as the lower-bound and
as the upper-bound.
6 Case studies and analysis
In this section, we describe the properties inferred by the abstract domain presented in this paper on some simple, but still realistic, programs. We first demonstrate the capability of the abstract domain to prove relational properties about multidimensional arrays (here, matrices). Second, we consider a program that is used to set up the expanded memory of an x86 operating system. In particular, this setting takes into account the memory needed by the BIOS, the bootloader, and the code of the operating system. We are particularly interested in this example because it was the initial motivation for this abstract domain. Finally, we present a program that computes the index of the first element greater than a given value in a sorted array.
6.1 Relational properties about multidimensional arrays
Fig. 9. Transposition of a square matrix
The program depicted in Fig. 9 computes the transposition of a square matrix
into another one
. The abstract domain presented in this paper is able to prove that, at the end of the program, the following properties about the assignments hold:
To detail the property, we have:
both
and
are arrays (of size
, which is given by a different component of the abstract memory state),
they both contain, in each of their cells pointers to fresh arrays (again of size
) with
that is just
when we unfold the definition of
,
then each cell of the matrix
contains the value of the corresponding cell in the matrix
that has been transposed, with
that corresponds to
after unfolding the definition of
and the one of
,
the values of
and
are not described in the property above because, at the end of the loop, if
their values would be 0 and not
.
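Since Fig. 9 is not reproduced here, the following C sketch shows a plausible shape for such a transposition program; the names and the way the matrix a is initialized are assumptions of this illustration.

```c
#include <stdlib.h>

/* Plausible shape for the program of Fig. 9: allocate two n x n matrices as
   arrays of pointers to freshly allocated rows, then transpose a into b. */
void transpose(int n) {
    int **a = malloc(n * sizeof(int *));
    int **b = malloc(n * sizeof(int *));
    for (int i = 0; i < n; i++) {
        a[i] = malloc(n * sizeof(int));     /* fresh row of size n */
        b[i] = malloc(n * sizeof(int));     /* fresh row of size n */
        for (int j = 0; j < n; j++)
            a[i][j] = i * n + j;            /* some initial contents for a */
    }
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            b[j][i] = a[i][j];              /* each cell of b gets the transposed cell of a */
}
```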
6.2 Bootstrap of the paging in an x86 operating system
Fig. 10. Paging bootstrap in an x86 operating system
The program introduced in Fig. 10 is an excerpt of an x86 operating system, in particular the bootstrap of the expanded memory. In this context, we consider an array
of size 1024, where the entry at index
holds the value
. Each entry of this array, as a first approximation, is used to map the 4 KiB memory region starting at virtual address
to the corresponding 4 KiB region beginning at physical address
, with some extra parameters defined by
. Then, setting up
allows describing the first 4 MiB of the virtual memory (each page being 4 KiB long). In Fig. 10, all the pages of the first 4 MiB are mapped such that each virtual address matches its physical address. However, the flags of the memory regions differ. The first 8 pages (32 KiB) are reserved for the BIOS and the bootloader with flag 1 (supervisor-only, read-only). The code of the operating system is then placed (by the bootloader) just after. It is of size
that is only known at link time, and the flag for this region is 5 (user-accessible, read-only). Then, all the remaining virtual addresses of the first 4 MiB are given as free space to the operating system with flag 7 (user-accessible, read/write). In reality, the memory layout would be more complex (in particular, we would need to deal with more ranges of reserved addresses), but this example is representative of what one would find in an operating system.
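Since Fig. 10 is not reproduced here, the following C sketch shows the kind of page-table initialization being described; the function and parameter names (in particular kernel_pages, standing for the number of pages occupied by the OS code, known at link time) are assumptions of this illustration.

```c
/* Hedged reconstruction of the kind of code excerpted in Fig. 10: fill a
   1024-entry page table so that each virtual address maps to the identical
   physical address, with flags 1 / 5 / 7 for the BIOS+bootloader, the OS code,
   and the remaining free pages respectively. */
void setup_page_table(unsigned int *page_table, unsigned int kernel_pages) {
    unsigned int i = 0;
    while (i < 8)                { page_table[i] = 4096 * i + 1; i = i + 1; }  /* BIOS + bootloader */
    while (i < 8 + kernel_pages) { page_table[i] = 4096 * i + 5; i = i + 1; }  /* OS code           */
    while (i < 1024)             { page_table[i] = 4096 * i + 7; i = i + 1; }  /* free, user R/W    */
}
```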
Analyzing such programs requires being precise about the content of the paging structures in order to prove that memory accesses are legitimate. In particular, techniques like array smashing [3] or the dynamic partitioning of arrays [6, 8, 10] do not provide enough relations between the indexes of arrays and their contents.
The abstract domain presented in this paper is able to prove that, at the end of the program execution, and if we assume that
, the following properties about the assignments hold:
These properties can be understood as follows:
is an array (of size
, which is given by a different component of the abstract memory state),
the cell at offset \(i \in [0,8)\) contains the value \(4096 \times i + 1\),
the cell at offset
contains the value \(4096 \times i + 5\), which is translated into: the cell at offset
contains the value \(4096 \times j + 8 \times 4096 + 5\),
the cell at offset
contains the value \(4096 \times i + 7\), which is translated into: the cell at offset
contains the value
,
the value of
and the one of
is 1024.
6.3 Index of the first element greater than a given value
Fig. 11. Index of the first element greater than a given value in a (sorted) array
We consider the program introduced in Fig. 11 that, given an integer
and a sorted array
of size
(supposed greater than 0), returns the index of the first element of
which is greater than
, or
in case no element is greater. Note that, even if
is not sorted, no out-of-bound accesses are performed. The properties computed by the domain and presented below are also useful to analyze linear interpolations of tabulated functions. Indeed, to prove that the result of the interpolation lies between the corresponding tabulated values, it is required to prove that the point at which the function is interpolated (here, x) lies between two consecutive tabulated keys.
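Since Fig. 11 is not reproduced here, one possible shape for such a program is sketched below; it is an illustration consistent with the description above, not the paper’s exact code.

```c
/* Return the index of the first element of `t` (of size `size` > 0) that is
   greater than `x`, or `size` when no element is. The index is checked before
   the access, so no out-of-bound read occurs even when `t` is not sorted. */
int first_greater(const int *t, int size, int x) {
    int i = 0;
    while (i < size && t[i] <= x)
        i = i + 1;
    return i;
}
```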
The abstract domain is able to express the following loop invariant at program label
:
with
, thanks to the widening threshold (described in Sect. 5) that is used because of
is of size
. The reason why
cannot take the value of
is because the second line of the invariant contradicts the fourth line. This contradiction is found by using a heuristic that, when applying
, checks whether the negation appears in the still-effective comparison predicates represented by the \( guards ^\sharp \) element of the abstract memory state.
Then, after exiting the loop at program label
we get the following property:
The last two lines of the property indicate that
is the index of the first element that is greater than
.
Representation of partial functions. Partial functions are widely used in the abstract memory state. In the implementation, those functions are represented by functional maps implemented as balanced binary search trees. This allows accessing or updating elements in time logarithmic in the number of elements (which is always bounded by the number of program labels times the number of nested loops).
Forward goto. Such instructions are supported in the implementation by using a conservative approach: we apply the abstract transfer functions that correspond to the loops exited and entered between the source and the destination of the goto statement within the AST.
Complex types. C programs may include complex elements like casts and struct or union types. Because the abstract domain presented in this paper requires handling pointers as left-values, it was not possible to fully leverage the abstraction layer provided by the struct domain described in [15]. At this time, retrieving the value of a previously assigned left-value is possible only when the types trivially match. In particular, it is currently not possible to recompose an expression from multiple assignments. This would be an interesting addition in the future.
Memoization. When an abstract expression has to be dereferenced and a matching assignment is found, the identifier of the assignment is stored in a mutable field of the abstract expression so that, if it is later dereferenced again, the memoized assignment will be checked before going through the usual process.
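A rough sketch of this mechanism, with hypothetical helper routines matches and find_assignment standing for the analyzer’s matching and search procedures, could look as follows (an illustration, not the actual implementation).

```c
/* An abstract expression carries a mutable cache holding the identifier of the
   last assignment found to match it; that candidate is checked before the
   usual search. */
struct abs_expr {
    int expr_id;              /* stands for the symbolic content of the expression */
    int cached_assign_id;     /* -1 while nothing has been memoized                */
};

int resolve_deref(struct abs_expr *e,
                  int (*matches)(int expr_id, int assign_id),
                  int (*find_assignment)(int expr_id))
{
    if (e->cached_assign_id >= 0 && matches(e->expr_id, e->cached_assign_id))
        return e->cached_assign_id;                      /* cache hit              */
    e->cached_assign_id = find_assignment(e->expr_id);   /* usual search + update  */
    return e->cached_assign_id;
}
```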
Types that cannot be represented. Some expressions, such as floats, cannot be represented by abstract expressions. Still, in the implementation,
can represent either an integer/pointer (as in this paper) or the dereference of a pointer of a specified type. It is then possible, for example, to copy an array of floats and keep the property that the two arrays contain the same values.
Using information from other domains. In order to enhance the expressiveness of the abstract domain, we use properties inferred by the other domains to refine the abstract memory state. In particular, for each dereference that appears in an abstract expression of the abstract state, we attach to it its abstract value computed using only the non-relational value domains. The properties computed by the interval domain [5], the congruence domain [9], and the bitfield domain [17] are particularly useful to efficiently determine which assignments and guards could be affected by a new assignment. The properties inferred by Andersen’s pointer analysis [1] are also used to reduce the loss of information when assigning a left-value which cannot be translated into an abstract left-value, or which cannot be proved to be within the bounds of its base.
Sharing information with other domains. To be beneficial to the analysis, sharing discovered properties with other domains is essential. When a right-value associated with an assignment or a guard is computed thanks to universal eliminations, or when the value of a loop counter is fixed when exiting a loop, the information is propagated to the other domains. In addition, when an alarm (division by 0, integer overflow, out-of-bound access, ...) raised by another domain is proved false in this domain, it is removed.
We deeply thank the anonymous referees for the precise reviews. We also thank Marco Campion and Guannan Wei for the insightful conversations.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.