Published in: International Journal on Software Tools for Technology Transfer 5/2021

Open Access 14.02.2021 | General

Correct program parallelisations

Authors: S. Blom, S. Darabi, M. Huisman, M. Safari



Abstract

A commonly used approach to develop deterministic parallel programs is to augment a sequential program with compiler directives that indicate which program blocks may potentially be executed in parallel. This paper develops a verification technique to reason about such compiler directives, in particular to show that they do not change the behaviour of the program. Moreover, the verification technique is tool-supported and can be combined with proving functional correctness of the program. To develop our verification technique, we propose a simple intermediate representation (syntax and semantics) that captures the main forms of deterministic parallel programs. This language distinguishes three kinds of basic blocks: parallel, vectorised and sequential blocks, which can be composed using three different composition operators: sequential, parallel and fusion composition. We show how a widely used subset of OpenMP can be encoded into this intermediate representation. Our verification technique builds on the notion of iteration contract to specify the behaviour of basic blocks; we show that if iteration contracts are manually specified for single blocks, then that is sufficient to automatically reason about data race freedom of the composed program. Moreover, we also show that it is sufficient to establish functional correctness on a linearised version of the original program to conclude functional correctness of the parallel program. Finally, we exemplify our approach on an example OpenMP program, and we discuss how tool support is provided.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

A common approach to handle the complexity of parallel programming is to write a sequential program augmented with parallelisation compiler directives that indicate which parts of the code may be parallelised. A parallelising compiler consumes the annotated sequential program and automatically generates a parallel version. This parallel programming approach is often called deterministic parallel programming, as the parallelisation of a deterministic sequential program augmented with correct compiler directives is always deterministic. Deterministic parallel programming is supported by different languages and libraries, such as OpenMP [20], and is often used for financial and scientific applications (see e.g. [4, 11, 17, 21]).
Although it is relatively easy to write parallel programs in this way, careless use of compiler directives can easily introduce data races and consequently non-deterministic program behaviour. This paper proposes a tool-supported static verification technique to prove that parallelisation as indicated by the compiler directives does not introduce such non-determinism. Our technique is not fully automatic: the user has to add some additional annotations, and verification of these annotations guarantees that program behaviour is not changed by the compiler directives. Moreover, we also show that it is sufficient to prove functional correctness on a sequential version of the program in order to conclude functional correctness of the parallel program. We develop a verification technique to reason about data race freedom and functional correctness on an intermediate representation language, called PPL (for Parallel Programming Language), which captures the core features of deterministic parallel programming. We then show that a commonly used subset of a deterministic programming language such as OpenMP can be encoded into this intermediate representation; thus, our verification technique allows us to reason about the correctness of compiler directives in OpenMP. The verification technique is implemented as part of our program verifier VerCors. This means that if we (manually) annotate an OpenMP program with specifications, data race freedom and functional correctness can be verified automatically. We illustrate this approach on some characteristic examples.
In essence, our intermediate representation language PPL is defined in terms of the composition of code blocks. We identify three kinds of basic blocks: a parallel block, a vectorised block and a sequential block. Basic blocks are composed by three binary block composition operators: sequential composition, parallel composition and fusion composition where the fusion composition allows two parallel basic blocks to be merged into one. An operational semantics for PPL is presented.
Our verification technique requires that users specify each basic block by an iteration contract that describes which memory locations are read and written by a thread. We introduce these contracts and present verification rules for basic blocks. Moreover, the program itself can be specified by a global contract. To verify the global contract, we show that the block compositions are memory safe (i.e. data race free) by proving that for all the iterations that might run in parallel, all accesses to shared memory are non-conflicting, meaning that they are disjoint or they are read accesses. If all block compositions are memory safe, then it is sufficient to prove that the sequential composition of all the basic blocks w.r.t. program order is memory safe and functionally correct to conclude that the parallelised program is functionally correct.
The main contributions of this paper are the following:
  • An intermediate representation language PPL that captures the core features of deterministic parallel programming, with a suitable operational semantics.
  • An algorithm that encodes a commonly used subset of OpenMP into its PPL intermediate representation.
  • A tool-supported verification approach for reasoning about data race freedom and functional correctness of OpenMP programs by using the encoding of OpenMP into PPL.
This paper is an extended version of our paper presented at NFM 2017 [12]. In addition, it contains (1) a rephrasing of the verification rules for parallel and vectorised loops, presented at FASE 2015 [5] in the setting of PPL, i.e. rephrasing them for basic blocks, and (2) an algorithm that encodes a commonly used subset of OpenMP into PPL.
This paper is organised as follows. After some background information on OpenMP and our program specification language, Sect. 3 introduces our intermediate representation language PPL, presenting syntax and semantics. Then, Sect. 4 shows how OpenMP programs are encoded into PPL. Section 5 presents the verification rules for basic blocks, while Sect. 6 presents the verification rules for block compositions. Section 7 provides more information on how the tool support is provided, while Sect. 8 uses our technique on an OpenMP program. Finally, Sect. 9 presents related work, and Sect. 10 concludes the paper and discusses future work.

2 Background

This section provides some background information on the OpenMP compiler directives and briefly introduces syntax and semantics of our program specification language.

2.1 OpenMP

As mentioned above, in this paper we consider a frequently used subset of OpenMP constructs, using only the following pragmas: omp parallel, omp for, omp simd, omp for simd, omp sections, and omp single, as well as all allowed clauses. We illustrate these OpenMP features by means of examples. For full details on OpenMP, we refer to [20]. Later, Sect. 4 shows how programs in this subset are encoded into our core parallel programming language, and Sect. 8 shows how to verify that these programs can safely be parallelised, after the user has added the necessary program contracts.
Example 1
Figure 1 presents a sequential C program augmented by OpenMP compiler directives (called pragmas). The pivotal parallelisation annotation in OpenMP is omp parallel which denotes a parallelisable code block (called parallel region). Threads are forked upon entering a parallel region and joined back into a single thread at the end of the region.
This example shows a parallel region with three for-loops \(\textsf {L1}\), \(\textsf {L2}\), and \(\textsf {L3}\). The loops are marked as omp for, meaning that they are parallelisable (i.e. their iterations are allowed to be executed in parallel). To precisely define the behaviour of threads in the parallel region, omp for annotations are extended with clauses. For example, the combined use of the nowait and schedule(static) clauses indicates the fusion of the parallel loops \(\textsf {L1}\) and \(\textsf {L2}\), meaning that the corresponding iterations of \(\textsf {L1}\) and \(\textsf {L2}\) are executed by the same thread without waiting. The nowait clause eliminates the implicit barrier at the end of omp for, and the schedule(static) clause ensures that the OpenMP compiler assigns the same thread to corresponding iterations of the loops.
In OpenMP, all variables that are not local to a parallel region are shared by default, unless they are explicitly declared private (using the private clause) when they are passed to a parallel region.
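Figure 1 itself is not reproduced in this version; the following fragment (array names and loop bodies are assumptions, chosen for illustration) has the shape described above:

    #pragma omp parallel
    {
      #pragma omp for schedule(static) nowait  // L1: fused with L2
      for (int i = 0; i < N; i++)
        a[i] = b[i] + 1;
      #pragma omp for schedule(static)         // L2: iteration i reads a[i],
      for (int i = 0; i < N; i++)              // written by iteration i of L1
        c[i] = a[i] * 2;                       // on the same thread
      #pragma omp for                          // L3: starts only after the
      for (int i = 0; i < N; i++)              // implicit barrier ending L2
        d[i] = c[i] + a[i];
    }

Because L2 only reads locations that the same thread wrote in L1, the fusion is safe here; L3 is separated from L2 by the implicit barrier.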
Since OpenMP 4.0, support for the single instruction multiple data (SIMD) execution model has been added to the OpenMP standard. The SIMD execution model is a well-known technique to speed up vector arithmetics, specifically in scientific applications.
Example 2
Figure 2 presents an OpenMP example to illustrate this. The first loop uses the omp simd annotation to vectorise the for-loop \(\textsf {L1}\): the iterations of the loop are partitioned into chunks whose size equals the vectorisation size given by the simdlen clause (i.e. \({\textsf {M}}\) in this example). The loop execution is then defined as the sequential execution of these chunks, where each chunk is executed in a vectorised fashion.
The second for-loop (\(\textsf {L2}\)) shows the other form of OpenMP vectorisation, using the omp for simd annotation. In this case, the loop execution is defined similarly; however, the chunks are executed in parallel rather than sequentially. Figure 3 visualises the execution of these loops.
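The figures are again not reproduced; a fragment in the same spirit (loop bodies and the constant M are assumptions) is:

    // L1: one thread executes the chunks of M iterations one after another;
    // each chunk runs in SIMD fashion
    #pragma omp simd simdlen(M)
    for (int i = 0; i < N; i++)
      a[i] = 2 * b[i];

    // L2: the chunks of M iterations are distributed over the threads of the
    // enclosing parallel region, and each chunk runs in SIMD fashion
    #pragma omp for simd simdlen(M)
    for (int i = 0; i < N; i++)
      c[i] = a[i] + b[i];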
Example 3
Figure 4 shows how the parallel execution of two parallel regions is defined in OpenMP. The example consists of three parallel regions: \(\mathsf {P_{1}}\) in lines 4–11, \(\mathsf {P_{2}}\) in lines 14–23 and \(\mathsf {P_{3}}\) in lines 26–29. Similar to the previous examples, the behaviour of each thread is defined by further OpenMP compiler directives. We use the omp sections annotation, which defines blocks of code (marked by omp section) that are executed in parallel. For example, two threads are forked upon entering the parallel region \(\mathsf {P_{1}}\): one executes the method \(\mathsf {add}\) and the other executes the method \(\mathsf {mul}\). Note that the bodies of these methods are themselves parallel regions. Therefore, the threads executing the \(\mathsf {add}\) and \(\mathsf {mul}\) methods fork more threads upon entering the parallel regions \(\mathsf {P_{2}}\) and \(\mathsf {P_{3}}\), respectively. The parallel region \(\mathsf {P_{2}}\) is a fusion and the parallel region \(\mathsf {P_{3}}\) is a single parallel loop, where omp parallel for is shorthand for an omp parallel containing a single omp for.
Example 4
Figure 5 shows an OpenMP program using incorrect compiler directives, which results in data races. As there is a data dependence between the two loops, we need a barrier between them when we parallelise the loops. However, the nowait clause explicitly removes the barrier, which results in an erroneous parallelisation. Using our approach, since the user has to specify iteration contracts for the two loops, we can detect that this parallelisation would lead to data races.
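The figure is not reproduced; the following fragment (with assumed loop bodies) exhibits the same problem. Iteration i of the second loop reads a[i+1], which is written by iteration i+1 of the first loop, possibly by a different thread; because the barrier between the loops has been removed, that write may not have happened yet:

    #pragma omp parallel
    {
      #pragma omp for schedule(static) nowait  // barrier erroneously removed
      for (int i = 0; i < N; i++)
        a[i] = b[i] + 1;
      #pragma omp for schedule(static)
      for (int i = 0; i < N - 1; i++)
        c[i] = a[i + 1] * 2;                   // data race on a[i+1]
    }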

2.2 Program specifications: syntax and semantics

Our program specification language is based on permission-based separation logic, combined with the look-and-feel of the Java Modeling Language (JML) [18]. In this way, we exploit the expressiveness and readability of JML, while using the power of separation logic to support thread-modular reasoning. We briefly explain the syntax and semantics of permission-based separation logic formulas and how they extend the standard first-order-logic JML program annotations.
Syntax Threads hold permissions to access memory locations. Permissions are encoded by fractional values, as introduced by Boyland [9]: any fraction in the interval \((0, 1)\) denotes a read permission, while 1 denotes a write permission. Permissions can be split and combined, but soundness of the logic ensures that for every memory location the total sum of permissions over all threads to access this location never exceeds 1. This guarantees that if the permission specifications can be verified, the program is data-race-free. The set of permissions that a thread holds is typically called its resources.
Formulas F in our program specification language are built from first-order logic formulas b, permission predicates \({\textsf {Perm}(e_1,e_2)}\), conditional expressions (\(\cdot ?\cdot :\cdot \)), separating conjunction \(\mathop {\star }\), and universal separating conjunction \(\bigstar \) over a finite set I. The syntax of formulas is formally defined as follows:
$$\begin{aligned} \begin{array}{l} F ::= b \mid \textsf {Perm}({e_1}, {e_2}) \mid b ? F : F \mid F \mathop {\star }F \mid {\bigstar _{i\in I} F(i)} \\ b ::= \mathbf{true} \mid \mathbf{false} \mid e_1 == e_2 \mid e_1 \le e_2 \mid \lnot b \mid b_1 \wedge b_2 \mid \dots \\ e ::= v \mid n \mid [e] \mid e_1+e_2 \mid e_1-e_2 \mid \dots \end{array} \end{aligned}$$
where b is a side-effect free Boolean expression, e is a side-effect free arithmetic expression, [.] is a unary dereferencing operator—thus [e] returns the value stored in the address e in shared memory—v ranges over variables and n ranges over numerals. We assume the first argument of the \({\textsf {Perm}(e_1,e_2)}\) predicate is always an address and the second argument is a fraction. For convenience, we often use the keyword read instead of an explicit fraction to specify an arbitrary read permission, and the keyword write instead of 1 to denote a write permission.
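For instance, a write permission can be split into two read permissions and recombined later, which is expressed by the equivalence (where \(\mathop {\star }\) is the separating conjunction introduced above):
$$\begin{aligned} \textsf {Perm}(e, 1) \iff \textsf {Perm}(e, \tfrac{1}{2}) \mathop {\star } \textsf {Perm}(e, \tfrac{1}{2}) \end{aligned}$$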
We use the array notation a[e] as syntactic sugar for \([a+e]\) where a is a variable containing the base address of the array a and e is the subscript expression; together they point to the address \(a+e\) in shared memory.
Semantics Our semantics mixes concepts of implicit dynamic frames [25] and separation logic with fractional permissions, which makes it different from the traditional separation logic semantics and more aligned towards the way separation logic is implemented using traditional first order logic tooling. For further reading on the relationship between separation logic and implicit dynamic frames, we refer to the work of Parkinson and Summers [22].
To define the semantics of formulas, we assume the existence of the following domains: \(\textsf {Loc}\), the set of memory locations, \(\textsf {VarName}\), the set of variable names, \(\textsf {Val}\), the set of all values, including memory locations, and \(\textsf {Frac}\), the set of fractions ([0, 1]).
We define memory as a map from locations to values, \(h:\textsf {Loc}\rightarrow \textsf {Val}\). A memory mask is a map from locations to fractions, \(\pi : \textsf {Loc}\rightarrow \textsf {Frac}\), with unit element \(\pi _0: l \mapsto 0\) with respect to the point-wise addition of memory masks. A store is a function from variable names to values: \(\sigma : \textsf {VarName} \rightarrow \textsf {Val}\).
Formulas can access the memory directly; the fractional permissions to access the memory are provided by the \(\mathsf {Perm}\) predicate. A strict form of self-framing is enforced, meaning that the Boolean formulas expressing the functional properties in pre- and postconditions and invariants should be framed by sufficient resources (i.e. there should be sufficient access permissions for the memory locations that are accessed by the Boolean formula, in order to evaluate this formula).
The semantics of an expression e depends on a store \(\sigma \), a memory h, and a memory mask \(\pi \) and yields a value: \(\sigma ,h,\pi \mathop {[e\rangle }v\). The store \(\sigma \) and the memory h are used to determine the value v, and the memory mask \(\pi \) is used to determine if the expression is correctly framed, i.e. sufficient access permissions are available. For example, the rule for array access is:
$$\begin{aligned} \frac{\sigma ,h,\pi \mathop {[e\rangle } i \qquad \pi (\sigma (a)+i) > 0}{\sigma ,h,\pi \mathop {[a[e]\rangle } h(\sigma (a)+i)} \end{aligned}$$
where \(\sigma (a)\) is the initial address of array a in the memory and i is the array index obtained by evaluating the index expression e. Apart from the check for correct framing explained above, the evaluation of expressions is standard and we do not explain it any further.
The semantics of a formula F, given in Fig. 6, depends on a store, a memory, and a memory mask and yields a memory mask: \(\sigma ,h,\pi \mathop {[F\rangle }\pi '\). The given mask \(\pi \) denotes the permissions by which the formula F is framed. The yielded mask \(\pi '\) denotes the additional permissions provided by the formula. Thus, a Boolean expression is valid if it is true and yields no additional permissions (rule Boolean), while evaluating a \(\mathsf {Perm}(e_1, e_2)\) predicate yields additional permissions to a location, provided the expressions \(e_1\) and \(e_2\) are properly framed (rule Permission). Note that evaluation of expression \(e_1\) results in a location \(l\), while evaluation of expression \(e_2\) results in a fraction f. The rule checks that the permissions already held on location \(l\) plus the additional fraction \(f\) do not exceed 1. The rules for the evaluation of a conditional formula are standard (rules Cond 1 and Cond 2). We overload the standard addition \(+\), summation \(\varSigma \), and comparison operators to denote pointwise addition, summation and comparison over memory masks. These operators are used in the rules SepConj and USepConj. In the rule SepConj, the formulas \(F_1\) and \(F_2\) each yield a separate memory mask, \(\pi '\) and \(\pi ''\), respectively, and the final memory mask is the pointwise addition \(\pi ' + \pi ''\). The rule checks that \(F_1\) is framed by \(\pi \) and \(F_2\) is framed by \(\pi +\pi '\). Note that since \(F_2\) is framed by \(\pi +\pi '\), this implicitly guarantees that the permissions per location never exceed 1. Finally, the rule USepConj generalises this evaluation to a finite set of formulas conjoined by the universal separating conjunction operator. Again, rule USepConj checks that the permission fractions on any location in the memory cannot exceed 1.
Finally, a formula F is valid for a given store \(\sigma \), memory h and memory mask \(\pi \) if, starting with the empty memory mask \(\pi _0\), the memory mask required by F is at most \(\pi \):
$$\begin{aligned} \sigma , h, \pi \models F \text{, } \text{ if } (\sigma ,h,\pi _0\mathop {[F\rangle }\pi ') \wedge (\pi ' \le \pi ) \end{aligned}$$
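For example, taking \(l = \sigma (a)\), the formula \(\textsf {Perm}(a[0], 1) \mathop {\star } a[0] == 5\) is valid for \(\sigma , h, \pi \) precisely when \(\pi (l) \ge 1\) and \(h(l) = 5\): evaluating the \(\textsf {Perm}\) conjunct from \(\pi _0\) yields a mask \(\pi '\) with \(\pi '(l) = 1\), this mask frames the evaluation of the Boolean conjunct, and validity additionally requires \(\pi ' \le \pi \).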
Example 5
Figure 7 presents an example of how we annotate a sequential program using our specification language. The formulas in the annotations are interpreted using the semantics as defined in Fig. 6. The program logic rules are the basic proof rules from separation logic (an extension of Hoare logic).
This sequential program has a loop (lines 11–17) that adds the corresponding elements of two arrays (named a and b) and stores the result in a third array (named c) in line 17. Annotations provide a function specification (lines 1–7) and a loop invariant (lines 12–16). Note that \(\backslash \)forall* indicates universal separating conjunction, \({\bigstar _{i\in I} }\), over permission predicates, and \(\backslash \)forall denotes standard universal quantification over logical predicates. Preconditions and postconditions, using the keywords requires and ensures (lines 3–6), should hold at the beginning and the end of the function, respectively. We use the keyword context to abbreviate a requires and ensures pair; this is convenient because permission pre- and postconditions are often the same. The keyword context_everywhere is used to specify an invariant property (lines 1–2) that must hold throughout the function. As pre- and postcondition, we have read permissions over all elements of arrays a and b (lines 3–4) and write permissions over all elements of array c (line 5). The loop invariant specifies the permissions that are used in the loop (lines 12–14). Further, the loop invariant specifies that when iteration i starts, we have added the elements of a and b from the beginning up to location \(i-1\) (line 15). Therefore, at the end of the loop (and the function), we have added all elements (specified as a postcondition in line 6).
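Figure 7 itself is not reproduced in this version; the following sketch is a hypothetical rendering consistent with the description (VerCors-style annotations; names and line layout are assumptions):

    /*@ context_everywhere a != NULL && b != NULL && c != NULL;
      @ context (\forall* int k; 0 <= k && k < N; Perm(a[k], read));
      @ context (\forall* int k; 0 <= k && k < N; Perm(b[k], read));
      @ context (\forall* int k; 0 <= k && k < N; Perm(c[k], write));
      @ ensures (\forall int k; 0 <= k && k < N; c[k] == a[k] + b[k]);
      @*/
    void add(int a[], int b[], int c[], int N) {
      /*@ loop_invariant 0 <= i && i <= N;
        @ loop_invariant (\forall* int k; 0 <= k && k < N; Perm(a[k], read));
        @ loop_invariant (\forall* int k; 0 <= k && k < N; Perm(b[k], read));
        @ loop_invariant (\forall* int k; 0 <= k && k < N; Perm(c[k], write));
        @ loop_invariant (\forall int k; 0 <= k && k < i; c[k] == a[k] + b[k]);
        @*/
      for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
      }
    }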

3 Syntax and semantics of deterministic parallelism

As mentioned before, we define our verification technique over an intermediate representation language that captures precisely the main features of deterministic parallelism. This section presents the abstract syntax and semantics of PPL, our Parallel Programming Language. In Sect. 4, we show how an important fragment of OpenMP can be encoded into this intermediate representation language.

3.1 Syntax

Figure 8 presents the PPL syntax. The basic building block of a PPL program is a block. Each block has a single entry point and a single exit point. Blocks are composed using three binary composition operators:
  • parallel composition ||;
  • fusion composition \(\oplus \); and
  • sequential composition.
The entry block of the program is the outermost block. Basic blocks are:
  • a parallel block \(\textsf {Par}\) (\({\textsf {N}} \)) \({\textsf {S}}\);
  • a vectorised block \(\textsf {Vec}\) (\({\textsf {N}} \)) \({\textsf {V}}\); and
  • a sequential block \({\textsf {S}}\),
where \({\textsf {N}} \) is a positive integer variable that denotes the number of parallel threads, i.e. the block’s parallelisation level, \({\textsf {S}}\) is a sequence of statements and \({\textsf {V}}\) is a sequence of guarded assignments \(b \Rightarrow \textsf {assg} \).
In the grammar, we define a vectorised block at a different level than the other basic blocks, because this allows us to define the semantics in a more convenient way, while it does not prevent us from writing programs such as the parallel or fusion composition of a parallel and a vectorised block.
We assume a restricted syntax for fusion composition such that its operands are parallel basic blocks with the same parallelisation levels. This is checked by an extra well-formedness condition over PPL programs. Each basic block has a local read-only variable \(\textsf {tid} \in \mathsf {[0..{\textsf {N}})} \) called thread identifier, where \({\textsf {N}} \) is the block’s parallelisation level. We (ab)use the term iteration to refer to the computations of a single thread in a basic block. So a parallel or vectorised block with parallelisation level \({\textsf {N}}\) has \({\textsf {N}}\) iterations. For simplicity, but without loss of generality, threads have access to a single shared array which we refer to as heap. We assume all memory locations in the heap are allocated initially. A thread may update its local variables by performing a local computation (\(v\,{:}{=}\,e\)), or by reading from the heap (\(v\,{:}{=}\,\textsf {mem}(e)\)). A thread may update the heap by writing the value of one of its local variables to it (\(\textsf {mem}(e){:}{=}\,v\)). For the arrays, we use notation a[e] as syntactic sugar for [a+e] where a is a variable containing the base address of the array a and e is the subscript expression.
Example 6
Figure 9, lines 1 and 2, contains a PPL expression that captures the program in lines 4–13. In this example, the two basic blocks are composed using (||). Figure 10 shows another example of a PPL expression and its corresponding OpenMP program, where the basic parallel and vectorised blocks are composed sequentially (lines 1–3). Note that \(\mathsf {tid}_1\) refers to the thread identifier of the parallel block, while \(\mathsf {tid}_2\) refers to the thread identifier of the vectorised block.
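As the figures are not reproduced here, the following expression (with assumed block bodies) illustrates the first notation: two independent element-wise updates composed in parallel, each thread writing one heap location:
$$\begin{aligned} \big (\textsf {Par} ({\textsf {N}})\ \mathsf {mem}(a + \mathsf {tid}_1)\,{:}{=}\,v\big ) \;\Vert \; \big (\textsf {Par} ({\textsf {N}})\ \mathsf {mem}(b + \mathsf {tid}_2)\,{:}{=}\,w\big ) \end{aligned}$$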

3.2 Semantics

The behaviour of PPL programs is described using a small-step operational semantics. For a convenient and understandable definition, the operational semantics is defined in several layers, as described below. Throughout, we assume the existence of the finite domains:
  • \(\textsf {VarName}\), the set of variable names,
  • \(\textsf {Val}\), the set of all values, which includes the memory locations,
  • \(\textsf {Loc}\), the set of memory locations, and
  • \(\mathsf {[0..{\textsf {N}})}\) for thread identifiers.
We write \({\textsf {S}}_1 \cdot {\textsf {S}}_2\) for the concatenation of two statement sequences \({\textsf {S}}_1\) and \({\textsf {S}}_2\).
Program State To define the program state, we use the following definitions.
We model the program state as a triple \(({\textsf {EB}}, \gamma , h)\) of block state, program store and heap, and a thread state as a pair \((\textsf {LS},h)\) of local state and heap. The program store is constant within a block and contains all global variables (e.g. the initial addresses of arrays).
BlockState We distinguish various kinds of block states: an initial state \(\textsf {Init}\), composite block states \(\textsf {ParC}\) and \(\textsf {SeqC}\), a state in which a parallel basic block should be executed \(\textsf {Par}\), a local state \(\textsf {Local} \) in which a vectorised or a sequential basic block should be executed, and a terminated block state \(\textsf {Done}\).
The \(\textsf {Init}\) state consists of a block statement \({\mathcal {P}}\). The \(\textsf {ParC}\) state consists of two block states, while the \(\textsf {SeqC}\) state contains a block state and a block statement \({\mathcal {P}}\); they capture all the states that a parallel composition and a sequential composition of two blocks might be in, respectively. The basic block state \(\textsf {Par}\) captures all the states that a parallel basic block \(\textsf {Par}\) (\({\textsf {N}}\)\({\textsf {S}}\) might be in during its execution. It contains a mapping \(\mathbb {LS} \in \mathsf {[0..{\textsf {N}})} \rightarrow \textsf {LocalState} \), which maps each thread to its local state, to model the parallel execution of the threads. There are three kinds of local states: a vectorised state \(\textsf {Vec}\), a sequential state \(\textsf {Seq}\), and a terminated sequential state \(\textsf {Done}\).
The \(\textsf {Vec}\) block state captures all states that a vectorised basic block \(\textsf {Vec}\) (\({\textsf {N}} \)\({\textsf {V}}\) might be in during its execution. It consists of \(\varSigma \in \mathsf {[0..{\textsf {N}})} \rightarrow \textsf {PrivateMem} \), which maps each thread to its private memory, the body to be executed \({\textsf {V}}\), a private memory \(\sigma \), and a statement \({\textsf {S}}\). As vectorised blocks may appear inside a sequential block, keeping \(\sigma \) and \({\textsf {S}}\) allows continuation of the sequential basic block after termination of the vectorised block. To model vectorised execution, the state contains an auxiliary set \({\textsf {E}} \subseteq \mathsf {[0..{\textsf {N}})} \) that models which threads have already executed the current instruction. Only when \({\textsf {E}}\) equals \(\mathsf {[0..{\textsf {N}})}\), the next instruction is ready to be executed. Finally, the \(\textsf {Seq}\) block state consists of private memory \(\sigma \) and a statement \({\textsf {S}}\).
To simplify our notation, each thread receives a copy of the program store as part of its private memory when it initialises. This is captured in rules Init Par and Init Seq (Fig. 11), where the local store \(\gamma \) is passed as an argument to the Seq block state.
Operational Semantics The operational semantics is defined as a transition relation between program states: \(\rightarrow _{p} \subseteq (\textsf {BlockState} \times \textsf {Store} \times \textsf {SharedMem}) \times (\textsf {BlockState} \times \textsf {Store} \times \textsf {SharedMem})\), (Fig. 11), and using an auxiliary transition relation between thread local states: \(\rightarrow _{\textit{s}} \subseteq (\textsf {LocalState} \times \textsf {SharedMem}) \times (\textsf {LocalState} \times \textsf {SharedMem})\), (Fig. 12), and then a standard transition relation: \(\rightarrow _{\textit{assg}} \subseteq (\textsf {PrivateMem} \times {\textsf {S}} \times \textsf {SharedMem}) \times (\textsf {PrivateMem} \times \textsf {SharedMem})\) to evaluate assignments (Fig. 13). The semantics of expression e and Boolean expression b over private memory \(\sigma \), written \({\mathcal {E}}\llbracket e \rrbracket _{\sigma }\) and \({\mathcal {B}}\llbracket b \rrbracket _{\sigma }\), respectively, is standard and not discussed any further. We use the standard notation for function update: given a function \(f:A \rightarrow B\), \(a\in A\), and \(b \in B\):
$$\begin{aligned} f [ a {:}{=} b ] = x \mapsto \left\{ \begin{array}{l@{,~}l} b &{} x = a \\ f(x) &{} \text{ otherwise }\end{array}\right. \end{aligned}$$
As mentioned, the main transition relation between program states is defined in Fig. 11. Program execution starts in a program state \((\textsf {Init} ({\mathcal {P}}), \gamma , h)\), where \({\mathcal {P}} \) is the program's entry block. Depending on the form of \({\mathcal {P}} \), a transition is made into an appropriate block state, leaving the heap unchanged (see rules Init ParC, Init SeqC, Init Fuse, Init Par and Init Seq).
The evaluation of a \(\textsf {ParC}\) state non-deterministically evaluates one of its block states (i.e. \(\textsf {EB} _1\) or \(\textsf {EB} _2\)), until both blocks are done (rule ParC Done).
Evaluation of a sequential block is done by evaluating the local state. The evaluation of a \(\textsf {SeqC}\) state evaluates its block state \(\textsf {EB} \) step by step. When this evaluation is done, evaluation of the subsequent block is initialised.
Rule Lift Seq captures that evaluation of a thread local state is defined in terms of the local thread execution (as defined in Fig. 12). When the local thread state is fully evaluated, this results in a terminated block state (rule Local Done).
The evaluation of a parallel basic block is defined by the rules Par Step and Par Done. To allow all possible interleavings of the threads in the block’s thread pool, each thread has its own local state \(\textsf {LS}\), which can be executed independently, modelled by the mapping \(\mathbb {LS}\). A thread in the parallel block terminates if there are no more statements to be executed and a parallel block terminates if all threads executing the block are terminated.
The evaluation of a sequential basic block's statements, as defined in Fig. 12, is standard except when it contains a vectorised basic block. A sequential basic block terminates if there is no instruction left to be executed (Seq Done). The execution of a vectorised block (defined by the rules Init Vec, Vec Step1, Vec Step2, Vec Sync and Vec Done in Fig. 12) is done in lock-step, i.e. all threads execute the same instruction and no thread can proceed to the next instruction until all threads are done, meaning that they all share the same program counter. As explained, we capture this by maintaining an auxiliary set \({\textsf {E}}\), which contains the identifiers of the threads that have already executed the current vector instruction (i.e. the guarded assignment \(b \Rightarrow \textsf {assg} \)). When a thread executes a vector instruction, its thread identifier is added to \({\textsf {E}}\) (rules Vec Step1 and Vec Step2). The semantics of a vector instruction (i.e. a guarded assignment) is the semantics of the assignment if the guard evaluates to true, and it does nothing otherwise. When all threads have executed the current vector instruction, the condition \({\textsf {E}} = \textsf {dom} (\varSigma )\) holds, and execution moves on to the next vector instruction of the block with an empty auxiliary set (rule Vec Sync). The semantics of assignments as defined in Fig. 13 is standard and does not require further discussion.
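To illustrate the role of \({\textsf {E}}\), consider a vectorised block \(\textsf {Vec}(2)\) with two guarded assignments; one possible execution (a sketch) is:
$$\begin{aligned} {\textsf {E}} = \emptyset \;\rightarrow \; {\textsf {E}} = \{1\} \;\rightarrow \; {\textsf {E}} = \{0,1\} = \textsf {dom} (\varSigma ) \;\xrightarrow {\textit{Vec Sync}}\; {\textsf {E}} = \emptyset \end{aligned}$$
where thread 1 and then thread 0 execute the first instruction, after which rule Vec Sync resets \({\textsf {E}}\) and both threads move on to the second instruction.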

4 Encoding OpenMP into PPL

In order to show that PPL indeed captures the core of deterministic parallel programming languages, this section shows how a widely used subset of OpenMP can be encoded into PPL.

4.1 Subset of OpenMP

Figure 14 defines a grammar which captures a commonly used subset of OpenMP [2]. This grammar defines the OpenMP programs that can be encoded into PPL (and thus can be verified using the verification technique presented below).
Our grammar supports the following OpenMP annotations: omp parallel, omp for, omp simd, omp for simd, omp sections, and omp single. Every program is a finite and non-empty list of Jobs enclosed by omp parallel. The body of omp for, omp simd, and omp for simd is a for-loop. The body of omp single is either a program in our OpenMP subset or a sequential code block \(\mathsf {SpecS}\). The omp sections block is a finite list of omp section sub-blocks, where the body of each omp section is either a program in our OpenMP subset or a sequential code block \(\mathsf {SpecS}\). For our translation, the relevant clauses are simdlen(M), schedule(static), and nowait; all other clauses are ignored.
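For instance, the following program (with assumed loop and section bodies) is generated by this grammar:

    #pragma omp parallel
    {
      #pragma omp for schedule(static) nowait
      for (int i = 0; i < N; i++)
        a[i] = i;
      #pragma omp sections
      {
        #pragma omp section
        { b[0] = a[0]; }
        #pragma omp section
        { b[1] = a[1]; }
      }
    }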

4.2 OpenMP to PPL encoding

This section discusses the encoding of OpenMP programs that can be derived from the grammar in Fig. 14 into PPL. The encoding algorithm is presented in Fig. 15 in a functional programming-like style.
Lines 2 to 7 of the algorithm define syntactic macros for several program patterns, to improve readability of the algorithm. Note that in the macro ParVec, \(\mathsf {tid}_1\) refers to the thread identifier of the parallel block, while \(\mathsf {tid}_2\) refers to the thread identifier of the vectorised block. The algorithm consists of two steps: a recursive translate step and a compose step. The translate step recursively encodes all Jobs into their equivalent PPL code blocks, without considering how they will be composed. The compose step then conjoins the translated code blocks to build a PPL program.
The translate step is a map, which applies the function match to the list of input jobs and returns a list of equivalent PPL code blocks. The input jobs are of the form (A, C), where A is an OpenMP annotation and C is a code block written in C. The translation returns a list of pairs of the form (P, [A]), where P is the PPL program corresponding to the C code, and [A] is the list of OpenMP annotations that is needed to decide how to combine this PPL block with the other code blocks. Notice that the resulting PPL program is not necessarily a single basic block. The function match works as follows:
  • an OpenMP for annotation for a for-loop is translated into a parallel block;
  • an OpenMP simd annotation for a for-loop is translated into a loop of vectorised statements (taking into account the simdlen(M) argument);
  • an OpenMP for simd annotation for a for-loop is translated into a parallel composition of several vectorised statements (taking into account the simdlen(M) argument);
  • an OpenMP sections annotation is translated into the parallel composition of the individual statements; and
  • an OpenMP single annotation encodes the statements in the single block recursively.
The match function uses the function sec which recursively calls match on nested parallel blocks. A sequence of sequential statements with a contract is encoded as a parallel block with a single thread. Notice that in these cases, any nested OpenMP clauses are passed on; therefore, the match function returns a pair of a PPL program and a list of OpenMP annotations.
The compose step takes as its input a list of tuples of the form (P, [A]) (the output of the translate step); it then inserts appropriate PPL composition operators between adjacent program blocks in the list, provided certain conditions hold. To properly bind tuples to the composition operators, the operators are inserted in three individual passes, one pass for each composition operator, in order of binding precedence from high to low: fusion \(\oplus \), then parallel \(\Vert \), then sequential composition.
Operator insertion is done by the function bundle (lines 40–44). In each pass, bundle consumes the input list recursively. Each recursive call takes the first two tuples of the list and inserts a composition operator if the tuples satisfy the conditions of that operator; otherwise, it moves one tuple forward and starts the same process again. Notice that ultimately the head of the list x is composed with the head of the recursive call, rather than with the second element of the list. This is sound, because the composition to be applied is determined locally, and is not affected by the compositions of the other blocks.
For each composition operator, the conditions are different. The conditions for parallel and fusion compositions are checked by the functions fusible and par_able. As explained in Sect. 2, fusion of two parallel loops \(\mathsf {L1}\) and \(\mathsf {L2}\) means that the corresponding iterations of \(\mathsf {L1}\) and \(\mathsf {L2}\) are executed by the same thread without waiting. Therefore, fusion composition is inserted between two consecutive tuples \((P_i,[A_i])\) and \((P_j,[A_j])\) if:
  • both \([A_i]\) and \([A_j]\) are single-element lists containing an omp for annotation,
  • the clauses of both annotations include schedule(static), and
  • the clauses of \([A_i]\) include nowait.
The parallel composition is inserted between any two tuples in the program where the clauses of the first tuple include a nowait. Otherwise, the sequential composition is inserted. The final outcome is a single merged tuple (P, [A]) where P is the result of the encoding and [A] can be eliminated.
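As an illustration (writing ';' here for sequential composition and abbreviating the annotation lists), a hypothetical translate output for three omp for loops, of which the first carries schedule(static) and nowait and the second schedule(static), is composed as follows:

    input:           [(P1, [for,static,nowait]), (P2, [for,static]), (P3, [for])]
    fusion pass:     [(P1 ⊕ P2, [for,static]), (P3, [for])]
    parallel pass:   unchanged (no remaining nowait)
    sequential pass: [((P1 ⊕ P2) ; P3, [for])]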

4.3 Example translations

To illustrate the encoding, we discuss the translation of two small OpenMP programs into PPL.
Example 7
To translate the OpenMP program in Fig. 1 (in Sect. 2.1), we first apply the translate function to it:
Next, applying the compose function results in the following PPL program:
Example 8
As another example, we translate the OpenMP program in Fig. 2 (in Sect. 2.1) into PPL. First we use the translate function:
Using the compose function on the list with these two pairs results in the following PPL program:

5 Verification of basic blocks

The first step of our verification technique deals with the verification of basic blocks. As mentioned above, there are three types of basic blocks: a sequential block, a vectorised block and a parallel block.
For each basic block, we specify an iteration contract, which is a contract for each thread executing in the block. Thus, for a sequential block, the iteration contract coincides with a standard block contract (as there is only one thread executing the block), while for parallel and vectorised blocks, the iteration contract specifies the behaviour of one single thread executed in parallel or in lock-step, respectively. We call this an iteration contract, as it corresponds to the specification of a single iteration of a parallelisable or vectorisable block.

5.1 Iteration contracts

An iteration contract consists of a resource contract \(\textit{rc(i)}\) and a functional contract \(\textit{fc(i)}\), where i is the block's iteration variable. The resource contract indicates the permissions to access memory locations, and the functional contract constrains the values stored in those memory locations. Both consist of a precondition and a postcondition. We use \(P(i)\) to denote the functional precondition, and \(Q(i)\) to denote the functional postcondition. In case the resource pre- and postcondition are the same, we simply write \(rc(i)\); otherwise, we distinguish them as \(\textit{rc}_{\textsf {pre}}(i)\) and \(\textit{rc}_{\textsf {post}}(i)\).
Example 9
Consider the PPL program in Example 7. An iteration contract for basic block \(\mathsf {B_1}\) would be:
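The displayed contract is not reproduced in this version; assuming \(\mathsf {B_1}\) is an element-wise loop in which iteration tid writes a[tid] using b[tid], a contract of the shape being described is:

    context Perm(a[tid], write);
    context Perm(b[tid], read);
    ensures a[tid] == b[tid] + 1;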
where the first two lines show a resource contract and the last line indicates a functional contract. Note that \(\mathsf {**}\) is the ASCII-notation for \(\mathop {\star }\).

5.2 Verification rules for basic blocks

As mentioned above, a sequential block is executed by a single thread, thus its iteration contract coincides with its block contract, and no special verification rule is needed.
Parallel basic blocks are verified by the rule ParBlock presented in Fig. 16, where \({\textsf {S}} (i)\) is the body of the \(i^{th}\) iteration of the parallel basic block. This rule states that if each single thread respects its iteration contract, then the contract for the basic block is the universal separating conjunction of the iteration contracts' preconditions and postconditions, respectively. As the threads execute completely independently, there is no permission transfer, and the resource pre- and postconditions coincide. Notice further that soundness of this rule implies that all iterations in a parallel block must be independent, because otherwise the universal separating conjunction would not be satisfiable.
For vectorised blocks, the ParBlock rule can be used in case there are no inter-iteration data dependencies. If there are inter-iteration data dependencies, we need to provide extra annotations that indicate how permissions are transferred inside the vectorised block. In a vectorised block, all threads implicitly synchronise between every instruction. During such a synchronisation, permissions may be transferred from the iteration containing the source of a dependence to the iteration containing the sink of that dependence. To specify such a transfer, we introduce \(\textsf {send}\) and \(\textsf {recv}\) ghost statements. Remember that according to the PPL grammar, the body of a vectorised block is a sequence of guarded assignments \(b \Rightarrow \textsf {assg} \). A guard \(b_s(i)\) denotes the guard of statement s in iteration i.
A \(\textsf {send}\) annotation specifies that at label \(L_s\), if a guard \(b_s(i)\) is true, the permissions and properties denoted by formula \(\phi \) are transferred to the statement labelled \(L_r\) in iteration \(i+d\), where i is the current iteration and d is the distance of the dependence. A \(\textsf {recv}\) annotation specifies that the permissions and properties denoted by formula \(\psi \) are received by the current iteration from iteration \(i - d\). These annotations always come in pairs. In practice, the information provided by either the \(\textsf {send}\) or the \(\textsf {recv}\) annotation is sufficient to infer the other. Therefore, to reduce the annotation overhead, only one of them has to be provided by the developer. However, by providing both, we make the specifications easier to understand.
Example 10
Suppose we have a basic block
$$\begin{aligned} \textsf {Vec(N)(x[tid + 1] = tid; a[tid] = x[tid] + 3;)} \end{aligned}$$
where \({\mathsf {N}} - 1 == \mathsf {x.length}\). We can verify that this block, annotated with a matching \(\textsf {send}\)/\(\textsf {recv}\) pair, respects the following iteration contract:
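The annotated block and its contract are not reproduced in this version; a plausible sketch (labels, guards and the exact permission split are assumptions for illustration, with each iteration initially holding write permission on x[tid+1] and a[tid], and iteration 0 additionally on x[0]) is:

    L1: x[tid + 1] = tid;
        /*@ if (tid < N - 1)
              send Perm(x[tid + 1], write) ** x[tid + 1] == tid to L2, 1; @*/
    L2: /*@ if (tid >= 1)
              recv Perm(x[tid], write) ** x[tid] == tid - 1 from L1, 1; @*/
        a[tid] = x[tid] + 3;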
In order to verify this example, we need a proof rule for vectorised blocks, as well as for the \(\textsf {send}\) and \(\textsf {recv}\) ghost statements.
The rule for the verification of vectorised blocks is given in Fig. 17. It is similar in spirit to the ParBlock rule, but does not require the resource pre- and postcondition to be the same.
The rules for the \(\textsf {send}\) and \(\textsf {recv}\) ghost statements are similar in spirit to the rules that are typically used for permission transfer upon lock acquiring and release (see e.g. [15]). In particular, \(\textsf {send}\) is used to give up the resources that the \(\textsf {recv}\) acquires. This is captured by the following two proof rules:
$$\begin{aligned} \{\phi \}\ L_s{:}\,\textsf {send}\ \phi \ \textsf {to}\ L_r,d\ \{\mathbf {true}\} \qquad \{\mathbf {true}\}\ L_r{:}\,\textsf {recv}\ \psi \ \textsf {from}\ L_s,d\ \{\psi \} \end{aligned}$$
(1)
Receiving permissions and properties that were not sent is unsound. Therefore, send and receive annotations have to be properly matched, meaning that:
(i) \(\textsf {send}\) and \(\textsf {recv}\) annotations always come in pairs;
(ii) if the \(\textsf {recv}\) is enabled in iteration \(j\), then d iterations earlier, the \(\textsf {send}\) should be enabled, i.e.,
$$\begin{aligned} \forall j \in [0..N) .\ b_r(j) \implies j \ge d \wedge b_s(j-d) \end{aligned}$$
(2)
(iii) the information and resources received should be implied by those sent:
$$\begin{aligned} \forall j \in [d..N) .\ \phi (j-d) \implies \psi (j) \end{aligned}$$
(3)
In other words, the rules in Eq. 1 cannot be used unless the syntactic criterion (i) and the proof obligations (ii) and (iii) hold.
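For the \(\textsf {send}\)/\(\textsf {recv}\) pair sketched in Example 10 (with distance \(d = 1\) and the assumed guards \(b_s(i) = i < N-1\) and \(b_r(i) = i \ge 1\)), obligation (ii) instantiates to \(\forall j \in [0..N).\ j \ge 1 \implies j \ge 1 \wedge j-1 < N-1\), which holds, and obligation (iii) requires \(\forall j \in [1..N).\ \phi (j-1) \implies \psi (j)\), i.e. that \(\textsf {Perm}(x[j], \textsf {write}) \mathop {\star } x[j] == j-1\), as sent by iteration \(j-1\), implies the formula received in iteration j.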

5.3 Soundness

This section discusses the soundness of the proof rules ParBlock and VecBlock above. To show soundness of these rules, we have to show that in order to prove correctness of a parallel or vectorised block, it is sufficient to reason about the body of the block, and to prove independence or inter-iteration data dependence of that body. As always, the interpretation of a Hoare triple \(\{P\}S\{Q\}\) is the following: if the precondition \(P\) holds in a state \(s\), and if execution of statement \(S\) from state \(s\) terminates in a state \(s'\), then the postcondition \(Q\) holds in this state \(s'\). As the proof rules are adapted from the proof rules for parallel and vectorised loops presented in [5], the soundness argument is also similar.
To construct the proof, we define the set of possible execution traces of atomic steps over the vectorised and parallel blocks. In addition, we also define the instrumented sequentialised execution traces for those blocks, which are the executions (1) if all iterations are executed in order and (2) such that validity of each iteration contract is checked for each separate iteration.
To prove soundness of the rule ParBlock, we show that all execution traces of this statement are equivalent to the instrumented sequentialised execution trace of the parallel block. To prove soundness of the rule VecBlock, we show that all execution traces of this statement are equivalent to the instrumented sequentialised execution trace of the vectorised block.
Functional equivalence of the two traces is shown by transforming the computations in one trace into the computations in the other trace by swapping adjacent independent execution steps.

5.3.1 Denotational semantics of blocks

To phrase the soundness proof, we prefer to use a denotational semantics for the parallel and vectorised blocks, where the semantic domain is a set of traces, seen as sequences of instructions. The denotational semantics defined in this section is equivalent to the operational semantics of Sect. 3, but the proof is omitted from the paper. We develop our formalisation for non-nested blocks with \(K\) guarded statements. We instantiate the block body for each iteration of the block; thus, we have (\(L_{i}^{j}: b_{i}^{j} \Rightarrow I_{i}^{j}\);) as the instantiation of the \(i^{th}\) instruction in the \(j^{th}\) iteration of the block. We refer to this instance of statements as \(S_{i}^{j}\).
Definition 1
The semantics of a statement instance \(\llbracket S_{i}^{j} \rrbracket \) is defined as the atomic execution of the instruction \(I_{i}^{j}\) labelled by \(L_{i}^{j}\) provided its guard condition \(b_{i}^{j}\) holds; otherwise, it behaves as a skip.
Definition 2
An execution trace c is a finite sequence \(t_1, t_2, \ldots , t_{m}\) of statement instances such that \(t_1\) is executed first, then \(t_2\) is executed and so on until the last statement \(t_m\). We write \(\epsilon \) for an empty execution trace.
To characterise the set of execution traces for parallel and vectorised blocks, we define auxiliary operators concatenation and interleaving.
First, we define two versions of concatenation, plain concatenation (\(++\)) and synchronised concatenation (\(\#\)).
Definition 3
The plain concatenation (\(++\)) operator is defined as \(C_1 \mathrel {++} C_2 = \{ c_1 \cdot c_2 \mid c_1 \in C_1 \,\wedge c_2 \in C_2 \}\).
Plain concatenation takes two sets of execution traces and creates a new set that concatenates all execution traces in the first set with all execution traces in the second set.
Definition 4
The synchronised concatenation (\(\#\)) operator inserts a barrier b between the execution traces. It is defined as \(C_1 \mathrel {\#} C_2 = \{ c_1 \cdot b \cdot c_2 \mid c_1 \in C_1 \,\wedge c_2 \in C_2 \}\).
The intuition here is that the insertion of a barrier b indicates an implicit synchronisation point. When defining the interleaving of traces, the barrier restricts what interleavings are possible.
We lift concatenation to multiple sets as follows:
$$\begin{aligned} \begin{array}{lll} \mathsf {Concat}_{i=1}^{N} C_i &= C_1 \mathrel {++} \cdots \mathrel {++} C_N \\ \mathsf {SyncConcat}_{i=1}^{N} C_i &= C_1 \mathrel {\#} \cdots \mathrel {\#} C_N \end{array} \end{aligned}$$
Next, interleaving defines how to weave several execution traces into a single execution trace. This uses a happens-before order <, in order not to violate restrictions imposed by the program semantics. This happens-before order < is defined such that it maintains program order (\(\mathsf {PO}\)), i.e. it maintains the order of statements executed by the same thread, and it also maintains synchronisation order (\(\mathsf {SO}\)), i.e. it maintains the order between a barrier and the statements preceding and following it.
To define the interleaving operator (\(\mathsf {Interleave}\)), we first define an auxiliary operator \(\mathsf {Interleave}^i\) that denotes interleaving with a fixed first statement \(s\) of thread \(i\):
$$\begin{aligned} \begin{array}{l} \mathsf {Interleave}^i(\epsilon , \cdots , \epsilon ) = \{ \epsilon \} \\ \mathsf {Interleave}^i(c_1, \cdots , c_{i-1}, \epsilon , c_{i+1}, \cdots , c_N) = \emptyset \text{, } \text{ if } \exists j \ne i.\ c_{j} \ne \epsilon \\ \mathsf {Interleave}^i(c_1,\cdots , c_{i-1}, s \cdot c_{i}{'}, c_{i+1}, \cdots , c_N) = \\ \quad \begin{array}{ll} \{ s \cdot x \mid &{} x \in \mathsf {Interleave}(c_1, \cdots , c_{i-1}, c_{i}{'}, c_{i+1}, \cdots , c_N) \,\wedge \\ &{} \not \exists s' \in x.\ s' < s \} \end{array} \end{array} \end{aligned}$$
If the complete execution trace of thread \(i\) has been interleaved, there are two possible cases. If all other threads are also done, the result is the singleton set containing the empty execution trace (the base case). If any other thread can still take a step, then this call for thread \(i\) returns an empty set of interleavings. If thread \(i\) has a non-empty execution trace to interleave, i.e. it is of the form \(s \cdot c_{i}{'}\), then we obtain all interleavings that start with \(s\), extended with the (recursive) interleaving of all other execution traces and the remainder \(c_{i}{'}\) of this execution trace. Note that this extension is only allowed if it does not violate the happens-before order <. Next we define the full interleaving operator, which considers all interleavings for all threads.
$$\begin{aligned} \begin{array}{l} \mathsf {Interleave}^{i=1..N} c_i = \\ \quad \mathsf {Interleave}(c_1, \cdots , c_N) = \\ \quad \bigcup _{i=1}^{N} \mathsf {Interleave}^i(c_1, \cdots , c_N) \end{array} \end{aligned}$$
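For example, for two threads with traces \(c_1 = s_1 \cdot s_2\) and \(c_2 = t_1\) and no synchronisation between the threads, program order is the only constraint, so \(\mathsf {Interleave}(c_1, c_2) = \{ s_1 s_2 t_1,\ s_1 t_1 s_2,\ t_1 s_1 s_2 \}\): \(s_1\) always precedes \(s_2\), while \(t_1\) may occur anywhere.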
Now we can define the denotational semantics of parallel and vectorised blocks. The semantics of a parallel block is any interleaving of all statement instances that preserves the program order \(\mathsf {PO}\). The semantics of a vectorised block is any interleaving of the synchronised concatenations of the execution traces of the individual iterations, thus with an implicit barrier added after the execution steps of each statement instance. Formally, these are defined as follows.
Definition 5
The denotational semantics of a parallel block is defined as
$$\begin{aligned} \llbracket Par(N) S \rrbracket = \mathsf {Interleave}^{j=1..N} \mathsf {Concat}^{K}_{i=1} \llbracket S_{i}^{j} \rrbracket \end{aligned}$$
Definition 6
The denotational semantics of a vectorised block is defined as
$$\begin{aligned} \llbracket Vec(N)\ S \rrbracket = \mathsf {Interleave}^{j=1..N} \mathsf {SyncConcat}^{K}_{i=1} \llbracket S_{i}^{j} \rrbracket \end{aligned}$$
Next, we define the sequentialised execution trace of a parallel and vectorised block. This is the sequential execution of all iterations in a parallel and vectorised block.
Definition 7
The sequential execution trace of a parallel and vectorised block is
$$\begin{aligned} \llbracket Par(N)\ S \rrbracket ^{Seq}&= \mathsf {Concat}^{N}_{j=1} \mathsf {Concat}^{K}_{i=1} \llbracket S_{i}^{j} \rrbracket \\ \llbracket Vec(N)\ S \rrbracket ^{Seq}&= \mathsf {Concat}^{N}_{j=1} \mathsf {SyncConcat}^{K}_{i=1} \llbracket S_{i}^{j} \rrbracket \end{aligned}$$
Finally, we define the instrumented sequentialised execution trace of a parallel and vectorised block. This is the sequential execution of all iterations, where in addition all preconditions and postconditions are checked. Below we will show that all parallel and vectorised execution traces are equivalent to this instrumented sequentialised execution trace.
Definition 8
The instrumented sequentialised execution traces of a parallel and vectorised block are
$$\begin{aligned} \llbracket Par(N)\ S \rrbracket ^{Seq}_{Spec}&= \mathsf {Concat}^{N}_{j=1} \big ( \mathsf {Assert}\, (rc(j) \mathop {\star } P(j)) \mathrel {++} \mathsf {Concat}^{K}_{i=1} \llbracket S_{i}^{j} \rrbracket \mathrel {++} \mathsf {Assert}\, (rc(j) \mathop {\star } Q(j)) \big ) \\ \llbracket Vec(N)\ S \rrbracket ^{Seq}_{Spec}&= \mathsf {Concat}^{N}_{j=1} \big ( \mathsf {Assert}\, (rc(j) \mathop {\star } P(j)) \mathrel {++} \mathsf {SyncConcat}^{K}_{i=1} \llbracket S_{i}^{j} \rrbracket \mathrel {++} \mathsf {Assert}\, (rc(j) \mathop {\star } Q(j)) \big ) \end{aligned}$$
where \(\mathsf {Assert}\) checks the pre- and postcondition before and after each iteration. If the asserted property \(\phi \) holds, \(\mathsf {Assert} \,\phi \) behaves as a skip; otherwise, it aborts (i.e. there is no execution). Note that the sequential execution trace is in happens-before order.

5.3.2 Correctness of parallel blocks

In the previous section, we defined a denotational semantics of parallel and vectorised blocks in terms of possible traces of atomic steps. In addition, we defined the instrumented sequentialised execution of parallel and vectorised blocks. Now, we argue correctness of the rules for parallel and vectorised blocks (Figs. 16 and 17).
We prove that every execution trace in \(\llbracket \)Par(N) \( S \rrbracket \) is functionally equivalent to the single execution trace \(\llbracket \)Par(N) \( S \rrbracket ^{Seq}_{Spec}\), provided all contracts hold, by showing that any execution trace can be reordered into the sequential execution order.
Theorem 1
All execution traces in \(\llbracket Par(N) \) \( S \rrbracket \) and \(\llbracket Par(N) \) \( S \rrbracket ^{Seq}_{Spec}\) are functionally equivalent, provided that all contracts hold.
Proof sketch 1
Assume that the first n steps of the given execution trace are in the same order as the sequential execution trace. Then, step \(t_{n+1}\) in the sequential execution has to be somewhere in the given sequence. Because each sequence contains the same steps and the sequential execution trace is in happens-before order, all the steps that have to happen before \(t_{n+1}\) are already included in the prefix. Hence, in the given sequence, all the steps between the end of the prefix and \(t_{n+1}\) are independent of step \(t_{n+1}\) itself. Therefore, step \(t_{n+1}\) can be swapped with all these intermediate steps. We then repeat until the whole sequence matches.
We proved that any legal execution trace of a parallel block can be reordered into the sequential one, i.e. \(\llbracket \)Par(N) \( S \rrbracket \) = \( S_0 \mathop {\star } S_1 \mathop {\star } S_2 \mathop {\star }\ldots \mathop {\star } S_N \). Now suppose that \( P_0 \mathop {\star } P_1 \mathop {\star }\ldots \mathop {\star } P_N \) holds in the initial state. Since all instructions are independent, after the execution of \( S_0 \), \( Q_0 \) holds and \( P_1 \mathop {\star } P_2 \mathop {\star }\ldots \mathop {\star } P_N \) is preserved. After the execution of \( S_1 \), \( Q_1 \) holds and \( P_2 \mathop {\star } P_3 \mathop {\star }\ldots \mathop {\star } P_N \) is preserved; moreover, \( S_1 \) does not invalidate \( Q_0 \). After the execution of \( S_2 \), \( Q_2 \) holds and \( P_3 \mathop {\star } P_4 \mathop {\star }\ldots \mathop {\star } P_N \) is preserved; in addition, \( S_2 \) does not invalidate \( Q_0 \mathop {\star } Q_1 \). Continuing in this way, \( Q_0 \mathop {\star } Q_1 \mathop {\star }\ldots \mathop {\star } Q_N \) holds in the final state of the execution trace. Therefore, we can conclude that for any legal execution trace in \(\llbracket \)Par(N) \( S \rrbracket \) starting in a state satisfying the precondition, the postcondition holds in the final state.
As a corollary of Theorem 1, we can also conclude that all executions in \(\llbracket \)Par(N) \( S \rrbracket \) are data-race-free. We can apply the same argument to vectorised blocks, but as vectorised blocks are defined in terms of \(\mathsf {SynchConcat}\), swapping past barriers is never necessary.
Theorem 2
All execution traces in \(\llbracket Vec(N) \) \( S \rrbracket \) and \(\llbracket Vec(N) \) \( S \rrbracket ^{Seq}_{Spec}\) are functionally equivalent.
Note that the sequentialised instrumented execution trace now also contains the ghost annotations and barriers between the iterations.

6 Verification of block composition

Now that we have seen how correctness of a basic block can be verified in isolation, the next step is to verify their composition. We show how this can be done on the basis of the block iteration contracts only, by proving that all the heap accesses of all iterations which are not ordered sequentially are non-conflicting (i.e. they are disjoint or they are read accesses). If this condition holds, correctness of the PPL program can be derived from the correctness of a linearised variant of the program.
We first discuss how we can verify programs where the resources in the iteration contracts are constant, i.e. the resource pre- and postconditions are always the same. Next, we sketch how to extend the approach to the case where the resource pre- and postconditions of an iteration contract differ.

6.1 Verification of block composition without resource transfers

As mentioned above, we first assume that each basic block of a program is specified by an iteration contract with constant resources \( rc(i) \) for iteration \(i\). Further, we assume that the program is globally specified by a contract \(G\) which consists of the program’s resource contract \( RC_{{\mathcal {P}}} \) and the program’s functional contract \( FC_{{\mathcal {P}}} \) with the program’s precondition \( P_{{\mathcal {P}}} \) and the program’s postcondition \( Q_{{\mathcal {P}}} \).
Let \({\mathbb {P}}\) be the set of all PPL programs and \({\mathcal {P}} \in {\mathbb {P}} \) be an arbitrary PPL program assuming that each basic block in \({\mathcal {P}}\) is identified by a unique label. We define \( {\mathbb {B}} _{\mathcal {P}} =\{b_1,b_2,\ldots ,b_n\} \), as the finite set of basic block labels of the program \({\mathcal {P}}\). For a basic block \( b \) with parallelisation level \( m \), we define a finite set of iteration labels \(\textit{I} _b = \{0^b,1^b,\ldots ,(m-1)^b\} \) where \( i^b \) indicates the \( i^{th} \) iteration of the block \( b \). Let \({\mathbb {I}} _{{\mathcal {P}}} = \bigcup _{b \in {\mathbb {B}} _{{\mathcal {P}}}} \textit{I} _b\) be the finite set of all iterations of the program \({\mathcal {P}}\).
To state our proof rule, we first define the set of all iterations that are not ordered sequentially, the incomparable iteration pairs, \( {\mathfrak {I}}_{\perp } ^{{\mathcal {P}}} \) as:
$$\begin{aligned} {\mathfrak {I}}_{\perp } ^{{\mathcal {P}}} = \{(i^{b_1},j^{b_2}) \mid i^{b_1},j^{b_2} \in {\mathbb {I}} _{{\mathcal {P}}} \wedge b_1 \ne b_2 \wedge i^{b_1} \nprec _{e} j^{b_2} \wedge j^{b_2} \nprec _{e} i^{b_1}\} \end{aligned}$$
where \( \prec _{e} \subseteq {\mathbb {I}} _{{\mathcal {P}}} \times {\mathbb {I}} _{{\mathcal {P}}} \) is the least partial order which defines an extended happens-before relation. The extension covers the iterations which happen before each other because their blocks are fused. We define \( \prec _{e} \) based on two partial orders over the program’s basic blocks: \( \prec \subseteq {\mathbb {B}} _{{\mathcal {P}}} \times {\mathbb {B}} _{{\mathcal {P}}} \) and \( \prec _{\oplus } \subseteq {\mathbb {B}} _{{\mathcal {P}}} \times {\mathbb {B}} _{{\mathcal {P}}} \). The former is the standard happens-before relation between sequentially composed blocks, and the latter is a happens-before relation w.r.t. the fusion composition \( \oplus \). They are defined by means of an auxiliary partial order generator function \({\mathcal {G}}\) such that \( \prec \, = \, {\mathcal {G}}({\mathcal {P}}, \mathbin {;}) \) (writing \(\mathbin {;}\) for sequential composition) and \( \prec _{\oplus } = {\mathcal {G}}({\mathcal {P}},\oplus ) \). We define \({\mathcal {G}}\) as follows:
$$\begin{aligned} {\mathcal {G}}({\mathcal {P}},\delta ) = {\left\{ \begin{array}{ll} \emptyset , &{} \text {if}~{\mathcal {P}}~\text {is a basic block}\\ {\mathbb {G}}, &{} \text {if}~{\mathcal {P}} = {\mathcal {P}} '\, \delta '\, {\mathcal {P}} ''~\text {with}~\delta ' \ne \delta \\ {\mathbb {G}} \cup ({\mathbb {B}} _{{\mathcal {P}} '} \times {\mathbb {B}} _{{\mathcal {P}} ''}), &{} \text {otherwise}\\ \end{array}\right. } \end{aligned}$$
where \({\mathbb {G}} = {\mathcal {G}}({\mathcal {P}} ',\delta ) \cup {\mathcal {G}}({\mathcal {P}} '',\delta )\).
The function \({\mathcal {G}}\) computes the set of all block pairs of the input program \({\mathcal {P}}\) which are related by the given composition operator \( \delta \). This computation is essentially a syntactic analysis over the input program.
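As an illustration, the following Python sketch mirrors \({\mathcal {G}}\) over a toy AST encoding of PPL programs; the operator strings ";", "||" and "+" for sequential, parallel and fusion composition are our own encoding, not the paper's concrete syntax.

```python
def blocks(p):
    """The basic block labels occurring in p (cf. B_P)."""
    if isinstance(p, str):          # a basic block is just its label
        return {p}
    _, left, right = p
    return blocks(left) | blocks(right)

def G(p, delta):
    """Block pairs ordered by the composition operator delta (cf. G)."""
    if isinstance(p, str):
        return set()
    op, left, right = p
    pairs = G(left, delta) | G(right, delta)
    if op == delta:                 # left's blocks precede right's blocks
        pairs |= {(b1, b2) for b1 in blocks(left) for b2 in blocks(right)}
    return pairs

# ((B1 + B2) || B3): only the fused pair (B1, B2) is generated for "+".
p = ("||", ("+", "B1", "B2"), "B3")
print(G(p, "+"))    # {('B1', 'B2')}
```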
Now we define the extended partial order \( \prec _{e} \) as:
$$\begin{aligned} i^{b} \prec _{e} j^{b'} \iff b \prec b' \,\vee \, (b \prec _{\oplus } b' \wedge i = j) \end{aligned}$$
This means that the iteration \( i^b \) happens-before the iteration \( j^{b'} \) if \( b \) happens-before \( b' \) (i.e. \( b \) is sequentially composed with \( b' \)) or if \( b \) is fused with \( b' \) and \( i \) and \( j \) are corresponding iterations in \( b \) and \( b' \).
We define the block level linearisation (b-linearisation for short), \(blin: {\mathbb {P}} \rightarrow {\mathbb {P}} _{blin}\), as a program transformation which substitutes all non-sequential compositions by sequential composition, where \({\mathbb {P}} _{blin}\) is the subset of \({\mathbb {P}}\) in which sequential composition is the only allowed composition operator.
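Continuing the toy encoding above, b-linearisation is a one-line recursive rewrite of the AST:

```python
def blin(p):
    """Replace every composition operator by sequential composition ";"."""
    if isinstance(p, str):
        return p
    _, left, right = p
    return (";", blin(left), blin(right))

# ((B1 + B2) || B3) becomes ((B1 ; B2) ; B3).
print(blin(("||", ("+", "B1", "B2"), "B3")))
```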
Example 11
As an example, the b-linearisation of the PPL program in Example 7 replaces both the fusion of \(\mathsf {B_1}\) and \(\mathsf {B_2}\) and the parallel composition with \(\mathsf {B_3}\) by sequential compositions, so that \(\mathsf {B_1}\), \(\mathsf {B_2}\) and \(\mathsf {B_3}\) are executed in sequence.
Figure 18 presents the rule b-linearise. In this rule, \( rc_b(i) \) and \( rc_{b'}(j) \) are the resource contracts of two different basic blocks b and \(b'\) where \(i^{b} \in \textit{I} _b\) and \(j^{b'} \in \textit{I} _{b'}\). Application of the rule results in two new proof obligations. The first ensures that all heap accesses of all incomparable iteration pairs (the iterations that may run in parallel) are non-conflicting (i.e. all block compositions in \({\mathcal {P}}\) are memory safe). This reduces the correctness proof of \({\mathcal {P}}\) to the correctness proof of its b-linearised variant \(blin ({\mathcal {P}})\) (the second proof obligation). Then, the second proof obligation is discharged in two steps: (1) proving the correctness of each basic block against its iteration contract (using the proof rules discussed above) and (2) proving the correctness of \(blin ({\mathcal {P}})\) against the program contract.

6.2 Soundness

Now we are ready to show that a PPL program with provably correct iteration contracts and a global contract that is provable in our logic (including the rule b-linearise) is indeed data race free and functionally correct w.r.t. its specifications. To show this, we prove (i) soundness of the b-linearise rule and (ii) that each verified program is free of data races.
For the soundness proof, we show that for each program execution there exists a corresponding b-linearised execution with the same functional behaviour (i.e. they end in the same terminal state if they start in the same initial state) if all independent iterations are non-conflicting. From the rule’s assumption, we know that if the precondition holds for the initial state of the b-linearised execution (which is also the initial state of the program execution), then its terminal state satisfies the postcondition. As both executions end in the same terminal state, the postcondition thus also holds for the program execution. To prove that there exists a matching b-linearised execution for each program execution, we first show that any valid program execution can be normalised w.r.t. program order and second that any normalised execution can be mapped to a b-linearised execution. To formalise this argument, we first define: an execution, an instrumented execution, and a normalised execution.
We assume that all blocks of a program, both basic and composite, have a block label, and that each statement is labelled with the label of the block to which it belongs. We also assume that there exists a total order over the block labels.
Definition 9
(Execution). An execution of a program \({\mathcal {P}}\) is a finite sequence of state transitions \(s_0 \rightarrow _{p} s_1 \rightarrow _{p} \cdots \rightarrow _{p} s_n\).
To distinguish between valid and invalid executions, we instrument our operational semantics with heap masks (memory masks). A heap mask models the access permissions to every heap location. It is defined as a map from locations to fractions, \(\pi : \textsf {Loc}\rightarrow \textsf {Frac}\), where \(\textsf {Frac}\) is the set of fractions in the interval [0, 1]: any fraction in (0, 1) is a read permission and 1 is a write permission. The instrumented semantics ensures that each transition has sufficient access permissions to the heap locations that it accesses. We first add a heap mask \(\pi \) to all block state constructors (\(\textsf {Init}\), \(\textsf {ParC}\), \(\textsf {SeqC}\) and so on) and local state constructors (\(\textsf {Vec} \), \(\textsf {Seq} \) and \(\textsf {Done} \)). Then, we extend the operational semantics rules such that in each block initialisation state with heap mask \(\pi \) an extra premise has to be discharged, which states that there are \(n \ge 2\) heap masks \(\pi _1, \ldots ,\pi _n\), one for each newly initialised state, such that \(\varSigma _{i=1}^n \pi _i \le \pi \). The heap masks are carried along by the computation and termination transitions without any extra premises; in the termination transitions, the heap masks of the terminated blocks are forgotten, as they are no longer needed after termination. As an example, Fig. 19 presents the instrumented versions of the rules Init ParC, ParC Done, rdsh, and wrsh, where \(\rightarrow _{p,i}\) and \(\rightarrow _{assg,i}\) denote the program and assignment transition relations in the instrumented semantics, respectively. If a transition cannot satisfy its premises, it blocks.
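As an illustrative model (not the formal instrumentation itself), heap masks and their splitting can be sketched in a few lines of Python; the equal split below is just one admissible choice, since the semantics only requires the child masks to sum to at most the parent mask.

```python
from fractions import Fraction

def can_read(pi, loc):
    """Any positive fraction permits reading."""
    return pi.get(loc, Fraction(0)) > 0

def can_write(pi, loc):
    """Only the full fraction 1 permits writing."""
    return pi.get(loc, Fraction(0)) == Fraction(1)

def split(pi, n):
    """Split a parent mask into n child masks summing to the parent."""
    return [{loc: f / n for loc, f in pi.items()} for _ in range(n)]

pi = {"e": Fraction(1)}                  # write permission on location e
child1, child2 = split(pi, 2)            # two children, half of e each
assert can_read(child1, "e") and not can_write(child1, "e")
```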
Definition 10
(Instrumented Execution). An instrumented execution of a program \({\mathcal {P}}\) is a finite sequence of state transitions \(s_0 \rightarrow _{p,i} s_1 \rightarrow _{p,i} \cdots \rightarrow _{p,i} s_n\), where the set of all instrumented executions of \({\mathcal {P}}\) is written as \(\mathbb {IE}_{{\mathcal {P}}}\).
Lemma 1
Assume that (1) \(\forall (i^b,j^{b'}) \in {\mathfrak {I}}_{\perp } ^{{\mathcal {P}}}.RC_{{\mathcal {P}}} \rightarrow rc_b(i) \mathop {\star }rc_{b'}(j)\) and (2) \(\forall b \in {\mathbb {B}}_{{\mathcal {P}}}.\{{\bigstar _{i\in [0..N_b)} rc_b(i)}\}{\mathcal {P}} _b \{{\bigstar _{i\in [0..N_b)} rc_{b}(i)}\}\) are valid for a program \({\mathcal {P}}\) (i.e. every basic block in \({\mathcal {P}}\) respects its iteration contract). Then for any execution E of the program \({\mathcal {P}}\), there exists a corresponding instrumented execution.
Proof sketch 2
Given an execution E, we assign heap masks to all program states that the execution E might be in. The program’s initial state is assigned a heap mask \(\pi \le 1\). Assumption (1) implies that all iterations which might run in parallel are non-conflicting, which implies that for all Init ParC transitions, there exist \(\pi _1\) and \(\pi _2\) such that \(\pi _1+\pi _2 \le \pi '\), where \(\pi '\) is the heap mask of the state in which Init ParC evaluates. In all computation transitions, the successor state receives a copy of the heap mask of its predecessor. Assumption (2) implies that all iterations of all parallel and vectorised basic blocks are non-conflicting. This implies that for an arbitrary Init Par or Init Vec transition which initialises a basic block b, there exist \(\pi _1, \ldots ,\pi _n\) such that \(\varSigma _i^n \pi _i \le \pi _b\) holds in b’s initialisation transition, and in all computation transitions of an arbitrary iteration i of the block b the premises of the rdsh and wrsh transitions are satisfiable by \(\pi _i\). \(\square \)
Lemma 2
All instrumented executions of a program \({\mathcal {P}}\) are data-race-free.
Proof sketch 3
The proof proceeds by contradiction. Assume that there exists an instrumented execution that has a data race. Then there must be two parallel threads such that one writes to and the other one reads from or writes to a shared heap location e. Because instrumented executions do not block, the premises of all transitions hold. Therefore, \(\pi _1(e)=1\) holds for the first thread, and \(\pi _2(e) > 0\) for the second thread, whether it writes or reads. Also, because the program starts with a single main thread, both threads have a common ancestor thread z such that \(\pi _x(e) + \pi _y(e) \le \pi _z(e)\), where x and y are the ancestors of the first and the second thread, respectively. A thread only gains permission from its parent; therefore \(\pi _1(e) + \pi _2(e) \le \pi _z(e)\) holds. Permission fractions are in the range [0, 1] by definition; therefore, \(\pi _1(e) + \pi _2(e) \le 1\) holds. This implies that if \(\pi _1(e)=1\), then \(\pi _2(e) \le 0\), which is a contradiction. \(\square \)
A normalised execution is an instrumented execution that respects the program order, which is defined using an auxiliary labelling function \({\mathcal {L}}: {\mathbb {T}} \rightarrow {\mathbb {B}}^{all}_{{\mathcal {P}}} \times {\mathbb {L}}\) where \({\mathbb {T}}\) is the set of all transitions, \({\mathbb {L}}\) is the set of labels \(\{I,C,T\}\), and \({\mathbb {B}}^{all}_{{\mathcal {P}}}\) is the set of block labels (including both composite and basic block labels).
$$\begin{aligned} {\mathcal {L}}(t) = {\left\{ \begin{array}{ll} (LB (t), I), &{} \text {if}~t~\text {is an initialisation transition}\\ (LB (t), C), &{} \text {if}~t~\text {is a computation transition}\\ (LB (t), T), &{} \text {if}~t~\text {is a termination transition}\\ \end{array}\right. } \end{aligned}$$
where \( LB \) returns the label of each block or statement in the program. We say that transition t with label \((b, l)\) is less than \(t'\) with label \((b',l')\) if \((b \le b') \vee (l'=T \wedge b \in LB_{sub}(b'))\), where \(LB_{sub}(b)\) returns the label set of all blocks of which b is composed.
Definition 11
(Normalised Execution). An instrumented execution labelled by \({\mathcal {L}}\) is normalised if the labels of its transitions are in non-decreasing order.
We transform an instrumented execution to a normalised one by safely commuting the transitions whose labels do not respect the program order.
Lemma 3
For each instrumented execution of a program \({\mathcal {P}}\), there exists a normalised execution such that they both end in the same terminal state.
Proof sketch 4
Given an instrumented execution \( IE = IE_1 : (s_1,t_1) : (s_2,t_2) : IE_2 \), if \({\mathcal {L}}(t_1) > {\mathcal {L}}(t_2)\), a state \(s_x\) exists such that a new instrumented execution \( IE' = IE_1 : (s_1,t_2) : (s_x,t_1) : IE_2 \) can be constructed by swapping two adjacent transitions \(t_1\) and \(t_2\). As the swap is on an instrumented execution, from Lemma 2 we know that this is data-race-free, thus any accesses of \(t_1\) and \(t_2\) to a shared heap location must be reads. Because \(t_1\) and \(t_2\) are adjacent transitions, no other write may happen in between; therefore, the swap preserves the functionality of \( IE \), yielding the same terminal state for \( IE \) and \( IE' \). Thus, the corresponding normalised execution of \( IE \) obtained by applying a finite number of such swaps yields the same terminal state as \( IE \).\(\square \)
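The reordering argument can be pictured as a bubble sort on transition labels; the sketch below assumes a total comparison on the labels and that each adjacent swap is behaviour-preserving, which Lemma 2 justifies for instrumented executions.

```python
def normalise(trace, label):
    """Reorder a trace into non-decreasing label order by adjacent swaps."""
    t = list(trace)
    swapped = True
    while swapped:
        swapped = False
        for k in range(len(t) - 1):
            if label(t[k]) > label(t[k + 1]):
                # adjacent, independent transitions may be swapped safely
                t[k], t[k + 1] = t[k + 1], t[k]
                swapped = True
    return t

# Transitions tagged (block label, phase), with phases I=0, C=1, T=2.
trace = [("b2", 1), ("b1", 0), ("b1", 1)]
print(normalise(trace, label=lambda t: t))  # [('b1', 0), ('b1', 1), ('b2', 1)]
```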
Lemma 4
For each normalised execution of a program \({\mathcal {P}}\), there exists a b-linearised execution \(blin ({\mathcal {P}})\), such that they both end in the same terminal state.
Proof sketch 5
An execution of \(blin ({\mathcal {P}})\) is constructed by applying the map \({\mathcal {M}}:\textsf {BlockState} \rightarrow \textsf {BlockState} \) to each state of a normalised execution. \({\mathcal {M}}\) is defined as:
$$\begin{aligned} {\mathcal {M}}(\textsf {Par} (\mathbb {LS} _1) \oplus \textsf {Par} (\mathbb {LS} ^{0}_2)) = \textsf {Par} (\mathbb {LS} _1 \mathrel {++} \mathbb {LS} ^{0}_2) \end{aligned}$$
(and homomorphically on all other state constructors), where \(\mathbb {LS} ^{0}_2\) is the initial mapping of thread local states of \({\mathcal {P}} _2\), \(\textsf {Par} (\mathbb {LS} _1) \oplus \textsf {Par} (\mathbb {LS} ^{0}_2)\) denotes the state of the two fused parallel blocks \(\textsf {Par} (\mathbb {LS} _1)\) and \(\textsf {Par} (\mathbb {LS} ^{0}_2)\), and \(\mathrel {++}\) is overloaded to denote the pairwise concatenation of the statements in the local states \(\mathbb {LS} _1\) and \(\mathbb {LS} ^{0}_2\) (i.e. \((\mathbb {LS} _1 \mathrel {++} \mathbb {LS} ^{0}_2)(t) = \mathbb {LS} _1(t) \mathrel {++} \mathbb {LS} ^{0}_2(t)\)). \(\square \)
Definition 12
(Validity of Hoare Triple). The Hoare triple \( \{RC_{{\mathcal {P}}} \mathop {\star }P_{{\mathcal {P}}}\} {\mathcal {P}} \{RC_{{\mathcal {P}}} \mathop {\star }Q_{{\mathcal {P}}}\}\) is valid if for any execution \( E \) (i.e. \(s_0 \rightarrow _{p} s_1 \rightarrow _{p} \cdots \rightarrow _{p} s_n\)): if \(RC_{{\mathcal {P}}} \mathop {\star }P_{{\mathcal {P}}}\) is valid in the initial state of \( E \), then \(RC_{{\mathcal {P}}} \mathop {\star }Q_{{\mathcal {P}}}\) is valid in its terminal state.
The validity of \(RC_{{\mathcal {P}}} \mathop {\star }P_{{\mathcal {P}}}\) and \(RC_{{\mathcal {P}}} \mathop {\star }Q_{{\mathcal {P}}}\) is defined by the semantics of formulas presented in Sect. 2.2.
Theorem 3
The rule b-linearise is sound.
Proof sketch 6
Assume that (1) \(\forall (i^b,j^{b'}) \in {\mathfrak {I}}_{\perp } ^{{\mathcal {P}}}. RC_{{\mathcal {P}}} \rightarrow rc_b(i) \mathop {\star }rc_{b'}(j)\) and (2) \(\{RC_{{\mathcal {P}}} \mathop {\star }P_{{\mathcal {P}}}\}\) \(blin ({\mathcal {P}}) \{RC_{{\mathcal {P}}} \mathop {\star }Q_{{\mathcal {P}}}\}\). From assumption (2) and the soundness of the program logic used to prove it [5], we conclude (3) \(\forall b \in {\mathbb {B}}_{{\mathcal {P}}}.\{{\bigstar _{i\in [0..N_b)} rc_b(i)}\} {\mathcal {P}} _b\) \(\{{\bigstar _{i\in [0..N_b)} rc_{b}(i)}\}\). Given a program \({\mathcal {P}}\), implication (3), assumption (1) and Lemma 1 imply that there exists an instrumented execution \( IE \) for \({\mathcal {P}}\). Lemmas 3 and 4 imply that there exists an execution \(E'\) of the b-linearised variant \(blin ({\mathcal {P}})\) such that both \( IE \) and \(E'\) end in the same terminal state. The initial states of both \( IE \) and \(E'\) satisfy the precondition \(RC_{{\mathcal {P}}} \mathop {\star }P_{{\mathcal {P}}}\). By assumption (2), \(RC_{{\mathcal {P}}} \mathop {\star }Q_{{\mathcal {P}}}\) holds in the terminal state of \(E'\), and hence also in the terminal state of \( IE \), as both end in the same terminal state. \(\square \)
Finally, we show that a verified program is indeed data-race-free.
Proposition 1
A verified program is data-race-free.
Proof sketch 7
Given a program \({\mathcal {P}}\), with the same reasoning steps mentioned in Theorem 3, we conclude that there exists an instrumented execution \( IE \) for \({\mathcal {P}}\). From Lemma 2 all instrumented executions are data-race-free. Thus, all executions of a verified program are data-race-free. \(\square \)

6.3 Verification of block composition with resource transfers

Next we look at how to adapt this rule in case there are intra-block dependencies; thus, the resource pre- and postconditions of individual iterations are different, and we need send/receive annotations in order to verify the blocks.
This makes the independence check more involved: instead of just checking that the resource contracts for independent iterations are non-conflicting (\(\forall (i^b,j^{b'}) \in {\mathfrak {I}}_{\perp } ^{{\mathcal {P}}}.(RC_{{\mathcal {P}}} \rightarrow rc_b(i) \mathop {\star }rc_{b'}(j))\)), we now need to check the absence of conflicts for all combinations of resource pre- and postconditions. In case there is only a single resource transfer, we can replace this condition in the rule b-linearise by the following condition:
$$\begin{aligned} \begin{array}{rl} \forall (i^b,j^{b'}) \in {\mathfrak {I}}_{\perp } ^{{\mathcal {P}}}.(RC_{{\mathcal {P}}} \rightarrow &{} \textit{rc}_{\textsf {pre},{b}}(i) \mathop {\star }\textit{rc}_{\textsf {pre},{b'}}(j) \,\wedge \\ &{} \textit{rc}_{\textsf {pre},{b}}(i) \mathop {\star }\textit{rc}_{\textsf {post},{b'}}(j) \,\wedge \\ &{} \textit{rc}_{\textsf {post},{b}}(i) \mathop {\star }\textit{rc}_{\textsf {pre},{b'}}(j) \,\wedge \\ &{} \textit{rc}_{\textsf {post},{b}}(i) \mathop {\star }\textit{rc}_{\textsf {post},{b'}}(j)) \end{array} \end{aligned}$$
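Concretely, the condition enumerates all four pre/post combinations for every incomparable pair. A small Python sketch, where `disjoint` is a hypothetical helper deciding whether two resource formulas can soundly be separating-conjoined:

```python
from itertools import product

def non_conflicting(pre_b, post_b, pre_bp, post_bp, i, j, disjoint):
    """Check all four pre/post combinations for a pair (i^b, j^b')."""
    return all(disjoint(x(i), y(j))
               for x, y in product((pre_b, post_b), (pre_bp, post_bp)))

# Toy instantiation: a resource formula is the set of locations it owns,
# and two formulas can be *-conjoined iff those sets are disjoint.
disjoint = lambda r1, r2: not (r1 & r2)
print(non_conflicting(lambda i: {i}, lambda i: {i + 1},
                      lambda j: {j + 10}, lambda j: {j + 11},
                      0, 0, disjoint))     # True
```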
This new version of the rule b-linearise is sound, because:
1.
the check guarantees that the resource precondition of iteration \(i\) is disjoint from the resource pre- and postcondition of iteration \(j\);
 
2.
the check also guarantees that the resource postcondition of iteration \(i\) is disjoint from the resource pre- and postcondition of iteration \(j\);
 
3.
the resources specified in the resource precondition of iteration \(i\) are either sent to another iteration (say \(k\)) in the same block or are part of the resource postcondition of iteration \(i\). The rule guarantees that it will also be checked that the resource pre- and postconditions of iteration \(k\) are disjoint from the resource pre- and postconditions of iteration \(j\) (because if \(i\) and \(j\) are independent, then \(k\) and \(j\) are also independent).
 
However, if multiple resource transfers happen within a block, it can happen that at an intermediate point in the block, the thread holds more permissions than it holds at the beginning and the end of the block. To address this, we need to define the intermediate maximal resource contract for an intermediate statement S as the universal separating conjunction of the iteration’s precondition, and all the resources that are received by all statements that happen-before S. Absence of conflicts is then defined as a check over all intermediate resource contracts. It is future work to define this formally.

7 Tool support

As mentioned above, our verification technique is supported by the VerCors program verifier.4 This section briefly discusses how our approach is implemented in VerCors.
VerCors is a verifier to specify and verify (concurrent and parallel) programs written in a high-level language such as (subsets of) Java, C, OpenCL, OpenMP and PVL, where PVL is VerCors’ internal language for prototyping new features. The programs are annotated with pre-/postconditions in permission-based separation logic [1, 6]. Then, VerCors encodes annotated programs via several program transformation steps into the intermediate representation language (Silver) of the Viper framework [19, 26], and then the encoded program is verified using the Viper technology (Fig. 20).
Using this approach, OpenMP programs are verified with VerCors in the following steps:
1.
Specify the OpenMP program (i.e. provide an iteration contract for each block and write the program contract for the outermost OpenMP parallel region).
 
2.
Encode the specified OpenMP program into its PPL counterpart (carrying along the original OpenMP specifications) (as discussed in Sect. 4).
 
3.
Check the PPL program against its specifications, by transforming the PPL program into a Viper program.
 
Steps 2 and 3 are fully automatic; the user only has to provide the specifications for the OpenMP program. This section provides more details about the encoding of PPL programs into Viper.

7.1 Encoding of basic blocks into Viper

To verify our iteration contracts using Viper, we encode the behaviour of the basic blocks and the \(\mathsf {send}\)/\(\mathsf {recv}\) annotations as method contracts. The idea is that every block annotated with an iteration contract is encoded by a call to a generated method whose contract encodes the application of the suitable Hoare logic rule for basic blocks, instantiated for the specific iteration contract.
We also need to verify that every iteration respects the iteration contract. This is encoded by a method, parametrised by the thread identifier, containing the basic block’s body, and specified by the iteration contract.
Within the body of the basic block there may be guarded \(\mathsf {send}\) and \(\mathsf {recv}\) statements. The guards are left untouched, but the statements themselves are replaced by calls to methods whose contracts capture the resources and functional properties being transferred.
Finally, we need to check that the proof obligations in Eqs. 2 and 3 hold.

7.2 Encoding of the b-linearise rule into Viper

Finally, for the verification of block composition, we implemented the rule b-linearise as part of the encoding into Viper. This means we implemented in VerCors:
  • a function to compute the set \({\mathfrak {I}}_{\perp } ^{{\mathcal {P}}}\), and
  • the program transformation \(blin\), resulting in a Viper program called blin_program().
This implementation basically follows the formal definition as presented above in Sect. 6.
Next, as part of the Viper encoding, we encode the first proof obligation
$$\begin{aligned} \forall (i^b,j^{b'}) \in {\mathfrak {I}}_{\perp } ^{{\mathcal {P}}}.\ RC_{{\mathcal {P}}} \rightarrow rc_b(i) \mathop {\star }rc_{b'}(j) \end{aligned}$$
as lemmas for all independent iteration pairs \((i^b,j^{b'})\), i.e. these are encoded as specifications for empty method bodies of the following form:
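Schematically, the lemma for a pair \((i^b,j^{b'})\) is an empty-bodied method specified as follows (the lemma name is illustrative, not the name VerCors actually generates):
$$\begin{aligned} \{\, RC_{{\mathcal {P}}} \,\}\ \ \mathit{lemma}_{(i^b,j^{b'})}()\ \ \{\, rc_b(i) \mathop {\star }rc_{b'}(j) \,\} \end{aligned}$$
Verifying such an empty-bodied method amounts to proving the entailment \(RC_{{\mathcal {P}}} \rightarrow rc_b(i) \mathop {\star }rc_{b'}(j)\).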
Finally, for the b-linearised program, we prove that it satisfies the global method specification.

8 Example: verification of an OpenMP program

To conclude, we show how our verification technique for PPL and the encoding of OpenMP into PPL can be used to verify OpenMP programs. As mentioned above, our approach requires the user to specify a program contract and an iteration contract for each \(\mathsf {SpecS}\) block in the OpenMP program, from which all the required PPL contracts can be obtained. We demonstrate this in detail on two of the OpenMP programs presented in Sect. 2.1, which are successfully verified by VerCors.
Figure 21 shows the required contracts for the example discussed in Fig. 1 (in Sect. 2.1). There are four specifications. The first one is the program contract attached to the outermost parallel block. The other contracts are the iteration contracts of the loops \(\mathsf {L1}\), \(\mathsf {L2}\) and \(\mathsf {L3}\), where the context keyword is used as a shorthand notation for both requiring and ensuring the same predicate, and \(\mathsf {\backslash forall*}\) denotes the universal separating conjunction \({\bigstar _{i\in I} }\). Example 7 already showed how this OpenMP program was encoded into PPL. After adding the annotations in Fig. 21 to the OpenMP program, VerCors generates the following PPL program \(\mathsf {{\mathcal {P}}}\):
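Writing \(\oplus \) for fusion and \(\parallel \) for parallel composition, the structure of the generated program is:
$$\begin{aligned} {\mathcal {P}} = (\mathsf {B_1} \oplus \mathsf {B_2}) \parallel \mathsf {B_3} \end{aligned}$$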
Program \({\mathcal {P}}\) contains three parallel basic blocks \(\mathsf {B_1}\), \(\mathsf {B_2}\) and \(\mathsf {B_3}\). The fusion of \(\mathsf {B_1}\) and \(\mathsf {B_2}\) creates a composite block, enclosed in parentheses. This composite block is then composed with the basic block \(\mathsf {B_3}\) using the parallel composition operator. The program is verified by discharging two proof obligations:
1.
prove that all heap accesses of all incomparable iteration pairs (i.e. all iteration pairs except the identical iterations of \(\mathsf {B_1}\) and \(\mathsf {B_2}\)) are non-conflicting, which implies that the fusion of \(\mathsf {B_1}\) and \(\mathsf {B_2}\) and the parallel composition of \(\mathsf {B_1} \oplus \mathsf {B_2}\) and \(\mathsf {B_3}\) are memory safe, and
 
2.
prove that each parallel basic block by itself satisfies its iteration contract, \(\forall {\mathsf {b}} \in \{1,2,3\}.\{{\bigstar _{i\in [0..L)} \mathsf {IC}_{b}(i)}\} \mathsf {B}_b\) \(\{{\bigstar _{i\in [0..L)} \mathsf {IC}_{b}(i)}\}\), and that the b-linearised variant of \({\mathcal {P}}\) satisfies the program contract, \(\{RC_{{\mathcal {P}}} \mathop {\star }P_{{\mathcal {P}}}\}\ blin({\mathcal {P}})\ \{RC_{{\mathcal {P}}} \mathop {\star }Q_{{\mathcal {P}}}\}\).
 
Figure 22 illustrates the necessary contracts for the other example in Sect. 2.1 (Fig. 2). We have implemented a slightly more general variant of PPL in our VerCors tool, which supports variable declarations and method calls. To check the first proof obligation in the tool, we quantify over pairs of blocks, which allows the number of iterations in each block to be a parameter rather than a fixed number. Our implementation successfully verified the example in 25 seconds.

9 Related work

Botinčan et al. propose proof-directed parallelisation synthesis, which takes as input a sequential program with a proof in separation logic and outputs a parallelised counterpart obtained by inserting barrier synchronisations [7, 8]. Hurlin uses a proof-rewriting method to parallelise a sequential program’s proof [16]. Compared to their work, we prove the correctness of a parallelisation by reducing the parallel proof to a b-linearised proof. Moreover, our approach allows verification of sophisticated block compositions, which enables reasoning about state-of-the-art parallel programming languages (e.g. OpenMP), while their work remains rather theoretical.
Raychev et al. use abstract interpretation to make a non-deterministic program (obtained by naive parallelisation of a sequential program) deterministic by inserting barriers [23]. Their technique over-approximates the possible program behaviours, so the resulting determinisation is governed by a set of rules that decide between feasible schedules, rather than by the behaviour of the original sequential program. Unlike them, we do not generate any parallel program. Instead, we prove that the parallelisation annotations can safely be applied and that the parallelised program is functionally correct and exhibits the same behaviour as its sequential counterpart.
Barthe et al. synthesise SIMD code given pre- and postconditions for loop kernels in C++ STL or C# BCL [3]. We alternatively enable verification of SIMD loops, by encoding them into vectorised basic blocks. Moreover, we address the parallel or sequential composition of those loops with other forms of parallelised blocks.
Dodds et al. introduce a higher-order variant of concurrent abstract predicates (CAP) to support modular verification of synchronisation constructs for deterministic parallelism [13]. While their proofs make explicit use of nested region assertions and higher-order protocols, they do not address the semantic difficulties introduced by these features. As mentioned in the paper, the reasoning is unsound in certain corner cases, which was fixed in an expanded version of their paper using iCAP [14]. Their approach relies on a powerful program logic and focuses much less on automation of the verification process.
Salamanca et al. [24] propose a run-time loop-carried dependence checker as an extension to OpenMP which helps programmers to detect hidden data dependencies in omp parallel for. Compared to them, we statically detect any violation of data dependencies without any run-time overhead and we address a larger subset of OpenMP constructs.
Bubel et al. [10] provide a formal trace semantics for data dependences and a program logic to analyse and reason about dependences in imperative programming languages. They use ghost variables to extend the program state in order to keep track of heap accesses. The authors implement their approach in the KeY verifier and show its effectiveness by experimenting on Java programs. Their approach is highly automatic for loop-free programs, but for programs containing loops, user interaction is required: users need to provide loop invariants, while we only require iteration contracts (which we believe are often easier to specify).
von Praun et al. [27] propose an abstract model to capture data dependences. The model represents these dependences as a density metric to predict the potential concurrency of programs, categorising programs into high, medium and low density. Programs with high density are good candidates for parallelism, while those with low density are not; programs with medium density require a scheduler that is aware of the algorithmic dependences. In contrast to our approach, their model abstracts from runtime aspects such as the number of threads and concurrency control, and does not prove correctness of parallelised programs. Their work can benefit from our approach to guarantee correctness after discovering dependences and parallelising the programs.

10 Conclusion and future work

We have presented the PPL language, which captures the main forms of deterministic parallel programming, and we have shown how a commonly used subset of OpenMP can be encoded into PPL. We then proposed a verification technique to reason about data race freedom and functional correctness of PPL programs. The verification technique consists of two parts: reasoning about the correctness of basic blocks, and reasoning about the composition of blocks. Finally, we have illustrated the technique by verifying the correctness of example OpenMP programs.
As future work, we plan to look into adapting annotation generation techniques to automatically generate iteration contracts, including both resource formulas and functional properties. This would lead to fully automatic verification of deterministic parallel programs. Moreover, our technique can be extended to address a larger subset of OpenMP programs by supporting more complex OpenMP patterns for scheduling iterations and omp task constructs. We also plan to identify the subset of atomic operations that can be combined with our technique, in order to allow verification of the widely used reduction operations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Footnotes
1
A data race occurs when two or more threads may access the same memory location simultaneously and at least one of the accesses is a write.
 
2
Note that this condition is independent of whether \(A_i\) is actually an omp for annotation.
 
3
Ghost statements are specification-only statements. They are not part of the program, but are used purely for verification purposes.
 
4
The tool and a list of case studies and verified examples are available at: https://github.com/utwente-fmt/vercors.
 
References
1. Amighi, A., Haack, C., Huisman, M., Hurlin, C.: Permission-based separation logic for multithreaded Java programs. LMCS 11(1) (2015)
2. Aviram, A., Ford, B.: Deterministic OpenMP for race-free parallelism. In: HotPar'11 (2011)
3. Barthe, G., Crespo, J.M., Gulwani, S., Kunz, C., Marron, M.: From relational verification to SIMD loop synthesis. In: PPoPP, pp. 123–134 (2013)
4. Berger, M.J., Aftosmis, M.J., Marshall, D.D., Murman, S.M.: Performance of a new CFD flow solver using a hybrid programming paradigm. J. Parallel Distrib. Comput. 65(4), 414–423 (2005)
5. Blom, S., Darabi, S., Huisman, M.: Verification of loop parallelisations. In: Egyed, A., Schaefer, I. (eds.) FASE, LNCS, vol. 9033, pp. 202–217. Springer (2015)
6. Bornat, R., Calcagno, C., O'Hearn, P., Parkinson, M.: Permission accounting in separation logic. In: POPL, pp. 259–270 (2005)
7. Botinčan, M., Dodds, M., Jagannathan, S.: Resource-sensitive synchronization inference by abduction. In: Field, J., Hicks, M. (eds.) Principles of Programming Languages (POPL 2012), pp. 309–322 (2012)
8. Botinčan, M., Dodds, M., Jagannathan, S.: Proof-directed parallelization synthesis by separation logic. ACM Trans. Program. Lang. Syst. 35, 1–60 (2013)
9. Boyland, J.: Checking interference with fractional permissions. In: SAS, LNCS, vol. 2694, pp. 55–72. Springer (2003)
10. Bubel, R., Hähnle, R., Heydari Tabar, A.: A program logic for dependence analysis. In: Ahrendt, W., Tapia Tarifa, S.L. (eds.) Integrated Formal Methods, pp. 83–100. Springer International Publishing, Cham (2019)
11. Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J.W., Lee, S.-H., Skadron, K.: Rodinia: a benchmark suite for heterogeneous computing. In: Workload Characterization (IISWC 2009), pp. 44–54 (2009)
12. Darabi, S., Blom, S., Huisman, M.: A verification technique for deterministic parallel programs. In: Barrett, C., Davies, M., Kahsai, T. (eds.) NASA Formal Methods (NFM), LNCS, vol. 10227, pp. 247–264 (2017)
13. Dodds, M., Jagannathan, S., Parkinson, M.J.: Modular reasoning for deterministic parallelism. In: ACM SIGPLAN Notices, pp. 259–270 (2011)
14. Dodds, M., Jagannathan, S., Parkinson, M.J., Svendsen, K., Birkedal, L.: Verifying custom synchronization constructs using higher-order separation logic. ACM Trans. Program. Lang. Syst. 38(2), 4:1–4:72 (2016)
15. Haack, C., Huisman, M., Hurlin, C.: Reasoning about Java's reentrant locks. In: Ramalingam, G. (ed.) APLAS 2008, LNCS, vol. 5356, pp. 171–187. Springer (2008)
16. Hurlin, C.: Specification and verification of multithreaded object-oriented programs with separation logic. PhD thesis, Université Nice Sophia Antipolis (2009)
17. Jin, H.-Q., Frumkin, M., Yan, J.: The OpenMP implementation of NAS parallel benchmarks and its performance (1999)
18. Leavens, G., Poll, E., Clifton, C., Cheon, Y., Ruby, C., Cok, D.R., Müller, P., Kiniry, J., Chalin, P.: JML Reference Manual. Dept. of Computer Science, Iowa State University (2007). http://www.jmlspecs.org
19. Müller, P., Schwerhoff, M., Summers, A.: Viper—a verification infrastructure for permission-based reasoning. In: VMCAI (2016)
22. Parkinson, M., Summers, A.: The relationship between separation logic and implicit dynamic frames. In: Barthe, G. (ed.) ESOP 2011, LNCS, vol. 6602, pp. 439–458. Springer (2011)
23. Raychev, V., Vechev, M., Yahav, E.: Automatic synthesis of deterministic concurrency. In: Static Analysis (SAS 2013), pp. 283–303. Springer (2013)
24. Salamanca, J., Mattos, L., Araujo, G.: Loop-carried dependence verification in OpenMP. In: International Workshop on OpenMP 2014, pp. 87–102 (2014)
25. Smans, J., Jacobs, B., Piessens, F.: Implicit dynamic frames. ACM Trans. Program. Lang. Syst. 34(1), 2:1–2:58 (2012)
27. von Praun, C., Bordawekar, R., Cascaval, C.: Modeling optimistic concurrency using quantitative dependence analysis. In: PPoPP 2008, pp. 185–196 (2008)