This chapter offers an in-depth examination of Remote Direct Memory Access (RDMA) technology, a modern networking solution that enables direct memory access between machines over a network. The text opens by introducing RDMA and its widespread adoption owing to its high-throughput, low-latency capabilities, particularly in massively parallel computer clusters such as data centres. The chapter then examines the distinctive concurrency models and weak behaviours of RDMA networks, highlighting the challenges of writing correct and efficient RDMA programs. A central focus is the development of sufficient syntactic conditions that guarantee the robustness of RDMA programs, allowing developers to reason about them under a simpler, more intuitive model such as sequential consistency. The chapter introduces two kinds of constraints: one that is restrictive but universally applicable, and another that relaxes these requirements but demands a tree-shaped network topology. These conditions are crucial for enabling useful paradigms in RDMA programming, such as the server-client model, and for modelling specific network topologies such as Local Area Networks (LANs). The text also discusses related work, providing a comprehensive overview of existing research on RDMA semantics, weak memory models, and robustness. Through detailed examples and formal models, the chapter offers a thorough analysis of RDMA semantics and practical guidelines for ensuring robustness, making it essential reading for anyone involved in high-performance computing and networking.
AI-Generated
This summary of the chapter content was generated with the help of AI.
Abstract
Remote Direct Memory Access (RDMA) is a modern technology enabling high-performance inter-node communication. Despite its widespread adoption, theoretical understanding of permissible behaviours remains limited, as RDMA follows a very weak memory model. This paper addresses the challenge of establishing sufficient conditions for RDMA robustness. We introduce a set of straightforward criteria that, when met, guarantee sequential consistency and mitigate potential issues arising from weak memory behaviours in RDMA applications. Notably, when restricted to a tree topology, these conditions become even more relaxed, significantly reducing the need for synchronisation primitives. This work provides developers with practical guidelines to ensure the reliability and correctness of their RDMA-based systems.
1 Introduction
Remote Direct Memory Access (RDMA) is a modern technology that enables a machine to have direct read/write access to the memory of another machine over a network, bypassing the operating systems on both ends. This allows such direct memory accesses (reads/writes) to be performed with far fewer CPU cycles, leading to high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters (e.g. data centres). RDMA has achieved widespread adoption as of 2018 [69], thanks to efficient implementations available at comparable cost to traditional infrastructures (e.g. TCP/IP sockets) [32], with several RDMA technologies such as InfiniBand and RDMA over Converged Ethernet (RoCE) readily available.
RDMA networks directly interact with the hardware through read (get) and write (put) operations on remote memory. As a result, programming RDMA systems is conceptually similar to shared memory systems of existing hardware architectures (e.g. Intel-x86 or ARM). A key difference, however, is that on encountering a remote operation, the CPU forwards it onto the network interface card (NIC), which subsequently handles the remote operation without further CPU involvement.
The performance gains of RDMA, as well as its wide range of implementations, have led to a surge of RDMA research [4, 27, 71, 73]. RDMA networks exhibit different degrees of concurrency, depending on whether the concurrent threads reside on different nodes (machines) over the network (inter-node concurrency) or on the same node (intra-node concurrency). To understand the behaviour of RDMA programs and their various notions of concurrency, Ambal et al. [10] recently developed rdmatso, a formal semantics of RDMA programs where each node comprises an Intel-x86 CPU and thus intra-node-inter-thread concurrency is governed by the TSO (total store ordering) model [68].
As the real power of RDMA networks is their ability to run parallel programs over different nodes, writing efficient RDMA programs hinges on utilising inter-node concurrency. However, writing such programs correctly is far from straightforward. A key challenge is that local operations (accessing the local memory of the executing node) are handled by the CPU, while remote operations (accessing remote memory on other nodes) are handled by the NIC independently and in parallel to CPU operations. Hence, operations in the same thread may not be executed in the intended (program) order, leading to surprising outcomes. As Ambal et al. [10] note, this can result in counter-intuitive behaviours even in the case of sequential programs comprising a single thread. This is in stark contrast to all previously existing concurrency models (be they of CPU architectures or programming languages), where sequential programs do behave sequentially.
The permissive nature of RDMA semantics requires developers to carefully consider potential instruction reorderings. Reasoning about concurrent programs and ensuring proper synchronisation between threads is inherently complex, even without instruction reordering. Accounting for instruction reorderings adds another layer of complexity to this challenge.
As such, we should ideally enable reasoning about RDMA programs under a simpler, more intuitive model such as sequential consistency (SC) [43], where no instruction reordering is allowed, and thus instructions in each thread always execute in order. To this end, a common approach to simplify reasoning is to ensure robustness. A program \(\textsf{P} \) is robust under a consistency model CM, if its set of possible behaviours under CM coincides with its set of behaviours under SC; i.e. \(\textsf{P} \) is robust under CM if it exhibits no non-SC behaviours. If a program is robust under CM, then we can simply reason about it under SC, without considering the complexities of CM.
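Since every SC behaviour is also allowed under a weaker model CM, the two behaviour sets coincide exactly when CM admits nothing beyond SC. This one-line check over abstract outcome sets (our own sketch, not part of the paper's formalism) captures the definition:

```python
def is_robust(behaviours_cm, behaviours_sc):
    # P is robust under CM iff CM adds no outcomes beyond SC's;
    # since SC outcomes are always included in CM's, set inclusion suffices.
    return set(behaviours_cm) <= set(behaviours_sc)
```

For example, if a program's SC outcomes are `{(0,1), (1,0), (1,1)}` and CM additionally admits `(0,0)`, the program is not robust under CM.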
Contributions. In this paper, we close this gap and simplify reasoning about RDMA programs through robustness. To simplify our presentation and not distract the reader from the RDMA complexities by the orthogonal intricacies of CPU concurrency, we first present rdmasc, a simplification of the rdmatso model of Ambal et al. [10], where intra-node concurrency follows the simpler SC model [43], while inter-node concurrency is analogous to that of rdmatso. We then identify two sets of sufficient constraints that, if satisfied, ensure the robustness of rdmasc programs. Our proposed constraints are purely syntactic, in that they do not require an understanding of the complex RDMA semantics and can be established by simply checking the syntax of the program. The first set of constraints is restrictive, but can be applied to any RDMA program. The second relaxes the requirements of the first, but requires the RDMA network to follow a tree topology. Our conditions enable a number of useful paradigms for RDMA programs such as the server-client model, which we show can be used for automatically translating existing concurrent algorithms to distributed ones over RDMA, as well as for modelling star network topologies used e.g. in Local Area Networks (LAN). Finally, we adapt our results to the rdmatso model and accordingly propose analogous syntactic and topological constraints.
Outline. In §2 we present an intuitive account of the weak RDMA semantics through examples and discuss how we ensure robustness through syntactic constraints. In §3 we present our formal rdmasc model. In §4 we establish sufficient syntactic conditions that ensure the robustness of rdmasc programs. In §5 we apply these findings to tree-shaped network topologies, offering a further streamlined set of conditions under rdmasc. We discuss related work in §6. The proofs of all theorems stated in this paper, as well as the extension of all our results to the rdmatso model, are available in the extended version [11].
2 Overview
We present an intuitive account of RDMA semantics through several examples, showing the counter-intuitive and unexpected behaviours they can exhibit due to possible instruction reorderings (§2.1). We then discuss how we can tame this complexity by introducing syntactic constraints that, if fulfilled, prohibit problematic instruction reorderings, pre-empting unexpected behaviours and thus simplifying the task of reasoning about RDMA programs for developers (§2.2).
2.1 RDMA Semantics at a Glance
Consistency (Concurrency) Models and Weak Behaviours. In the literature of shared-memory concurrent (multi-threaded) programming, the set of possible behaviours (i.e. semantics) of a concurrent program is defined via a consistency model (a.k.a. memory model or concurrency model), with a number of such models available in different domains such as hardware architectures (e.g. Intel and ARM) and programming languages (e.g. C/C++ and Java). The most well-known and intuitive consistency model is sequential consistency (SC, a.k.a. interleaving concurrency) [43], where the instructions are interleaved in program order. That is, under SC the instructions in each thread cannot be reordered. While simple, SC is too strong in that it precludes many common hardware/compiler optimisations and thus unduly hinders performance. As such, modern hardware architectures and programming languages adhere to weaker, more lenient models, admitting more behaviours than SC. In this context a program behaviour (outcome) is referred to as weak, if it is not allowed under SC. Such weak behaviours can typically be understood in terms of instruction reorderings within a thread or visibility delays (where the effects of an instruction (e.g. a write) are not observed at the same time by all threads), both of which are disallowed under SC.
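To make the SC baseline concrete, the following toy enumerator (our own illustration, not from the paper) generates all SC interleavings of the classic store-buffering litmus test and confirms that the weak outcome \(a = b = 0\) never arises under SC:

```python
from itertools import combinations

def interleavings(t1, t2):
    # All interleavings of two instruction lists preserving program order.
    n = len(t1) + len(t2)
    for picks in combinations(range(n), len(t1)):
        out, i1, i2 = [], 0, 0
        for k in range(n):
            if k in picks:
                out.append(t1[i1]); i1 += 1
            else:
                out.append(t2[i2]); i2 += 1
        yield out

def store_buffering_outcomes():
    # Thread 1: x := 1; a := y     Thread 2: y := 1; b := x
    t1 = [('W', 'x', 1), ('R', 'y', 'a')]
    t2 = [('W', 'y', 1), ('R', 'x', 'b')]
    outcomes = set()
    for seq in interleavings(t1, t2):
        mem, reg = {'x': 0, 'y': 0}, {}
        for kind, loc, arg in seq:
            if kind == 'W':
                mem[loc] = arg
            else:
                reg[arg] = mem[loc]
        outcomes.add((reg['a'], reg['b']))
    return outcomes
```

Under SC the outcomes are `{(0,1), (1,0), (1,1)}`; each read follows its own thread's write, so both reads returning 0 is impossible without reordering.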
Conceptual RDMA Model. We model concurrent RDMA programs running over a network of nodes (i.e. computers), where each node hosts zero, one, or more threads, and each thread can directly access remote memory of other nodes through its network interface card (NIC). As we discuss below, RDMA programs exhibit three sources of weak behaviours: 1. CPU weak behaviours, due to the usual interactions (and reordering) of multiple threads on a single node; 2. intra-thread weak behaviours, due to RDMA operations being reordered or delayed; and 3. inter-node weak behaviours, due to multiple nodes executing concurrently. Here we focus on the latter two sources as they are specific to RDMA programs, and discuss how such weak behaviours may be prevented.
CPU Concurrency. RDMA enacts data transfers between nodes via the NIC subsystems of the constituent nodes, which are independent from the CPU subsystems. Consequently, the RDMA technology can be combined with different CPU architectures governed by different memory models (e.g. TSO or ARM). The first validated formal model of RDMA programs, rdmatso [10], assumes that CPU concurrency is governed by the TSO model [68]. To simplify our presentation and not distract the reader from the RDMA complexities by the orthogonal intricacies of CPU concurrency, we present the simpler rdmasc model, where CPU concurrency follows the stronger SC model [43]. We generalise our results to rdmatso in the extended version [11].
Almost all weak behaviours introduced by RDMA stem from the NIC and are independent of CPU concurrency (i.e. CPU and RDMA concurrency can often be decoupled). As such, the distinction between rdmatso and rdmasc is often irrelevant, in which case we write rdma* to encompass both models. In particular, in this overview section we focus on nodes with at most one thread each, i.e. with no CPU concurrency, so all behaviours discussed below hold of both rdmasc and rdmatso (i.e. for rdma*). Note that this is merely a presentational choice we have made in this section, and our formal models, theorems, and examples in subsequent sections also account for CPU concurrency.
Litmus Test Outcome Notation. We frequently present small representative examples (known as litmus tests in the literature). In each example, the outcomes annotated with a tick (✓) are allowed by the RDMA model under discussion, while those annotated with a cross (✗) are disallowed.
Remote Direct Memory Access (RDMA). RDMA programs comprise operations that access remote memory, as well as various synchronisation operations. As such, programming RDMA networks is conceptually similar to shared memory systems. To distinguish remote (RDMA) operations from CPU ones, we refer to RDMA reads and writes as get and put operations, respectively. To distinguish local and remote memory locations, we assume nodes do not reuse location names, we write \(x^n \) for a location on a remote node \(n \), and write x for a location on the local node. A put operation is of the form \(x^n :=y\) and consists of reading from a local location y and writing to a remote location x on \(n \). Similarly, a get operation is of the form \(x :=y^n \) and consists of reading from a remote location y on \(n \) and writing to a local location x. We write \(\overline{n}\) to identify a node other than \(n \). When node \(n \) issues a remote operation to be executed on node \(\overline{n}\), we state that the operation is by \(n \) towards \(\overline{n}\).
Fig. 1.
Sequential rdma* litmus tests, where each column (separated by ||) denotes a distinct node, and the statement on the top line of each column denotes the initial values of locations.
Sequential (Single-Threaded) rdma* Behaviours. When a thread issues a get or put operation, it is handled by the NIC, in contrast to local reads and writes handled by the CPU. As such, the interaction between CPU and remote operations leads to further behaviours even within a sequential (single-threaded) program. We demonstrate this in the examples of Fig. 1, where each column represents a distinct node, numbered from left to right starting from 1. For instance, the example in Fig. 1a comprises a single thread on node 1 (the left-most column) that writes to the local location x (\(x :=1\)) and puts x towards the remote location z on node 2 (\(z^2 :=x\)).
Intuitively, when a thread \(t \) on \(n \) issues remote operations towards node \(\overline{n}\), one can view these remote operations as if being executed by a thread running in parallel to \(t \). As such, when a remote operation follows a CPU one, the order of the two operations is preserved since the parallel thread is spawned only after the CPU operation is executed. This is illustrated in Fig. 1a. By contrast, when a remote operation precedes a CPU one, the remote operation is performed by a ‘separate thread’ run in parallel to the later CPU operation in the main thread, and thus may execute before or after the CPU operation, meaning that in the latter case the execution order is not preserved. This is illustrated in Fig. 1b.
Therefore, before using the result of a get or reusing the memory location of a put, it is desirable to avoid such reorderings and to wait for the remote operation to complete. This can be done through a CPU poll operation, \(\texttt{poll} (n)\), that blocks until the earliest (in program order) remote operation towards node \(n \) has completed. This is shown in Fig. 1c, obtained from Fig. 1b by inserting a poll after the remote operation: \(\texttt{poll} (2)\) waits for \(z^2 :=x\) to complete before proceeding with \(x :=1\), and thus \(z^2 :=x\) can no longer be reordered after \(x :=1\).
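The effect of inserting the poll can be illustrated with a tiny enumerator (a hypothetical model of Figs. 1b and 1c, not the paper's formal semantics): without the poll, the NIC's local read of x races with the CPU write \(x :=1\); with the poll, program order is enforced.

```python
from itertools import permutations

# Node 1 runs: z^2 := x ; x := 1  (initially x = 0).
# 'nR' is the NIC's local read of x (first step of the put); 'cW' is x := 1.
def outcomes_for_z(with_poll):
    # A poll(2) between the two commands forces program order.
    orders = [('nR', 'cW')] if with_poll else list(permutations(['nR', 'cW']))
    results = set()
    for order in orders:
        x, z = 0, None
        for ev in order:
            if ev == 'cW':
                x = 1
            else:
                z = x   # the put transfers the current value of x to z^2
        results.add(z)
    return results
```

Without the poll, \(z \in \{0, 1\}\) (the weak outcome \(z = 1\) arises when the NIC read is delayed past the CPU write); with the poll, only \(z = 0\) remains.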
Note that each \(\texttt{poll} (n)\) waits for only one (the earliest) and not all pending remote operations towards \(n \) to complete. For instance, in Fig. 1d, \(\texttt{poll} (2)\) only blocks until the first \(z^2 :=x\) is complete, and thus \(z=1\) is once again possible.
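This per-queue-pair completion discipline can be sketched as a FIFO of pending operations, where each poll retires exactly one, earliest, operation (class and method names are our own):

```python
from collections import deque

class QueuePair:
    """Hypothetical sketch: pending remote operations towards one node."""
    def __init__(self):
        self.pending = deque()   # remote ops in program (issue) order

    def issue(self, op):
        self.pending.append(op)

    def poll(self):
        # Blocks until the EARLIEST pending operation completes;
        # later operations may still be in flight afterwards.
        return self.pending.popleft()
```

After issuing two puts, a single poll only retires the first; the second remains pending, which is why \(z = 1\) stays possible in Fig. 1d.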
Two remote operations towards different nodes are independent and can execute in either order, as illustrated in Fig. 1e. The only way to prevent this reordering is to poll the first operation before running the second.
The ordering guarantees on remote operations towards the same node are stronger and only certain reorderings are allowed. Recall that a put operation \(x^n :=y\) comprises two steps: a local read (on y) and a remote write (on \(x^n \)). Similarly, a get operation \(x :=y^n \) comprises two steps: a remote read (on \(y^n \)) and a local write (on x). Intuitively, NIC operations follow the precedence order: i) local read; ii) remote write; iii) remote read; iv) local write.
If a step with a higher precedence (e.g. a local read) is in program order before one with a lower precedence (e.g. a local write), then their order is preserved and they cannot be reordered. This is illustrated in Fig. 1g. Otherwise the order is not necessarily preserved and these steps can be reordered, as shown in Fig. 1h where an earlier local write on x can occur after the later local read.
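The stated precedence rule can be encoded as a small lookup table (a sketch of the rule as described; treating equal-precedence steps as ordered is our own assumption):

```python
# Precedence of NIC steps on the same queue pair, per the text:
# local read > remote write > remote read > local write (0 = strongest).
PRECEDENCE = {'nlR': 0, 'nrW': 1, 'nrR': 2, 'nlW': 3}

def order_preserved(first, second):
    # Order is preserved when the earlier step's precedence is at least
    # as strong; otherwise the NIC may reorder the two steps.
    return PRECEDENCE[first] <= PRECEDENCE[second]
```

For instance, an earlier NIC local write before a later NIC local read (as in Fig. 1h) is not preserved and may be reordered, which is what the remote fence in Fig. 1i prevents.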
As before, the reordering of the two remote operations in Fig. 1h can be prevented by polling the first operation before the second. However, polling is costly as it blocks the current thread, including the submission of remote operations towards any node. Alternatively, we can use a remote fence, \(\texttt{rfence} (n)\), that blocks only the NIC and only towards node \(n \). This in turn ensures that earlier (before the fence) remote operations by the thread towards \(n \) are executed before later (after the fence) remote operations towards \(n \). This is illustrated in Fig. 1i, obtained from Fig. 1h by inserting \(\texttt{rfence} (2)\) stopping the reordering.
Fig. 2.
Concurrent rdma* litmus tests.
Concurrent (Multi-Threaded) rdma* Behaviours. The real power of RDMA comes from programs running on different nodes, introducing a wide range of weak behaviours. A network can comprise several nodes, each running several concurrent threads. We limit the examples of Fig. 2 to two nodes, each having a single thread.
As shown in Figs. 2a and 2b, well-known weak behaviours such as store buffering (Fig. 2a) and load buffering (Fig. 2b) are possible. This is because earlier RDMA operations can be delayed after later CPU operations.
As one could expect, most weak behaviours can be prevented by polling the remote operations as needed, as shown for load buffering in Fig. 2c. However, this strategy is not enough to prevent the store buffering weak behaviour, as shown in Fig. 2d. This is because the specification of polling offers different guarantees for get and put operations. Polling a get operation \(a :=x^n \) offers the strong intuitive guarantee that the operation completed, i.e. the value of \(x^n \) is fetched from node \(n \) and written to a. By contrast, polling a put operation \(x^n :=a\) does not guarantee the write on \(x^n \) has completed. When sending the value of a towards node \(n \) to be put in \(x^n \), the remote NIC merely acknowledges having received the data, but this data may still reside in a buffer (i.e. the PCIe fabric) of the remote node, pending to be written to \(x^n \). Polling a put operation only awaits the acknowledgement of the data receipt. As such, it is possible to poll a put operation successfully before the associated remote write has fully completed. In the case of store buffering in Fig. 2d, it is possible for both poll operations to complete before the values of x and y are updated (to 1) in memory.
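The asymmetry between polling gets and puts can be sketched as follows (a toy model with hypothetical names; `poll_put` succeeding before the commit mirrors the acknowledgement-only guarantee described above):

```python
class PutOp:
    """Hypothetical sketch of a put in flight towards a remote node."""
    def __init__(self, loc, val):
        self.loc, self.val = loc, val
        self.acked = False       # remote NIC has received the data
        self.committed = False   # data actually written to remote memory

def remote_ack(op):
    op.acked = True              # data now sits in the remote PCIe buffer

def commit(op, memory):
    memory[op.loc] = op.val      # buffered write finally reaches memory
    op.committed = True

def poll_put(op):
    # Succeeds as soon as the put is acknowledged, even if uncommitted.
    return op.acked
```

A put can thus be polled successfully while its write is still buffered, which is exactly why polling alone cannot rule out the store-buffering outcome of Fig. 2d.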
We also assume NICs are connected to memory through the Peripheral Component Interconnect Express (PCIe) fabric, the de facto standard for this category of hardware [10]. This ensures that (PCIe) reads cannot overtake (PCIe) writes. As such, a remote read flushes (commits) all pending remote writes to memory, and similarly for local memory. This can be used to prevent weak behaviours such as store buffering, as shown in Fig. 2e, obtained from Fig. 2d by adding additional gets and subsequently polling them. Polling a (seemingly unrelated) later get (e.g. \(c :=z^2\)) ensures previous remote writes (e.g. \(x^2 :=1\)) have been committed to the remote memory.
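The PCIe guarantee that reads cannot overtake writes, and hence that even a dummy get flushes pending remote writes, can be sketched as (hypothetical model, our own names):

```python
class RemoteMemory:
    """Hypothetical sketch of a remote node's memory behind a PCIe buffer."""
    def __init__(self):
        self.mem = {}
        self.buffer = []   # acknowledged writes not yet committed

    def nic_write(self, loc, val):
        # Ack is sent immediately; the write may linger in the buffer.
        self.buffer.append((loc, val))

    def nic_read(self, loc):
        # PCIe: reads cannot overtake writes, so a read first flushes
        # (commits) all buffered writes to memory.
        for l, v in self.buffer:
            self.mem[l] = v
        self.buffer.clear()
        return self.mem.get(loc, 0)
```

Reading any location, even an unrelated one, commits the earlier buffered write; this models why polling the dummy gets in Fig. 2e rules out store buffering.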
2.2 Robustness: Taming Weak rdma* Behaviours
Given the permissive nature of the rdma* semantics and the numerous weak behaviours it exhibits (even in the case of single-threaded programs), the task of writing correct RDMA programs is laborious. Reasoning about concurrent programs is already challenging even in the absence of weak behaviours. Accounting for potential instruction reorderings (which requires experience with rdma* semantics) introduces yet another layer of complexity for developers.
As such, we should ideally enable reasoning about RDMA programs under a simpler, more intuitive model such as SC (sequential consistency [43]). Specifically, to simplify program reasoning, a common approach is to ensure robustness. A program \(\textsf{P} \) is robust under a consistency model CM, if its set of possible behaviours under CM coincides with its set of behaviours under SC; i.e. \(\textsf{P} \) is robust if it exhibits no weak behaviours. If a program is robust, then we can reason about it as if it were executed under SC, without considering the complexities of rdma*.
To ensure robustness, we must prevent observable reorderings, i.e. those leading to weak behaviours. We can achieve this through syntactic requirements (e.g. by inserting sufficient remote fences and poll operations). A naive solution is to wait for each remote operation to fully complete before proceeding further, thereby preventing all reorderings. Unfortunately, this serialises these operations, and thus defeats the benefits of RDMA, which is designed to parallelise CPU instructions and data transfers by offloading them to the NIC. Instead, we should account for the rdma* semantics and only add restrictions when necessary, while allowing non-observable reorderings.
Fig. 3.
Examples showing that necessary restrictions depend on other threads.
Certain reorderings are observable even when considering a single thread in isolation, as in the examples of Figs. 1b, 1e, 1f, and 1h. Specifically, these examples contain data races within a single thread. Beyond robustness, these patterns should be avoided in any sensible program. However, most weak behaviours arise from the interaction of several threads. For instance, in the single-threaded example of Fig. 3a, although the two remote operations \(a :=y^2\) and \(z^3 :=1\) on node 1 may be reordered, this reordering is not observable: it does not lead to additional weak behaviours, and thus no additional constraints are necessary for robustness. By contrast, in the multi-threaded variant of Fig. 3b (with a thread on node 2), nodes 2 and 3 can exchange data and thus we can observe the weak behaviour \(a = 1\) due to this reordering. As such, to prohibit this, we must prevent the two operations on node 1 from being reordered, e.g. by polling the first operation, as shown in Fig. 3c.
As seen before, preventing reorderings can be done in different ways. In cases like Fig. 1i, a remote fence is enough. In cases like Fig. 2e, we need dummy get operations. Determining when and how to prevent reorderings is not straightforward. As illustrated in the examples of Fig. 3, it cannot be done thread-locally: one must account for the communication between other nodes and thus must take the whole program into account. This raises two questions:
How do we prevent weak behaviours through simple purely syntactic restrictions? Specifically, how can we ensure that a program has enough constraints (e.g. polls) to prevent weak behaviours, and how do we make sure that waiting for a specific remote operation (as in Fig. 3a) is unnecessary?
How do we structure RDMA programs to minimise the amount of necessary restrictions in order to maintain efficient implementations?
We set out to answer these questions in the remainder of this paper. Specifically, after defining several formal preliminaries in §3, we present a theorem in §4 stating sufficient syntactic conditions guaranteeing robustness (i.e. the absence of weak behaviours). In §5 we then build on this theorem and present a useful RDMA network topology where fewer limitations are necessary to prevent weak behaviours. Notably, following our prescribed network topology ensures that it is never necessary to poll a remote operation to prevent multi-threaded weak behaviours.
3 rdmasc: A Declarative Semantics for RDMA Programs
We present the syntax of RDMA programs (taken from [10]) in §3.1. In §3.2 we then present a formal declarative semantics for our rdmasc model. As we describe in the extended version [11], we obtain rdmasc by strengthening the rdmatso model of Ambal et al. [10] whereby we make a few simple adjustments to ensure that local (CPU) concurrency follows the SC rather than TSO model.
3.1 rdmasc: Programming Language
Nodes and Threads. We consider a system with N nodes and M threads in total across all nodes. Let \(\textsf{Node}=\{1, \ldots , N\}\) and \(\textsf{Tid}=\{1, \ldots , M\}\) denote the sets of node and thread identifiers, respectively. We use \(n \) and \(t \) to range over \(\textsf{Node}\) and \(\textsf{Tid}\), respectively. Given a node \(n \), we write \({\overline{n}}\) to range over \(\textsf{Node}{\setminus }\left\{ {n}\right\} \). Each thread \(t \in \textsf{Tid}\) is associated with a node, written \(n (t)\).
Memory Locations. Each node \(n \) has a set of locations, \(\textsf{Loc}_n \), accessible by all nodes. We define \(\textsf{Loc}\triangleq \biguplus _n \textsf{Loc}_n \) and \(\textsf{Loc}_{\overline{n}} \triangleq \textsf{Loc}\setminus \textsf{Loc}_n \). We use \(x^n,y^n,z^n,w^n \) and \(x^{\overline{n}},y^{\overline{n}},z^{\overline{n}},w^{\overline{n}}\) to range over \(\textsf{Loc}_n \) and \(\textsf{Loc}_{\overline{n}}\), respectively. When the choice of \(n \) is clear, we write \(x \) for \(x^n \) and \(\overline{x}\) for \(x^{\overline{n}}\). For clarity, we use distinct location names across nodes and write \(n (x)\) for the unique \(n \!\in \! \textsf{Node}\) where \(x \!\in \! \textsf{Loc}_n \). We assume all locations can be accessed by all threads on all nodes. However, for readability, we use a, b, c, and d for (private) locations that are only accessed by a single thread (on a single node).
Values and Expressions. We assume a set of values, \(\textsf{Val}\), with \(\mathbb {N} \subseteq \textsf{Val}\), and use \(v \) to range over \(\textsf{Val}\). We assume a language of expressions over \(\textsf{Val}\) and \(\textsf{Loc}\), and elide its exact syntax and semantics. We use \(e \) to range over expressions, and \(e ^n \) to range over expressions whose locations are all included in \(\textsf{Loc}_n \).
Sequential Commands and Programs. Sequential programs on node \(n \) are described by the \(\textsf{C}^{n}\) grammar below and include primitive commands (\(\textsf{c}^{n}\)), sequential composition (\(\textsf{C}^{n}_1; \textsf{C}^{n}_2\)), non-deterministic choice (\(\textsf{C}^{n}_1 {\displaystyle {+}}\textsf{C}^{n}_2\), executing either \(\textsf{C}^{n}_1\) or \(\textsf{C}^{n}_2\)), and non-deterministic loops (\({\textsf{C}^{n}}^{\displaystyle {*}}\), executing \(\textsf{C}^{n}\) any number of times). A (concurrent) program, \(\textsf{P}\), is a map from thread identifiers to commands, associating each thread \(t \!\in \!\textsf{Tid}\) with a command on node \({n (t)}\).
Primitive commands include CPU (\(\textsf{cc}^{n}\)) and RDMA (\(\textsf{rc}^{n}\)) operations. A CPU operation on \(n \) may be a no-op (\(\texttt{skip}\)), an assignment to a local location (\(x:=e \)), an assumption on the value of a local location (\(\texttt{assume} (x=v)\) and \(\texttt{assume} (x\ne v) \)), an atomic CAS (‘compare-and-set’) operation (\(x:=\texttt{CAS} (y,e _1,e _2) \)), or a ‘poll’, \(\texttt{poll} {(\overline{n})}\), that awaits the completion notification of the earliest put/get that is pending (not yet acknowledged). An RDMA operation may be (i) a ‘get ’, \(x:={\overline{y}}\), reading from remote location \(\overline{y}\) and writing the result to local location \(x\); (ii) a ‘put ’, \({\overline{y}} :=x\), reading from local location \(x\) and writing the result to remote location \(\overline{y}\); or (iii) a ‘remote fence’, \(\texttt{rfence} (\overline{n})\), which ensures that all later (in program order) RDMA operations towards \(\overline{n}\) will await the completion of all earlier RDMA operations towards \(\overline{n}\). \(\texttt{poll} {(\overline{n})}\) is executed by the CPU and blocks its thread (and prevents the requests of later remote operations), while \(\texttt{rfence} (\overline{n})\) blocks the NIC for the execution of remote operations towards \(\overline{n}\).
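For illustration, the command grammar can be transcribed as a small AST (a sketch in Python with our own constructor names; the paper itself gives only the grammar):

```python
from dataclasses import dataclass

# Hypothetical AST sketch of the C^n grammar: primitive CPU/RDMA commands,
# sequencing (C1; C2), choice (C1 + C2), and loops (C*).
@dataclass
class Skip:              # skip
    pass

@dataclass
class Assign:            # x := e
    loc: str
    expr: str

@dataclass
class Put:               # x^nbar := y
    remote: str
    local: str

@dataclass
class Get:               # x := y^nbar
    local: str
    remote: str

@dataclass
class Poll:              # poll(nbar)
    node: int

@dataclass
class RFence:            # rfence(nbar)
    node: int

@dataclass
class Seq:               # C1; C2
    first: object
    second: object

@dataclass
class Choice:            # C1 + C2
    left: object
    right: object

@dataclass
class Loop:              # C*
    body: object
```

For example, the program of Fig. 1c (put; poll; local write) is `Seq(Put('z2', 'x'), Seq(Poll(2), Assign('x', '1')))`.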
3.2 rdmasc: Declarative Semantics
Events and Executions. In the literature of declarative models, the traces of a program are commonly represented as a set of executions, where an execution is a graph comprising: i) a set of events (graph nodes); and ii) a number of relations on events (graph edges). Each event is associated with the execution of a primitive command (in \(\textsf{PComm}\)) and is a tuple \((\iota ,t,l)\), where \(\iota \) is the (unique) event identifier, \(t \!\in \! \textsf{Tid}\) identifies the executing thread, and \(l\!\in \! \textsf{ELab}\) is the event label, defined below.
Definition 1
(Labels and events). An event, \(\textsf{e}\in \textsf{Event}\), is a triple \((\iota ,t,l)\), where \(\iota \in \mathbb {N}\), \(t \in \textsf{Tid}\) and \(l\in \textsf{ELab}_{n (t)}\). The set of event labels is \(\textsf{ELab}\triangleq \bigcup _{n} \textsf{ELab}_n \) for all nodes \(n \). An event label of \(n \), \(l\in \textsf{ELab}_n \), is a tuple of one of the following forms:
Each event label denotes whether the associated primitive command is handled by the NIC (left column, prefixed with \(\texttt{n}\)), or the CPU (right column). A poll instruction is handled by the CPU. A put operation \(x^{\overline{n}} :=y^n \) by node n towards node \(\overline{n}\) comprises a NIC local read from \(y^n \) and a NIC remote write on \(x^{\overline{n}}\) and is thus modelled as two events with labels \(\texttt{nlR} (y^n, v, \overline{n})\) and \(\texttt{nrW} (x^{\overline{n}}, v)\), where v denotes the value read from \(y^n \) and written to \(x^{\overline{n}}\). Similarly, a get \(x^n :=y^{\overline{n}}\) is modelled as two events with labels of the form \(\texttt{nrR} (y^{\overline{n}}, v)\) and \(\texttt{nlW} (x^n, v, \overline{n})\).
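The two-event decomposition of puts and gets can be transcribed directly (a sketch using the label names from the text; the tuple layout is our own encoding):

```python
def events_of_put(x_remote, y_local, nbar, v):
    # x^nbar := y  =>  NIC local read of y, then NIC remote write of x^nbar,
    # both transferring the same value v.
    return [('nlR', y_local, v, nbar), ('nrW', x_remote, v)]

def events_of_get(x_local, y_remote, nbar, v):
    # x := y^nbar  =>  NIC remote read of y^nbar, then NIC local write of x.
    return [('nrR', y_remote, v), ('nlW', x_local, v, nbar)]
```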
CPU operations are modelled by events as expected. A successful operation \(x :=\texttt{CAS} (y,v_1,v_2)\) is modelled by two events with labels \(\texttt{CAS} (y,v_1,v_2)\) and \(\texttt{lW} (x,v_1)\). An unsuccessful \(x :=\texttt{CAS} (y,v_1,v_2)\) operation is modelled by a CPU read instead: \(\texttt{lR} (y,v)\) and \(\texttt{lW} (x,v)\), with \(v \ne v_1\).
We write \(\textsf{type}(l)\), \(\texttt{loc} (l)\), \(v_{\text {r}} (l)\), \(v_{\text {w}} (l)\), and \(\overline{n}(l)\) for the type (e.g. \(\texttt{lR}\)), location, read value, write value, and remote node of \(l\), where applicable; e.g. \(\texttt{loc} (\texttt{nlR} (x^n,v_{\text {r}},\overline{n})) \!=\! x^n \) and \(\overline{n}(\texttt{nlR} (x^n,v_{\text {r}},\overline{n})) \!=\! \overline{n}\). We lift these functions to events as expected. We write \(\iota (\textsf{e})\), \(t (\textsf{e})\), \(l(\textsf{e})\) to project the corresponding components of an event \(\textsf{e}= (\iota ,t,l)\), and write \(n (\textsf{e})\) for the node \(n (t (\textsf{e}))\) of an event.
Queue Pairs. As mentioned in §2 (see Fig. 1e), two remote operations by the same thread towards different remote nodes can be reordered. When using RDMA, each thread establishes a communication channel, called a queue pair, towards each remote node. The intuition is that operations on different queue pairs are independent and can always be reordered. Different threads, even on the same node, create different queue pairs to connect to the same remote node.
Notation. Given a relation r and a set A, we write \(r^+\) for the transitive closure of r; \(r^{-1}\) for the inverse of r; \(r|_{A}\) for \(r \cap (A \times A)\); and [A] for the identity relation on A, i.e. \(\left\{ {(a, a) \mid a \in A}\right\} \). We write \(r_1; r_2\) for their relational composition: \(\{(a, b) \mid \exists c.\, (a, c) \in r_1 \wedge (c, b) \in r_2\}\). When r is a strict partial order, we write \({r}{|_{\text {imm}}}\) for the immediate edges in r, i.e. \(r \setminus (r;r)\). Given a set of events \(E\) and a location \(x\), we write \(E_x\) for \(\left\{ {\textsf{e}\in E\mid \texttt{loc} (\textsf{e}) \!=\! x}\right\} \). Given a set of events \(E\) and a label type \(\texttt{X} \), we write \(E.\texttt{X} \) for \(\left\{ {\textsf{e}\in E\; | \;\textsf{type}(\textsf{e}) =\texttt{X}}\right\} \), and define its sets of reads as \(E.\mathcal {R} \triangleq E.\texttt{lR} \cup E.\texttt{CAS} \cup E.\texttt{nlR} \cup E.\texttt{nrR} \), writes as \(E.\mathcal {W} \triangleq E.\texttt{lW} \cup E.\texttt{CAS} \cup E.\texttt{nlW} \cup E.\texttt{nrW} \), CPU events as \(E^{\text {cpu}}\triangleq E.\texttt{lW} \cup E.\texttt{lR} \cup E.\texttt{CAS} \cup E.\texttt{P} \), and NIC writes as \(E.\texttt{nW} \triangleq E.\texttt{nlW} \cup E.\texttt{nrW} \). We define the ‘same-location’ relation as \(\texttt{sloc} \!\triangleq \! \left\{ {(\textsf{e},\textsf{e}') \!\in \! \textsf{Event}^2 \!\; | \;\texttt{loc} (\textsf{e})\!=\!\texttt{loc} (\textsf{e}')}\right\} \); the ‘same-thread’ relation as \(\texttt{sthd} \!\triangleq \! \left\{ {(\textsf{e},\textsf{e}')\!\in \!\textsf{Event}^2 \!\; | \;t (\textsf{e})\!=\!t (\textsf{e}')}\right\} \); and the ‘same-queue-pair’ relation as \(\texttt{sqp} \!\triangleq \! \{(\textsf{e},\textsf{e}') \!\in \!\textsf{Event}^2 \!\; | \;t (\textsf{e})\!=\! t (\textsf{e}') \wedge \overline{n}(\textsf{e})\!=\!\overline{n}(\textsf{e}') \}\). We use \(\texttt{sqp} \) for events on the same queue pair, i.e. by the same thread and towards the same remote node. Note that \(\texttt{sqp} \subseteq \texttt{sthd} \) and that \(\texttt{sloc} \), \(\texttt{sthd} \), and \(\texttt{sqp} \) are all symmetric. For a set of events \(E\), we write \(E.\texttt{sloc} \) for \(\texttt{sloc} |_{E}\); similarly for \(E.\texttt{sthd} \) and \(E.\texttt{sqp} \).
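The relational notation above can be spelled out as plain operations on finite sets of pairs; the following is an illustrative sketch (function names are ours).

```python
# Relational operators over explicit pair sets: inverse, restriction,
# identity, composition, transitive closure, and immediate edges.
def inverse(r):
    return {(b, a) for (a, b) in r}

def restrict(r, A):
    # r|_A = r ∩ (A × A)
    return {(a, b) for (a, b) in r if a in A and b in A}

def identity(A):
    # [A] = {(a, a) | a ∈ A}
    return {(a, a) for a in A}

def compose(r1, r2):
    # r1 ; r2 = {(a, b) | ∃c. (a, c) ∈ r1 ∧ (c, b) ∈ r2}
    return {(a, b) for (a, c1) in r1 for (c2, b) in r2 if c1 == c2}

def transitive_closure(r):
    closure = set(r)
    while True:
        new = closure | compose(closure, closure)
        if new == closure:
            return closure
        closure = new

def immediate(r):
    # For a strict partial order r: r \ (r ; r)
    return set(r) - compose(r, r)
```

For example, the immediate edges of the transitive closure of {(1,2), (2,3)} recover exactly {(1,2), (2,3)}.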
Definition 2
(Pre-executions). A tuple \(\mathcal {G} = \langle E, \textsf{po}, \textsf{pf}\rangle \) is a pre-execution of a program if:
\(E\subseteq \textsf{Event}\) is the set of events and includes a set of initialisation events, \(E^0 \subseteq E\), comprising a single write with label \(\texttt{lW} (x, 0)\) for each \(x\in \textsf{Loc}\).
\(\textsf{po}\) is the ‘program order’ relation, defined as a disjoint union of strict total orders, each ordering the events of one thread, with the initialisation events in \(E^0\) ordered \(\textsf{po}\)-before all other events, and such that:
Each put (resp. get) operation corresponds to two events: a read and a write, with the read immediately preceding the write in \(\textsf{po}\): 1. if \(r \!\in \! G.\texttt{nlR} \) (resp. \(r \!\in \! G.\texttt{nrR} \)), then \((r,w) \in {\textsf{po}}{|_{\text {imm}}}\) for some \(w \!\in \! G.\texttt{nrW} \) (\(w \!\in \! G.\texttt{nlW} \)); and 2. if \(w \!\in \! G.\texttt{nrW} \) (resp. \(w \!\in \! G.\texttt{nlW} \)), then \((r,w) \in {\textsf{po}}{|_{\text {imm}}}\) for some \(r \!\in \! G.\texttt{nlR} \) (\(r \!\in \! G.\texttt{nrR} \)).
Read and write events of a put (resp. get) have matching values: if \((r,w) \in {\textsf{po}}{|_{\text {imm}}}\), \(\textsf{type}(r) \in \left\{ {\texttt{nlR},\texttt{nrR}}\right\} \), and \(\textsf{type}(w) \in \left\{ {\texttt{nlW},\texttt{nrW}}\right\} \), then \(v_{\text {r}} (r) = v_{\text {w}} (w)\).
\(\textsf{pf}\) is the ‘polls-from’ relation, relating earlier (in program-order) NIC writes to later poll operations on the same queue pair; i.e. \(\textsf{pf} \subseteq (E.\texttt{nW} \times E.\texttt{P}) \cap \textsf{po} \cap \texttt{sqp} \). Moreover, \(\textsf{pf}\) is functional on its domain (every NIC write can be polled at most once), and total and functional on its range (every poll in \(E.\texttt{P} \) polls from exactly one NIC write). Also, poll events poll from the oldest non-polled remote operation on the same queue pair: if \(w_1 \in G.\texttt{nW} \) and \((w_1, w_2) \in \textsf{po} \cap \texttt{sqp} \) with \((w_2, p_2) \in \textsf{pf}\), then there exists \(p_1\) such that \((w_1, p_1) \in \textsf{pf}\) and \((p_1, p_2) \in \textsf{po}\).
Pre-executions are constructed syntactically by induction on the structure of the corresponding program. This definition is standard and omitted.
Intuitively, a pre-execution can also be seen as a trace of the execution: for each thread \(t \), \(\textsf{po}\) restricted to \(t \) is a total order, and so a pre-execution is fundamentally a sequence of events for each thread. In this view, \(\textsf{pf}\) should be considered a well-formedness condition: each prefix of the trace needs to have at least as many remote operations as poll operations. So a pre-execution can be seen as providing a well-formed trace for each thread. We later define robustness conditions on pre-executions, and as such they can also be considered conditions on traces.
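As a sketch of this well-formedness reading of polls-from, the check below walks a single thread's trace and verifies that polls never outrun the remote operations they could poll from; the encoding of a trace is ours, and polls are counted per queue pair since polls-from relates events on the same queue pair.

```python
# Well-formedness of a thread's trace: in every prefix, each queue pair has
# at least as many completed-remote-operation events as poll operations.
def well_formed_trace(trace):
    """trace: list of (type, queue_pair) pairs, e.g. ('nrW', 2) for the
    remote write of a put towards node 2, ('nlW', 2) for the local write
    of a get from node 2, or ('P', 2) for a poll on that queue pair."""
    pending = {}  # queue pair -> number of not-yet-polled remote operations
    for (typ, qp) in trace:
        if typ in ('nrW', 'nlW'):       # a remote operation that can be polled
            pending[qp] = pending.get(qp, 0) + 1
        elif typ == 'P':                 # a poll needs something to poll from
            if pending.get(qp, 0) == 0:
                return False
            pending[qp] -= 1
    return True
```

A trace such as `[('P', 2)]` is rejected: its first prefix contains a poll with no preceding remote operation on queue pair 2.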
We next extend the notion of a pre-execution to an execution by choosing explicitly how the different events interact.
Definition 3
(Executions). A tuple \(G = \langle E, \textsf{po}, \textsf{pf}, \textsf{rf}, \textsf{mo}, \textsf{nfo}\rangle \) is an execution if:
\(\langle E, \textsf{po}, \textsf{pf}\rangle \) is a pre-execution.
\(\textsf{rf}\) is the ‘reads-from’ relation on events of the same location with matching values; i.e. \(\textsf{rf} \subseteq (E.\mathcal {W} \times E.\mathcal {R}) \cap \texttt{sloc} \) with \(v_{\text {w}} (w) = v_{\text {r}} (r)\) for all \((w,r) \in \textsf{rf}\). Moreover, \(\textsf{rf}\) is total and functional on its range: every read in \(E.\mathcal {R} \) is related to exactly one write in \(E.\mathcal {W} \).
\(\textsf{mo} \triangleq \bigcup _{x \in \textsf{Loc}} \textsf{mo}_x\) is the ‘modification-order’, where each \(\textsf{mo}_x\) is a strict total order on \(E.\mathcal {W} _x\) describing the order in which writes on \(x\) reach the memory.
\(\textsf{nfo}\) is the ‘NIC flush order’, such that for all \((a, b) \in E.\texttt{sqp} \), if \(a \in E.\texttt{nlR}, b \in E.\texttt{nlW} \), then \(a\) and \(b\) are \(\textsf{nfo}\)-ordered, and if \(a \in E.\texttt{nrR}, b \in E.\texttt{nrW} \), then \(a\) and \(b\) are \(\textsf{nfo}\)-ordered.
We define the reads-before relation as \(\textsf{rb} \triangleq \textsf{rf}^{-1}; \textsf{mo}\), relating each read r to the writes that are \(\textsf{mo}\)-after the write r reads from. Given a (pre-)execution \(G \) (resp. \(\mathcal {G} \)), we use the ‘\(G.\)’ prefix to project its various components (e.g. \(G.\textsf{rf}\)) and derived relations (e.g. \(G.\textsf{rb}\)). When the context is clear, we drop the prefix.
PCIe guarantees that a NIC local read (\(\texttt{nlR} \)) propagates all pending NIC local writes (\(\texttt{nlW} \)) processed by the same queue pair to memory, while a NIC remote read (\(\texttt{nrR} \)) propagates all pending NIC remote writes (\(\texttt{nrW} \)) processed by the same queue pair to memory. We model this total order through the \(\textsf{nfo}\) relation, stipulating that all NIC local reads and writes (resp. all NIC remote reads and writes) on the same queue pair be totally ordered.
Issue and Observation Points. In what follows we distinguish between when an instruction is issued and when it is observed. Intuitively, an instruction is issued when it is processed by the CPU or the NIC, and it is observed when its effect is propagated to memory. As such, since NIC writes can be delayed and have an observable effect on memory, the time points at which they are issued and observed may differ. Since we assume CPUs follow the strong SC memory model, CPU writes are issued and observed at the same time. However, the local (resp. remote) write of a get (resp. put) is issued when it is processed by the NIC and sent to the PCIe fabric, and observed when it is propagated to memory. All other events are instantaneous in that either they do not have an observable effect on memory (e.g. reads), or their effect is written to memory immediately (e.g. CAS operations and CPU writes). Given a set of events \(E\), we thus define the set of instantaneous events in \(E\) as \(E.\texttt{Inst} \triangleq E\setminus (E.\texttt{nlW} \cup E.\texttt{nrW})\). Intuitively, the effects of NIC local writes and NIC remote writes (labelled \(\texttt{nlW} \) and \(\texttt{nrW} \)) can be delayed in the PCIe fabric and are thus excluded from the set of instantaneous events. Note that the observation point either follows the issue point (for NIC writes), or coincides with it (for instantaneous events).
Fig. 4.
The rdmasc ordering constraints on \(\textsf{ipo}\) (left) and \(\textsf{opo}\) (right), where ✓ denotes that instructions are ordered (and cannot be reordered), ✗ denotes they are not ordered (and may be reordered), and \(\texttt{sqp} \) denotes they are ordered iff they are on the same queue pair.
We next define the ‘issue-preserved program order’, \(\textsf{ipo}\), as the subset of \(\textsf{po}\) edges (\(\textsf{ipo} \subseteq \textsf{po}\)) that must be preserved when issuing instructions. That is, if two events are \(\textsf{ipo}\)-related, then they must be issued in program order; otherwise they may be processed in either order. The left table of Fig. 4 describes which \(\textsf{po}\) edges are included in \(\textsf{ipo}\), where ✓ denotes that the two instructions are \(\textsf{ipo}\)-related (i.e. they must be issued in program order), ✗ denotes that they are not \(\textsf{ipo}\)-related (i.e. they may be issued out of order), and \(\texttt{sqp} \) denotes that they are \(\textsf{ipo}\)-related iff they are on the same queue pair. For instance, when a CPU instruction is followed by anything, they are issued in order (line A); but when a NIC instruction is followed by a CPU one, they may be reordered (cells B1–F1).
Analogously, we define the ‘observation-preserved program order’, \(\textsf{opo}\), as the subset of \(\textsf{po}\) edges (\(\textsf{opo} \subseteq \textsf{po}\)) that must be preserved when observing the effects of instructions. That is, if two events are \(\textsf{opo}\)-related, then they become observable in program order in rdmasc; otherwise they may become observable in either order. The right table of Fig. 4 describes which \(\textsf{po}\) edges are included in \(\textsf{opo}\). The two tables differ in cells C6 and E6. This is because NIC writes can be delayed, and remote fences do not guarantee propagation to memory.
Consistency. The notion of executions (Def. 3) imposes very few constraints on the \(\textsf{rf}\), \(\textsf{mo}\), and \(\textsf{nfo}\) relations. Such restrictions, and thus the permitted behaviours of a program, are determined by the set of consistent executions, defined below.
Definition 4
(rdmasc-consistency). An execution \(G \) is rdmasc-consistent iff the issued-before relation \(\textsf{ib}\) and the observed-before relation \(\textsf{ob}\), defined below, are irreflexive.
The \(\textsf{ib}\) (resp. \(\textsf{ob}\)) relation is an extension of \(\textsf{ipo}\) (resp. \(\textsf{opo}\)), describing the issue (resp. observation) order across the instructions of different threads and nodes. rdmasc-consistency requires that \(\textsf{ib}\) and \(\textsf{ob}\) be irreflexive (i.e. yield strict partial orders, as they are defined transitively).
The \(\textsf{rf}\) (resp. \(\textsf{pf}\)) component in \(\textsf{ib}\) states that if \(\textsf{e}\) reads from (resp. polls from) w, then w must have been issued before \(\textsf{e}\). Recall that \(\textsf{nfo}\) totally orders the \(\texttt{nlR}/\texttt{nlW} \) and \(\texttt{nrR}/\texttt{nrW} \) operations on the same queue pair and is thus in \(\textsf{ib}\). The \(\textsf{rf}\) component in \(\textsf{ob}\) states that if a read r reads from a write w, then the write has reached memory. This is because reads can only read the main memory and not auxiliary buffers. The \(\textsf{pf}\) component (restricted to NIC local writes) states that if p polls from a NIC local write w, then w must have left the PCIe fabric and reached the memory. Note that this is not the case for \(\texttt{nrW} \) events: polling an \(\texttt{nrW} \) event w might succeed when w is still in the remote PCIe fabric, before reaching the remote memory. The \(\textsf{nfo}\) component in \(\textsf{ob}\) can be justified as in the case of \(\textsf{ib}\). The \(\textsf{rb}\) component in \(\textsf{ob}\) ensures that a read r on \(x\) observes the latest write on \(x\) that has reached the memory. As \(\textsf{mo}\) describes the order in which the writes on each location reach the memory, it is included in \(\textsf{ob}\). Let \((\tau _i, \tau _o)\) be the issue and observation points of \(\textsf{e}\) and \((\tau '_i, \tau '_o)\) be those of \(\textsf{e}'\). The instantaneous component in \(\textsf{ob}\) ensures that if \((\textsf{e},\textsf{e}') \in \textsf{ib}\) (i.e. \(\tau _i < \tau '_i\)) and \(\textsf{e}\) is instantaneous (\(\tau _i \!=\! \tau _o\)), then \(\tau _o \!=\! \tau _i < \tau '_i \le \tau '_o\), i.e. \((\textsf{e},\textsf{e}') \in \textsf{ob}\).
4 Robustness of rdmasc Programs
In the traditional setting of CPU concurrency (where all threads execute CPU instructions), the most intuitive consistency model is sequential consistency (SC) [43]. While SC is too strong—in that disallowing all reorderings does not enable efficient implementations—it provides an intuitive and commonly understood model, making it easier for developers to reason about their programs.
Although none of the existing well-known consistency models follow SC by default, programmers typically address this difficulty by focusing on robust implementations of algorithms. Specifically, a program is robust under a weak consistency model CM if every possible behaviour of the program under CM is also an allowed behaviour under SC. In our model, this is defined as follows.
Definition 5
(SC-consistency and SC-robustness). Given an execution \(G \), its associated sequential-consistency relation is defined as \(\textsf{sc} \triangleq \textsf{po} \cup \textsf{rf} \cup \textsf{mo} \cup \textsf{rb}\). An execution \(G \) is SC-consistent iff \(\textsf{sc}\) is acyclic. A pre-execution is robust under rdmasc iff all of its rdmasc-consistent executions (Def. 4) are also SC-consistent.
Our aim here is to provide guidelines to ensure the robustness of rdmasc programs. That is, we identify a number of syntactic requirements such that if a program fulfils them, then the behaviours of the program under rdmasc coincide with its behaviours under SC; i.e. the program does not exhibit any weak behaviours brought about by observable reorderings.
There are two complementary approaches to achieve robustness. The first is to structure the program in a way that limits the very existence of problematic cases. The second is to extend the program with enough restrictions (e.g. polls and remote fences) to prohibit reorderings. In the next section (§5) we focus on the former and provide a set of explicit guidelines to avoid most problematic cases by design. In this section we focus on the latter, and describe how to identify problematic cases and how to block them. In what follows, we present the general syntactic restrictions required to forbid the reordering opportunities for specific operations (§4.1). We then propose sufficient syntactic conditions that block observable reorderings, and we prove that these conditions imply robustness (§4.2). Finally, we discuss the limitations of this approach (§4.3).
4.1 A Syntactic Approach to Enforce the Program Order
Fig. 5.
Constraints necessary to guarantee that a pair of \(\textsf{po}\)-related events in \(\mathcal {R} \cup \mathcal {W} \) will be \(\textsf{ob}\)-related for any consistent execution. CPU denotes local events in \(\texttt{lW} \cup \texttt{lR} \cup \texttt{CAS} \). The ✓ denotes that no additional constraint is needed and that the events are already guaranteed to be ordered. \(\textsf{P}\) denotes that the earlier operation must be polled before executing the later one. \(\textsf{F}\) denotes that either the earlier operation must be polled (similar to \(\textsf{P}\)) or that a remote fence must be inserted between the two operations. \(\textsf{GP}\) denotes that a get operation and its associated poll on the first queue pair must be inserted between the two operations.
One of our key results relies on enforcing the program order (i.e. blocking instruction reordering) in potentially observable cases. Recall that, given an execution, the observed-before order (\(\textsf{ob}\)) describes when an event takes effect before another. That is, for two events \(\textsf{e}_1\) and \(\textsf{e}_2\), when \((\textsf{e}_1,\textsf{e}_2) \in \textsf{ob}\) in an execution G, then they are not reordered in G. Our first aim here is to identify syntactic constraints that ensure that a specific pair of given instructions (of the same thread) are related by \(\textsf{ob}\). However, in order to define syntactic constraints for robustness, we can only rely on the syntax of the program and not on components such as \(\textsf{rf}\) or \(\textsf{mo}\). Our syntactic constraints can only rely on the pre-execution components \(\textsf{po}\) and \(\textsf{pf}\), and we cannot directly use the \(\textsf{ob}\) relation derived from a specific execution.
To this end, we first define the guaranteed-before relation, \(\textsf{gb}\), describing when two instructions in the same thread are guaranteed to remain in order (and their reordering is blocked), as shown in Fig. 5. Specifically, if two instructions are related by \(\textsf{opo}\), then they are guaranteed to be observed in that order and thus there is no need for additional restrictions; this is denoted by ✓ in cells A1–A5, B6–B9, C7–C9, D9, and E9 (cf. \(\textsf{opo}\) in Fig. 4). For most other cases (noted \(\textsf{P} \) or \(\textsf{F} \)), polling the earlier instruction enforces the ordering. Recall that polling a NIC remote write does not guarantee its completion, and we need to add a ‘dummy’ get operation and its corresponding poll to ensure ordering (noted \(\textsf{GP} \)).
In most cases, when the two operations are on the same queue pair, a remote fence is sufficient to enforce the ordering (noted \(\textsf{F} \) in D6–D8, E6–E7), and is a cheaper alternative to a poll. Perhaps surprisingly, a remote fence is not always sufficient: the two outliers are cells C6 and E8. For C6, consider the program \(z^2 :=x ; \texttt{rfence} (2) ; w^2 :=y\): the local value of y might be read before the value of z is changed. This is because \(\texttt{rfence} (2)\) (as with poll) only awaits the acknowledgement from the remote side, which does not necessarily ensure that the first put has completed. For E8, consider \(x :=z^2 ; \texttt{rfence} (2) ; y :=w^2\), where \(w^2\) can be read before x is modified: \(\texttt{rfence} (2)\) only waits for the NIC local write (\(x :=v_z\)) to be sent to the local PCIe fabric, and thus the second get operation (\(y :=w^2\)) can start earlier than one could expect.
Definition 6
(guaranteed-before). Given a pre-execution \(\mathcal {G} \), its guaranteed-before order, \(\textsf{gb}\), is defined as the order induced on \(\textsf{po}\)-related pairs by the constraints of Fig. 5. Given an execution \(G \), we write \(G.\textsf{gb}\) for the guaranteed-before order of its underlying pre-execution. Finally, we prove that \(\textsf{gb}\) implies \(\textsf{ob}\) for any rdmasc-consistent execution (see the extended version [11] for the proof).
Theorem 1
(\(\textsf{gb}\) implies \(\textsf{ob}\)). Given a pre-execution \(\mathcal {G} \), for all rdmasc-consistent executions \(G \) and all \(\textsf{e}_1,\textsf{e}_2 \in E\), if \((\textsf{e}_1,\textsf{e}_2) \in \textsf{gb}\), then \((\textsf{e}_1,\textsf{e}_2) \in \textsf{ob}\).
Given Theorem 1 above, we can use \(\textsf{gb}\) as a tool to enforce robustness. Specifically, whenever a program-order pair \((\textsf{e}_1,\textsf{e}_2) \in \textsf{po}\) may be reordered, we can add the prescribed fences to enforce \((\textsf{e}_1,\textsf{e}_2) \in \textsf{gb}\) and thus block the reordering. The rest of this section describes when we should use this tool.
4.2 Conditions for Robustness under rdmasc
As mentioned before, blocking all instruction reorderings, i.e. requiring \(\textsf{po} \subseteq \textsf{gb}\), would enforce sequential consistency and thus robustness. However, this is too strict and highly impractical. Instead, we should ideally enforce \(\textsf{gb}\) selectively, when needed, and only prevent observable reorderings.
Two sources of weak behaviours. As presented in §2, rdmasc programs have two distinct sources of weak behaviours. These come from two different kinds of pairs of events (of the same thread): (1) pairs forming a data race on a certain location, e.g. \(a :=y^2; y^2 :=1\), as presented in Fig. 1f (copied below-left) and Figs. 1b, 1e, and 1h ; and (2) pairs whose reordering can be observed by other threads, e.g. \(a :=y^2 ; z^3 :=1\), as in the examples of Fig. 3b (copied below-right).
As such, stopping these two sources of weak behaviours would be enough to ensure robustness. Data races within a thread are always problematic, no matter the context, and we always need to block the reordering of such pairs (i.e. enforce \(\textsf{gb}\) to ensure the pair is \(\textsf{ob}\)-ordered in any execution). Pairs of the second kind cannot create weak behaviours by themselves, but they might allow weak behaviours depending on the programs of other threads. In the next section (§5), we show conditions making sure that such pairs can never create weak behaviours by design. In this section, we focus on deciding whether such a pair might lead to a weak behaviour and, if so, how to block the reordering.
To formulate this intuition, we write \(\textsf{public} (x)\) to denote that x is a public location accessed by multiple threads, and given a set of events \(E\), we define the set of public events in \(E\) as \(E^\textsf{pub}\triangleq \left\{ {\textsf{e}\in E\mid \textsf{public} (\texttt{loc} (\textsf{e}))}\right\} \). We further define \(E \setminus t \triangleq \left\{ {\textsf{e}\in E\mid t (\textsf{e}) \ne t}\right\} \) for the set of events in \(E\) that are not by thread \(t \). We can then formulate the two categories of weak behaviours above as two kinds of \(\textsf{sc}\) cycles: \(\textsf{sc}\) cycles on a single thread (1) and \(\textsf{sc}\) cycles on public events across threads (2), as formulated below (see the extended version for the full proof).
Theorem 2
(cycle decomposition). Given an rdmasc-consistent execution \(G \), if \(\textsf{sc}\) is cyclic (i.e. there is a cycle in \(\textsf{po} \cup \textsf{rf} \cup \textsf{mo} \cup \textsf{rb}\)), then:
either there is an \(\textsf{sc}\) cycle on a single thread, i.e. \(\textsf{sc} \cap \texttt{sthd} \) is cyclic;
or there exist \(\textsf{e}_1,\textsf{e}_2 \in E^\textsf{pub}\) such that \((\textsf{e}_1,\textsf{e}_2) \in \textsf{po} \setminus \textsf{ob}\) and \((\textsf{e}_2,\textsf{e}_1) \in (\textsf{sc}|_{E^\textsf{pub} \setminus t (\textsf{e}_1)})^+\). That is, there is an \(\textsf{sc}\) cycle on public events, with two \(\textsf{po}\)-related events on some thread \(t (\textsf{e}_1)\) not related in \(\textsf{ob}\), and where the rest of the cycle does not go through the events of \(t (\textsf{e}_1)\).
The two kinds of problematic reorderings are tackled separately below, and Theorem 5 confirms the two resulting conditions are sufficient for robustness.
Preventing \(\textsf{sc}\) cycles from data races. As shown above, when an allowed reordering is part of a data race, it becomes observable independently of the context. Thus, we should always preclude this kind of reordering. Specifically, in Def. 7 below we present a local data-race freedom property to block data races within each thread and prevent single-threaded weak behaviours.
Definition 7
(Local DRF). Given a pre-execution \(\mathcal {G} \), two events \(\textsf{e}_1, \textsf{e}_2 \in E\) are locally conflicting iff 1. \((\textsf{e}_1,\textsf{e}_2) \in \texttt{sthd} \); 2. \(\texttt{loc} (\textsf{e}_1) = \texttt{loc} (\textsf{e}_2)\); and 3. at least one of \(\textsf{e}_1,\textsf{e}_2\) is a write event. A pre-execution \(\mathcal {G} \) is locally data-race free (LDRF) iff for all \(\textsf{e}_1, \textsf{e}_2 \in \mathcal {G}.E\), if \(\textsf{e}_1, \textsf{e}_2\) are locally conflicting and \((\textsf{e}_1,\textsf{e}_2) \in \textsf{po}\), then \((\textsf{e}_1,\textsf{e}_2) \in \textsf{gb}\). Put differently, given the definition of \(\textsf{gb}\) (Fig. 5), a pre-execution \(\mathcal {G} \) is LDRF iff for all locally conflicting accesses \(\textsf{e}_1, \textsf{e}_2 \in E\), if \((\textsf{e}_1,\textsf{e}_2) \in \textsf{po}\), then the following four conditions hold:
1.
If \(\textsf{e}_1 \in \texttt{nlW} \) and \((\textsf{e}_1,\textsf{e}_2) \not \in \texttt{sqp} \), then there exists \(\textsf{e}_3 \in \texttt{P} \) such that \((\textsf{e}_1,\textsf{e}_3) \in \textsf{pf}\) and \((\textsf{e}_3,\textsf{e}_2) \in \textsf{po}\) (cells E1, E2, and E5 in Fig. 5).
2.
If \(\textsf{e}_1 \in \texttt{nlW} \), \(\textsf{e}_2 \in \texttt{nlR} \), and \((\textsf{e}_1,\textsf{e}_2) \in \texttt{sqp} \), then either there exists \(\textsf{e}_3 \in \texttt{P} \) with \((\textsf{e}_1,\textsf{e}_3) \in \textsf{pf}\) and \((\textsf{e}_3,\textsf{e}_2) \in \textsf{po}\); or there exists \(\textsf{e}_3 \in \texttt{nF} \) with \((\textsf{e}_1,\textsf{e}_3) \in \textsf{po}\) and \((\textsf{e}_3,\textsf{e}_2) \in \textsf{po}\) (E6).
3.
If \(\textsf{e}_1 \in \texttt{nlR} \), \(\textsf{e}_2 \in (\texttt{nlW} \cup \texttt{lW} \cup \texttt{CAS})\), and \((\textsf{e}_1,\textsf{e}_2) \not \in \texttt{sqp} \), then there exist \(\textsf{e}'_1 \in \texttt{nrW} \) and \(\textsf{e}_3 \in \texttt{P} \) such that \((\textsf{e}_1,\textsf{e}'_1) \in {\textsf{po}}{|_{\text {imm}}}\), \((\textsf{e}'_1,\textsf{e}_3) \in \textsf{pf}\), and \((\textsf{e}_3,\textsf{e}_2) \in \textsf{po}\) (cells B1 and B5).
4.
If \(\textsf{e}_1 \in \texttt{nrR} \) and \(\textsf{e}_2 \in \texttt{nrW} \), then either there exists \(\textsf{e}_3 \in \texttt{nF} \) such that \((\textsf{e}_1,\textsf{e}_3) \in \textsf{po}\) and \((\textsf{e}_3,\textsf{e}_2) \in \textsf{po}\); or there exist \(\textsf{e}'_1 \in \texttt{nlW} \) and \(\textsf{e}_3 \in \texttt{P} \) such that \((\textsf{e}_1,\textsf{e}'_1) \in {\textsf{po}}{|_{\text {imm}}}\), \((\textsf{e}'_1,\textsf{e}_3) \in \textsf{pf}\), and \((\textsf{e}_3,\textsf{e}_2) \in \textsf{po}\) (cell D7 in Fig. 5).
These cases prohibit all possible races on a location \(x\), i.e. of the form \(x :=y^n ; x :=-\) (E1, E5), \(x :=y^n ; - :=x\) (E2), \(x :=y^n ; z^n :=x\) (E6), \(y^n :=x ; x :=-\) (B1, B5), or \(- :=x^n ; x^n :=-\) (D7). Other entries in Fig. 5 cannot create races as either their ordering is already guaranteed (e.g. E9); or they are on two read events (e.g. B2, D8); or they cannot be on the same location (e.g. D3, E7).
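As an illustration of how the 'locally conflicting' test of Def. 7 can be checked mechanically, the sketch below (with our own event encoding, taking a guaranteed-before relation as an input over event indices) reports the conflicting program-order pairs that are left unordered.

```python
# Locally conflicting (Def. 7): same thread, same location, at least one write.
WRITES = {'lW', 'CAS', 'nlW', 'nrW'}

def locally_conflicting(e1, e2):
    """Events are dicts with 'tid', 'loc', and 'type' keys."""
    return (e1['tid'] == e2['tid']
            and e1['loc'] is not None
            and e1['loc'] == e2['loc']
            and (e1['type'] in WRITES or e2['type'] in WRITES))

def ldrf_violations(events_in_po_order, gb):
    """Return the locally conflicting po-pairs (as index pairs) not ordered
    by the given guaranteed-before relation gb (a set of index pairs)."""
    races = []
    for i, e1 in enumerate(events_in_po_order):
        for j in range(i + 1, len(events_in_po_order)):
            if locally_conflicting(e1, events_in_po_order[j]) and (i, j) not in gb:
                races.append((i, j))
    return races
```

Any pair it reports corresponds to one of the race patterns above and would need the fix prescribed by the relevant cell of Fig. 5.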
We argue that the constraints in Def. 7 do not restrict RDMA capabilities in that waiting for remote operations to complete before reusing their locations is already considered standard practice when writing RDMA programs.
We next show that LDRF prevents single-threaded weak behaviours.
Theorem 3
Given an rdmasc-consistent execution \(G \), if its associated pre-execution \(\mathcal {G} \) is locally data-race free, then there is no \(\textsf{sc}\) cycle on a single thread; that is, \(\textsf{sc} \cap \texttt{sthd} \) is acyclic and the first case of Theorem 2 does not arise.
Preventing \(\textsf{sc}\) cycles across threads. Unlike data races, pairs of the second kind cannot create weak behaviours by themselves, and their reorderings can only be observed in certain contexts.
The general strategy to prevent observable reorderings is straightforward: for every pair \((\textsf{e}_1,\textsf{e}_2) \in \textsf{po}\) on public locations, either we know for certain that a dependency from \(\textsf{e}_2\) back to \(\textsf{e}_1\) (via other threads) is impossible, or we conservatively block the reordering by enforcing \((\textsf{e}_1,\textsf{e}_2) \in \textsf{gb}\). The challenge is that such dependencies are heavily dependent on the specific execution. So how can we ascertain syntactically that a later event \(\textsf{e}_2\) cannot influence an earlier event \(\textsf{e}_1\)?
One easily accessible syntactic property is the communication pattern between nodes (i.e. when one node performs a remote operation towards another). Thus, to simplify the task, we over-approximate dependency (i.e. \(\textsf{sc}\) paths) with communication. Intuitively, if two nodes do not communicate in the network topology, then they cannot causally influence each other.
We write \({n _1} \underset{E}{\leftrightsquigarrow \leftrightsquigarrow } {n _2}\) (defined below) to denote that nodes \(n _1\) and \(n _2\) communicate via some event in \(E\), in that some thread \(t \) on \(n _1\) performs a remote operation \(\textsf{e}\in E\) towards \(n _2\), written \( \textsf{hasQP} (t, n _2, E)\), or vice versa.
$$\begin{aligned} {n _1} \underset{E}{\leftrightsquigarrow \leftrightsquigarrow } {n _2} \triangleq \exists t. \ (n (t) & = n _1 \wedge \textsf{hasQP} (t, n _2, E)) \vee (n (t) = n _2 \wedge \textsf{hasQP} (t, n _1, E))\\ \textsf{hasQP} (t, & \overline{n}, E) \triangleq \exists \textsf{e}\in (E.\texttt{nrW} \cup E.\texttt{nrR}). \ t (\textsf{e}) = t \wedge \overline{n}(\textsf{e}) = \overline{n}\end{aligned}$$
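Since hasQP and the communication relation depend only on the pre-execution, they can be computed directly from the events; the sketch below (with our own event encoding) also includes the reflexive-transitive closure of communication used in Lem. 1 and in the fenced condition.

```python
# hasQP(t, nbar, E): thread t performs some remote operation towards nbar in E.
def has_qp(t, nbar, events):
    return any(e['type'] in ('nrW', 'nrR') and e['tid'] == t
               and e['rnode'] == nbar for e in events)

def communicate(n1, n2, events):
    """n1 and n2 communicate via some event: a thread on n1 has a queue pair
    towards n2, or vice versa (a symmetric relation)."""
    tids = {(e['tid'], e['node']) for e in events}
    return any((node == n1 and has_qp(t, n2, events))
               or (node == n2 and has_qp(t, n1, events))
               for (t, node) in tids)

def communicate_star(n1, n2, events, nodes):
    """Reflexive-transitive closure of the communication relation."""
    reached, frontier = {n1}, [n1]
    while frontier:
        n = frontier.pop()
        for m in nodes:
            if m not in reached and communicate(n, m, events):
                reached.add(m)
                frontier.append(m)
    return n2 in reached
```

For instance, with one thread on node 1 issuing remote operations towards node 2 and another on node 2 towards node 3, nodes 1 and 3 do not communicate directly but are connected in the closure.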
We next show that if there is an \(\textsf{sc}\)-path from one event \(\textsf{e}_2\) to another \(\textsf{e}_1\) using public events in A, then the corresponding nodes (of the locations) of \(\textsf{e}_2\) and \(\textsf{e}_1\) must communicate via A. This is established in Lem. 1 below, with the proof given in the extended version [11].
Lemma 1
For all \(A \subseteq E^\textsf{pub}\), if \((\textsf{e}_2,\textsf{e}_1) \in (\textsf{sc}|_{A})^+\), then \({n (\texttt{loc} (\textsf{e}_2))} \underset{A}{\leftrightsquigarrow \leftrightsquigarrow }^* {n (\texttt{loc} (\textsf{e}_1))}\).
We are interested in the contrapositive of this lemma: a topological connection between the nodes (of the locations) of \(\textsf{e}_2\) and \(\textsf{e}_1\) is a necessary condition for an \(\textsf{sc}\)-path from \(\textsf{e}_2\) to \(\textsf{e}_1\). Put differently, if there is no communication between the nodes of \(\textsf{e}_2\) and \(\textsf{e}_1\), then \(\textsf{e}_2\) cannot influence \(\textsf{e}_1\). As such, we can use this to safely over-approximate whether an event can influence another. We conservatively assume that if the two nodes can communicate (outside of the thread), then \(\textsf{e}_2\) might influence \(\textsf{e}_1\). These communications do not depend on a specific execution and can be ascertained syntactically from the pre-execution.
We can then prevent \(\textsf{sc}\) cycles across threads using the fenced condition below (Def. 8): for all \((\textsf{e}_1,\textsf{e}_2) \in \textsf{po}\) on public locations, if \(\textsf{e}_2\) might influence \(\textsf{e}_1\), then we block the reordering. We subsequently prove that if a pre-execution is fenced, then it does not admit \(\textsf{sc}\) cycles across threads.
Definition 8
(fenced). A pre-execution \(\mathcal {G} \) is fenced iff for all \(\textsf{e}_1,\textsf{e}_2 \in E^\textsf{pub}\), if \((\textsf{e}_1,\textsf{e}_2) \in \textsf{po}\) and \({n (\texttt{loc} (\textsf{e}_1))} \underset{ {E^\textsf{pub} \setminus t (\textsf{e}_1)}}{\leftrightsquigarrow \leftrightsquigarrow }^* {n (\texttt{loc} (\textsf{e}_2))}\), then \((\textsf{e}_1,\textsf{e}_2) \in \textsf{gb}\).
Theorem 4
Given an rdmasc-consistent execution \(G \), if its associated pre-execution \(\mathcal {G} \) is fenced, then there is no \(\textsf{sc}\) cycle of the shape described in the second case of Theorem 2 with \(\textsf{e}_1,\textsf{e}_2 \in E^\textsf{pub}\). That is, the second case of Theorem 2 does not arise.
Robustness. Lastly, we show that LDRF and fenced imply robustness under rdmasc. Thus, this approach can be used to prevent RDMA weak behaviours.
Theorem 5
(Robustness under rdmasc). Given a pre-execution \(\mathcal {G} \), if \(\mathcal {G} \) is locally data-race free (Def. 7) and fenced (Def. 8), then \(\mathcal {G} \) is also robust under rdmasc (Def. 5).
4.3 Usage and Limitations
Local data-race freedom (Def. 7) and fenced (Def. 8) are intuitive properties that can be checked syntactically. Indeed, given a program, it is straightforward to check mechanically whether these properties hold, or to provide an explicit counterexample and a suggested fix using the definition of \(\textsf{gb}\) (Def. 6). As a result, sufficient constraints can automatically be added to ensure robustness.
However, this simplicity can occasionally be the limitation of our approach. Specifically, as the main theorem does not account for interactions between threads, it takes a conservative approach, which at times can lead to false negatives (where the program is deemed not robust even though no weak behaviours are possible), recommending unnecessary restrictions.
Fig. 6.
Examples illustrating the limitation of Theorem 5, where the programs in (a) and (b) are robust (the weak behaviour \(a=1\) is not allowed in either) while that in (c) is not robust (it admits the weak behaviour \(a=1\)); while Theorem 5 rightfully identifies (a) as robust (true positive) and (c) as not robust (true negative), it conservatively deems (b) not robust (false negative).
To see this, consider the example in Fig. 6a, where \(a :=x^2\) and \(y^3 :=1\) can be reordered without introducing weak behaviours. In this case, Theorem 5 rightfully confirms that no additional restrictions are necessary. By contrast, consider the variant shown in Fig. 6b: although the two extended threads do not introduce any additional weak behaviours, our approach assumes there might be a causal dependency from \(y^3 :=1\) to \(a :=x^2\), as is the case e.g. in Fig. 6c. As such, Theorem 5 cannot determine Fig. 6b as robust, and our approach would recommend inserting a poll operation in the first thread. Note that removing any of the six operations would enable Theorem 5 to ascertain Fig. 6b as robust.
Understanding that the reordering of the instructions in the first thread of Fig. 6b is not problematic would require a more complex static analysis beyond the scope of this paper.
5 Application: Tree Topology
Theorem 5 outlines the conditions under which we can guarantee that a program is robust under rdmasc. However, while the LDRF property (Def. 7) is reasonable, the fenced property (Def. 8) can lead to excessive restrictions (e.g. as in Fig. 6). Specifically, for every pair of events (\(\textsf{e}_1, \textsf{e}_2\)) in program order, we must either verify that \(\textsf{e}_2\) cannot affect \(\textsf{e}_1\), or ensure that their execution order is preserved. The main issue is that preserving the order of every pair of events can be particularly costly, notably when considering NIC remote write events. In such cases, the only resort is to introduce a ‘dummy’ get operation and poll it, which is inefficient. Instead, we propose a strategy whereby we stipulate certain conditions on the network topology (i.e. the shape of the RDMA network) so that later events are often unable to influence earlier events.
To this end, we propose a tree topology that balances generality (supporting a wide range of programs) with efficiency and restrictiveness (requiring minimal additional constraints to respect the fenced property). In §5.1 we present an overview of our new set of restrictions and illustrate their rationale through examples. In §5.2 we formalise these restrictions and prove that they indeed imply robustness under rdmasc. Finally, in §5.3 we demonstrate specific applications of the tree topology and how RDMA programs can make use of them.
5.1 Overview of the Restrictions
We describe four different conditions that, if satisfied, ensure the robustness of RDMA programs under rdmasc, and we justify them through examples.
LDRF. As before, we require that programs satisfy LDRF (Def. 7). As discussed, this is considered standard practice when writing RDMA programs and should not be seen as a limitation.
Private Copies. We require the local locations of RDMA operations – e.g. location y in \(y :=x^2\) – to be private (i.e. accessed by only one thread, namely that executing the RDMA operation). Intuitively, to maximise the efficiency of RDMA programs, we should ideally allow arbitrary interleaving of RDMA operations and CPU computations. For instance, let us consider the single-threaded program \(y :=x^2 ; C\), where \(C\) denotes a block of CPU instructions that does not access location y. If y is private, then although \(C\) and the get \(y :=x^2\) may be reordered, this reordering will not lead to any observable weak behaviours. That is, when we run this program concurrently with any RDMA program \(P\), if y is private, then we do not need to poll \(y :=x^2\) before proceeding with \(C\) (even though they may be reordered), as the reordering cannot be observed by \(P\).
However, if y is accessible by other threads (on the same node or from a remote node), then the reordering becomes visible, allowing additional, potentially unwanted, weak behaviours. This is illustrated in the example below, where y is public (accessed by nodes 1 and 3).
Therefore, to prevent such weak behaviours, we stipulate that local locations of RDMA operations be private. This is not a costly limitation. Specifically, in the case of put operations, the data can easily be copied beforehand to a one-time-use private location. In the case of get operations, the thread executing the command must first acknowledge (poll) the incoming data and then copy it to make it available to other threads with access to the node.
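The rationale behind the put-side copy can be seen in a toy model of asynchronous puts, where the NIC reads the local source only at completion time. This sketch is purely illustrative: the `ToyNIC` class and all names in it are assumptions of the sketch, not part of the paper's formal model.

```python
# A toy model of asynchronous puts, illustrating the private-copy
# discipline: the put records *which location* to send, and the NIC
# reads it only at completion time.
class ToyNIC:
    def __init__(self, local_mem, remote_mem):
        self.local, self.remote = local_mem, remote_mem
        self.pending = []

    def put(self, local_loc, remote_loc):
        # Submitting returns immediately; the read of local_loc is deferred.
        self.pending.append((local_loc, remote_loc))

    def complete_all(self):
        for local_loc, remote_loc in self.pending:
            self.remote[remote_loc] = self.local[local_loc]
        self.pending.clear()

local, remote = {"d": 1}, {}
nic = ToyNIC(local, remote)

# Without a private copy: a later CPU write races with the deferred read.
nic.put("d", "x")
local["d"] = 2              # CPU overwrites d before the NIC reads it
nic.complete_all()
assert remote["x"] == 2     # the put sent the *new* value: weak behaviour

# With a one-time-use private copy, the transmitted value is fixed at
# submission time, so CPU computation can proceed immediately.
local["d"] = 1
local["__copy0"] = local["d"]   # private: written once, read only by the NIC
nic.put("__copy0", "x")
local["d"] = 2
nic.complete_all()
assert remote["x"] == 1         # the original value is transmitted
```

In the first run the programmer would otherwise have to poll the put before reusing `d`; the copy removes that synchronisation at the price of one local write.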
Get in Order. We stipulate that each get operation be followed by a remote fence. Recall that only certain reorderings are allowed among the operations of the same queue pair. Intuitively, put operations cannot be overtaken, so we need not restrict their usage. However, get operations can be overtaken by other get/put operations, as shown in the examples below, where the get \(a :=x^2\) is overtaken by a later remote operation on the same queue pair, leading to weak behaviours.
As such, to prevent non-SC behaviours, we require that each get operation be followed by a remote fence, forcing the queue pair to await the completion of the get before starting the next remote operation. Of course, if the get is polled before another RDMA operation is submitted, the remote fence is not needed. Note that since remote fences block neither CPU computations nor communications with other nodes, they are not very expensive and are a reasonable price to pay to ensure that remote operations towards a specific remote node stay in order.
Tree Topology. Finally, the most important restriction is to constrain the topology of the network over which the program runs. Intuitively, having multiple paths between a set of nodes allows for visible effects to overtake each other (i.e. be reordered) along different paths, leading to weak behaviours. In the extreme case where every thread can communicate directly with every other node, we allow for a large number of visible reorderings, and lose any hope of preventing non-SC behaviours. When such connected topologies are needed to enable more efficient implementations (e.g. consensus algorithms), the developers must carefully account for the possible weak behaviours.
Our proposal is to adhere to a minimal topology where there is (at most) a single communication path between each pair of nodes. In the examples below we show how not adhering to the tree topology can lead to weak behaviours. Note that although we have followed each remote operation with a corresponding (costly) poll, we still cannot prevent the weak behaviours shown.
The first example shows that queue pairs in both directions (between nodes 1 and 2) can lead to weak behaviours, as each node can observe the reordering of operations on the other node. The second example illustrates two paths between nodes 1 and 3: a direct path from node 1 to 3 (via \(z^3 :=1\)) and an indirect path through node 2 (from node 1 to 2 via \(x^2 :=1\); from node 2 to 3 via \(y^3 :=x\)). As shown, having multiple paths between two nodes allows threads to observe reorderings: \(z^3 :=1\) is submitted first, but the effects of \(x^2 :=1\), forwarded via \(y^3 :=x\), are observed first. The third example is a variant of the second, where the middle node is replaced by an additional thread on the left node. As queue pairs from different threads of the same node towards the same remote node are still independent, the weak behaviour shown is permitted.
5.2 Tree Robustness
We next formalise the conditions described in §5.1 in Def. 9 below.
Definition 9
(tree-fenced). A pre-execution \(E\) is tree-fenced iff:
1. Local locations of RDMA operations are private: \(E^\textsf{pub}.\texttt{nlR} = E^\textsf{pub}.\texttt{nlW} = \emptyset \)
2. Each get operation is followed by a remote fence (or is polled) before the next remote operation on the same queue pair. That is, for all \(\textsf{e}_1, \textsf{e}_2\), if \(\textsf{e}_1 \in \texttt{nrR} \), \(\textsf{e}_2 \in (\texttt{nrR} \cup \texttt{nrW})\), and \(\textsf{e}_1\) precedes \(\textsf{e}_2\) in program order on the same queue pair, then: either there exists \(f \in \texttt{nF} \) between \(\textsf{e}_1\) and \(\textsf{e}_2\) in program order; or there exist \(\textsf{e}_3 \in \texttt{nlW} \) and \(p \in \texttt{P} \) such that \(\textsf{e}_3\) is the local write of \(\textsf{e}_1\), \(p\) polls \(\textsf{e}_3\), and \(p\) precedes \(\textsf{e}_2\) in program order.
3. There is (at most) a single communication path between any pair of nodes, in that the following three properties hold:
(a) The network does not have cycles, i.e. for all sets of distinct nodes \(\{n _1; \ldots ; n _k\}\) with \(k>2\): \(\lnot (n _1 \underset{E}{\leftrightsquigarrow \leftrightsquigarrow }n _2 \underset{E}{\leftrightsquigarrow \leftrightsquigarrow }\ldots \underset{E}{\leftrightsquigarrow \leftrightsquigarrow }n _k \underset{E}{\leftrightsquigarrow \leftrightsquigarrow }n _1)\)
(b) No two nodes have queue pairs towards each other:
\(\lnot \exists t _1,t _2. \ \textsf{hasQP} (t _1, n (t _2), E) \wedge \textsf{hasQP} (t _2, n (t _1), E)\)
(c) Each node has at most one queue pair towards each remote node:
\(\forall t, t', \overline{n}. \ t \ne t' \wedge \textsf{hasQP} (t, \overline{n}, E) \wedge \textsf{hasQP} (t ', \overline{n}, E) \implies n (t) \ne n (t ')\)
Conditions 1 and 2 are purely syntactic and can be checked straightforwardly by examining the RDMA program. Condition 3 pertains to the topology of the RDMA network, which is likewise determined by the program's queue-pair configuration and can thus also be checked syntactically.
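To make the topology checks of condition 3 concrete, they can be sketched as a small graph procedure. The sketch below is illustrative only: it assumes a hypothetical encoding of queue pairs as (thread, source-node, destination-node) triples rather than the paper's formal pre-executions.

```python
from collections import defaultdict

def check_tree_topology(qps):
    """Check condition 3 of the tree-fenced definition on a queue-pair
    configuration given as (thread, src_node, dst_node) triples.
    Returns the list of violated sub-conditions (empty iff condition 3 holds)."""
    violations = []

    # (c) each node has at most one queue pair towards each remote node
    threads_per_edge = defaultdict(set)
    for thread, src, dst in qps:
        threads_per_edge[(src, dst)].add(thread)
    if any(len(ts) > 1 for ts in threads_per_edge.values()):
        violations.append("c")

    # (b) no two nodes have queue pairs towards each other
    directed = {(src, dst) for _, src, dst in qps}
    if any((dst, src) in directed for src, dst in directed):
        violations.append("b")

    # (a) the undirected communication graph is acyclic: union-find over
    # nodes; connecting two nodes already in one component closes a cycle
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in sorted({tuple(sorted(e)) for e in directed}):
        ra, rb = find(a), find(b)
        if ra == rb:
            violations.append("a")
            break
        parent[ra] = rb
    return violations

# A star topology (one manager, three workers) satisfies all three checks:
assert check_tree_topology([("t2", 1, 2), ("t3", 1, 3), ("t4", 1, 4)]) == []
# Queue pairs in both directions between nodes 1 and 2 violate (b):
assert check_tree_topology([("t1", 1, 2), ("t2", 2, 1)]) == ["b"]
```

A three-node ring (1→2, 2→3, 3→1) would be flagged under (a), matching the indirect-path example above.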
A key advantage of these restrictions is that preventing weak behaviours never requires polling remote operations. This is crucial because the efficiency of RDMA implementations comes from parallelising data transfers and computations. As shown in the overview (§2), polling is very costly as it completely halts local computations and prevents submission of remote operations to any queue pair. With a tree topology, programmers only need to wait for remote operations to use their results (as per LDRF Def. 7), and do not need to sacrifice computation time to prevent reorderings.
We next prove that if a pre-execution is tree-fenced, then it is also fenced. The full proof is given in the extended version [11].
Theorem 6
If a pre-execution is tree-fenced (Def. 9), then it is fenced (Def. 8).
Hence, LDRF and tree-fenced properties imply robustness under rdmasc.
Corollary 1
(Tree robustness under rdmasc). If a pre-execution \(E\) satisfies LDRF (Def. 7) and is tree-fenced (Def. 9), then it is also robust under rdmasc (Def. 5).
5.3 Specific Applications
The tree-fenced conditions above provide guidelines to ensure programs cannot exhibit weak behaviours. While not all RDMA programs follow the restrictions presented, a tree topology is sufficient for a range of applications. Notably, any setup using RDMA solely for the data transfer capabilities (and not for distributed computations) can easily be configured as a tree.
Star Topology: Single Manager Multiple Workers. The star topology is one of the most common network configurations, providing simple and reliable communication between nodes, with many applications such as implementing local area networks (LANs). The star topology allows a main node to distribute jobs to other nodes and periodically check their progress. As demonstrated in this paper, this setup prevents any weak network behaviour even though communications towards different workers are independent and can be reordered.
Star Topology: Single Server Multiple Clients. The tree-fenced condition (Def. 9) is permissive enough to allow us to translate common concurrent algorithms (comprising loads and stores over shared memory) to distributed ones over RDMA automatically as follows. Specifically, consider a concurrent algorithm \(P_c\) using k threads \((t _1,...,t _k)\). We can translate this to a corresponding RDMA program \(P_r\) using k nodes \((n _1,...,n _k)\), where a designated node (say \(n _1\)) is the server and the others (\(n _2,...,n _k\)) are clients, and each node \(n _i\) has a single thread simulating \(t _i\). All shared locations and data are located on the server node (\(n _1\) running \(t _1\)). For each of the remaining nodes \(n _i\), we replace the loads and stores on shared locations with get and put operations, respectively. Moreover, we insert a remote fence after each get operation (to ensure condition (2) of Def. 9) and poll get operations before using their values (to ensure LDRF).
The resulting RDMA program follows a star topology, with \(n _1\) as the central (server) node accessed by multiple clients \((n _2,...,n _k)\). Client locations are private by definition, ensuring that the tree-fenced condition holds. \(P_r\) thus avoids weak behaviours and constitutes a suitable implementation of \(P_c\).
Observe that in this implementation, polling put operations is unnecessary (as long as different local locations are used for copying), and get operations can be optimised by being submitted as early as possible (i.e. after previous RDMA operations and reads on the same location) and before they are needed, allowing them to be interleaved with other computations.
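The per-client translation described above can be sketched as a simple rewriting of one thread's instruction stream. The instruction encodings (`("load", local, shared)`, `("store", shared, local)`) are hypothetical, and placing the poll directly after each get is a simplifying assumption of this sketch; as just noted, gets may instead be submitted early and polled only where their values are used.

```python
def translate_client_ops(ops):
    # Shared-memory loads/stores of one client thread become RDMA
    # operations towards the server node: a load of a shared location
    # becomes a get (plus remote fence and poll), a store becomes a put.
    out = []
    for op in ops:
        if op[0] == "load":
            _, local, shared = op
            out.append(("get", local, shared))  # fetch from the server
            out.append(("rfence",))             # condition (2) of Def. 9
            out.append(("poll", local))         # LDRF: complete before use
        elif op[0] == "store":
            _, shared, local = op
            out.append(("put", shared, local))  # puts need no extra fence
        else:
            out.append(op)                      # CPU-only steps unchanged
    return out

assert translate_client_ops([("load", "a", "x"), ("store", "y", "b")]) == [
    ("get", "a", "x"), ("rfence",), ("poll", "a"), ("put", "y", "b")]
```

Since client locations such as `a` are private to the client's single thread, the resulting program satisfies the tree-fenced conditions by construction.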
6 Related Work
RDMA Semantics. The first realistic formal model for RDMA programs is rdmatso by Ambal et al. [10], which assumes that CPU concurrency is governed by TSO; the model is formalised both operationally and declaratively, and the two formulations are shown to be equivalent. They also validate rdmatso empirically by running an extensive suite of litmus tests on RDMA hardware. While comprehensive in its formal description of the language, this work does not present strategies for mitigating RDMA weaknesses or for optimising the use of this technology, e.g. by minimising poll and fence instructions. The only other work on formal RDMA semantics is that by Dan et al. [27], which, as demonstrated by Ambal et al. [10], does not follow the RDMA specification.
Weak Memory Models. Existing literature includes multiple examples of weak consistency models. For hardware, several works have formalised the semantics of the x86, ARMv8 and POWER architectures [2, 5, 9, 31, 48, 59, 63, 67, 68]. However, none of these works covered the consistency semantics of RDMA programs. For software, there have been a number of formal models for C11 [12, 25, 37, 40, 42, 44, 53, 56] with verified compilation schemes [51, 57, 58], as well as for Java [15, 49], transactional memory [60, 61, 72], the Linux kernel [8] and the ext4 filesystem [39]. Additionally, there have been several works on formalising the persistency semantics of programs in the context of non-volatile memory, describing the behaviour of programs in the event of crashes [26, 38, 64‐66], as well as program logics for verifying such programs [17, 62, 70].
Robustness. The concept of robustness against weak memory semantics has been extensively studied across various models as a means to simplify programming, reasoning, and verification. Notably, robustness for Total Store Order (TSO) and its Partial Store Order (PSO) variant [9, 36, 55] has received significant attention, e.g. [1, 2, 18‐20, 23, 24, 35, 45‐47, 54]. In addition, robustness has been used as a correctness notion in the context of automatic fence insertion for weak hardware memory models [6, 7, 22, 28, 29]. More recent work has developed techniques for checking robustness against concurrency semantics in programming languages, particularly the C11 memory model [41, 50]. Robustness has also been explored in distributed systems, where Sequential Consistency (SC) is replaced by serialisability [13, 14, 16, 21, 30, 52]. More recently, [34] addressed the problem of checking robustness in the context of weak persistency models for non-volatile memory.
Some of these works provide sound and complete techniques for verifying robustness, along with complexity bounds for specific models. Others, as with our work on RDMA, focus on practical over-approximations, offering programmers guidelines that, when followed, ensure stronger semantics. The well-known Data-Race-Free (DRF) guarantee [3, 33] for multicore hardware and programming language models is a prominent criterion of this type.
Acknowledgements
We thank the anonymous reviewers for their valuable feedback and Viktor Vafeiadis for many fruitful discussions. Ambal is supported by the EPSRC grant EP/X037029/1. Lahav is supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 851811) and the Israel Science Foundation (grant no. 814/22). Raad is supported by a UKRI fellowship MR/V024299/1, by the EPSRC grant EP/X037029/1 and by VeTSS.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.