1 Introduction
1.1 Contributions
- A novel interpretation and formalization of dynamic partial replication as a stateful routing problem.
- A solution to said routing problem: E-Cast. Formally, E-Cast is a dynamic, uniform, causal total order multicast protocol for asynchronous networks with unreliable failure detectors. It handles both message ordering and dynamic membership in a single, wait-free (asynchronous and pipelining) protocol. By plugging different routing functions into E-Cast, applications built on E-Cast can implement complex system membership and data replication schemes in a straightforward manner. This design makes E-Cast as easy to use as a classic atomic broadcast protocol, while also providing much of the elasticity and performance of a more general group communication toolkit.
- Rubberband, a framework for highly available, elastic data stores built on E-Cast. Rubberband supports atomic queries and updates with and (in contrast to existing systems) without key predicates.
- Crescando/RB, a relational data store that combines Rubberband and the Crescando engine into a specialized system for real-time business intelligence workloads. Crescando/RB addresses a real industry use case, the Amadeus travel reservation system.
1.2 Outline
2 Use case
3 System overview
3.1 Crescando
Given queries with equality predicates such as \(Q_1\): airport = ‘LHR’, \(Q_2\): airport = ‘LAX’, etc., Crescando builds a hash index over the predicate constants (‘LHR’, ‘LAX’). During execution, the set of queries forms the inner side of an index-join between the data table and the (virtual) query table. Crescando indexes not just queries, but also update and delete operations. Because predicate indices are short-lived (one scan cycle), Crescando can rapidly adapt to changing workloads.

3.2 Crescando/RB
| Problem | Solution | Advantage |
|---|---|---|
| Unpredictable workload | Scan-only query processing | No index maintenance, robust performance |
| High access latency | End-to-end pipelining and batching | Throughput independent of latency |
| Large range queries | Read-one-write-all successor replication | No duplicate matching, minimal overhead |
| Non-key queries and updates | Up-front, total order of operations | No deadlocks, no livelocks, robust performance |
| System membership changes | Stateful routing, partial live migration | No blocking, continuous availability |
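The scan-only query processing of Sect. 3.1 can be illustrated with a small sketch (all names and data structures here are hypothetical, not Crescando's actual implementation): queries with equality predicates are indexed by their predicate constants, and a single scan over the data table probes this query index for every record, forming an index-join between the data table and the (virtual) query table.

```python
# Hypothetical sketch of shared, scan-only query processing:
# equality queries are hash-indexed by predicate constant, then one
# scan over the data joins each record against the query index.
from collections import defaultdict

def build_query_index(queries):
    """Index queries by the (attribute, constant) of their equality predicate."""
    index = defaultdict(list)
    for qid, (attr, const) in queries.items():
        index[(attr, const)].append(qid)
    return index

def shared_scan(table, query_index, attrs):
    """One pass over the table; probe the query index once per record."""
    results = defaultdict(list)
    for record in table:
        for attr in attrs:
            for qid in query_index.get((attr, record[attr]), []):
                results[qid].append(record)
    return results

# Usage: two queries over an 'airport' attribute, answered in one scan.
queries = {"Q1": ("airport", "LHR"), "Q2": ("airport", "LAX")}
table = [{"airport": "LHR", "pax": 210},
         {"airport": "LAX", "pax": 180},
         {"airport": "ZRH", "pax": 95}]
idx = build_query_index(queries)
res = shared_scan(table, idx, attrs=["airport"])
```

Because the query index lives only for one scan cycle, a changed workload simply produces a fresh index on the next cycle, which is the source of Crescando's robustness to unpredictable workloads.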
4 Background and related work
4.1 NoSQL data stores
4.2 Push-based query processing
`next()` or some equivalent iterator interface to each operator [40].

4.3 Scan-only query processing
4.4 State-machine replication
4.5 Group communication
5 E-cast
5.1 Motivation
5.2 System model
5.3 Protocol overview
5.4 Application process algorithm
5.5 Router process algorithm
`destsetpfx` [19].

5.6 Implementation
E-Cast is configured by plugging a `Config` module into the routers. The `Config` module defines `apply`, `destset`, and `suspects` functions over config objects and messages. Routers maintain a sequence of config objects, created by successively applying learned messages to the latest config. Applying a message that does not change the configuration returns the same `Config` object, so there is no space overhead if the configuration does not change. Config objects are garbage-collected when all messages that rely on them for routing have become stable. For a concrete `Config` module, see Sect. 6.3.

6 Rubberband
Rubberband:

- delivers writes uniformly, in causal total order (yielding sequential consistency [33]);
- is completely wait-free (clients can safely submit a high-rate stream of reads and writes without waiting for confirmations);
- elastically scales to many storage processes;
- allows data stores to remain continuously available, even during reconfiguration (data repartitioning);
- tolerates permanent, silent failure of any process;
- does not require stable storage (disks), perfect failure detectors, or real-time clocks.
6.1 Process types
- Client processes submit read and write messages. Depending on the type of message and the chosen isolation level (cf. Sect. 6.7), an individual read or write may be transmitted through E-Cast, or sent directly to one or more storage processes.
- Storage processes each store a partition of application data, as explained below. Delivered read and write messages are forwarded to the storage engine (Crescando) in an order that preserves consistency. Conversely, result tuples emitted by the storage engine are streamed back directly to the respective client process, which in turn delivers them to the user.
- Super processes submit configuration messages on behalf of users. Configuration messages change the mapping of keys to storage processes, as explained in Sect. 6.4. Super processes maintain a replica of the latest system configuration (set of processes and mapping of keyspace partitions to storage processes). Super processes are also responsible for failure detection of client, storage, and super processes.
6.2 Data placement
6.3 Message routing with E-cast
E-Cast requires `Config` modules to define the following callback functions: `destset`, `apply`, and `suspects`. The Rubberband `Config` module and the implementation of these callback functions are sketched in Algorithm 3.

The `destset` function computes the destination set of data messages in read-one-write-all (ROWA) manner. The destination set of a configuration message is simply the set of all affected processes, which always includes the set of super processes, since these must maintain a replica of the current system configuration.

The `apply` function works as follows. Whenever a data message is applied to a config object, it returns the original config (reads and writes do not change the system configuration). When a configuration message is applied, however, it returns a new config which reflects the effect of the message and may change the routing of subsequent messages. For example, when a suspect message \(m\) is applied such that \({{\mathrm{suspects}}}(m) = \{s\}\), where \(s\) is some storage process in the keyspace, Rubberband will stop routing subsequent read messages to \(s\) and instead use the neighbors of \(s\) on the keyspace, which hold replicas of the data partition assigned to \(s\).

The `suspects` function informs E-Cast of failed application processes. Messages sent by suspected processes have no effect, which ensures that suspected processes cannot corrupt the system state. The concrete implementation of the config module does not remember which processes are suspected (there may be many in a long-lived system), but instead remembers which unsuspected processes are currently part of the system.

The `destset` function must be deterministic to ensure agreement on the destination set of a message across multiple router processes. In the case of read messages, there are typically many possible, correct destination sets, because multiple replicas are available for each data object. A naïve solution is to always choose the replica with the lowest identifier, but this would lead to poor load balance. Instead, Rubberband client processes tag every read message with a random replica offset. This offset determines which storage process to choose in every replication group.

6.4 Asynchronous reconfiguration
- Expand: Assign an ID to a standby storage process, which upon delivery of the message becomes active.
- Contract: Contract the keyspace by removing an active storage process from it.
- Replace: Replace an active storage process with a standby storage process, assigning the same ID to it.
- Rebalance: Assign a different ID to an active storage process, thereby rebalancing the keyspace.
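As a rough illustration of how the `destset` and `apply` callbacks of Sect. 6.3 could interact with reconfiguration, the following self-contained sketch (hypothetical names and structures, not the paper's Algorithm 3; only expand, contract, and suspect are modeled) places storage processes on a hashed keyspace ring with read-one-write-all successor replication. Reads pick one replica via the client-supplied replica offset; data messages leave the config unchanged, configuration messages yield a new config.

```python
# Hypothetical sketch of a Rubberband-style Config module: storage
# processes sit on a keyspace ring; writes go to all k successor
# replicas (ROWA), reads to one replica chosen by the client's offset.
import bisect

class Config:
    def __init__(self, ring, k=2):
        self.ring = dict(ring)   # process id -> keyspace position
        self.k = k               # replication factor

    def _replicas(self, key):
        """The k successors of key on the ring (successor replication)."""
        positions = sorted((pos, pid) for pid, pos in self.ring.items())
        i = bisect.bisect_left([p for p, _ in positions], key)
        return [positions[(i + j) % len(positions)][1] for j in range(self.k)]

    def destset(self, msg):
        """ROWA: writes to all replicas, reads to one (offset-chosen)."""
        replicas = self._replicas(msg["key"])
        if msg["op"] == "write":
            return set(replicas)
        return {replicas[msg.get("offset", 0) % len(replicas)]}

    def apply(self, msg):
        """Data messages return the original config; configuration
        messages (expand/contract/suspect) return a new config."""
        if msg["op"] in ("read", "write"):
            return self
        ring = dict(self.ring)
        if msg["op"] == "expand":
            ring[msg["pid"]] = msg["pos"]
        elif msg["op"] in ("contract", "suspect"):
            ring.pop(msg["pid"], None)
        return Config(ring, self.k)

# Usage: suspecting s2 reroutes reads of its keys to a ring neighbor,
# which holds a replica of s2's partition.
cfg = Config({"s1": 10, "s2": 20, "s3": 30}, k=2)
assert cfg.destset({"op": "write", "key": 15}) == {"s2", "s3"}
cfg2 = cfg.apply({"op": "suspect", "pid": "s2"})
assert cfg2.destset({"op": "read", "key": 15}) == {"s3"}
```

Note that `destset` is deterministic given the message (including its replica offset) and the config, which is what lets multiple routers agree on destination sets without coordination.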
6.5 Reconfiguration as an optimization problem
6.6 Consistency versus robustness versus scalability
6.7 Read isolation levels
- Sequential read: Every sequential read sees a globally consistent snapshot that is more recent than any snapshot seen by a preceding sequential read of the same client, and that reflects the effects of all preceding writes by the client, but not the effects of any subsequent write by the client. This guarantees sequential consistency [33] with respect to sequential reads, hence the name.
- Snapshot read: Snapshot reads are brought into total order with respect to writes. Thus, snapshot reads see a globally consistent snapshot of the data. However, no guarantee is given concerning the specific snapshot version relative to operations submitted before and after (i.e., no causal order).
- Basic read: Basic reads scale linearly with the number of storage processes. However, multi-key basic reads may see partial effects of concurrent writes.
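Recall from Sect. 6.1 that a read is either transmitted through E-Cast or sent directly to storage processes, depending on the isolation level. A minimal sketch of that dispatch (the function and transport names are hypothetical): sequential and snapshot reads are ordered through E-Cast with respect to writes, while basic reads bypass E-Cast and go straight to the storage processes.

```python
# Hypothetical sketch of client-side read routing by isolation level:
# ordered levels go through E-Cast, basic reads are sent directly.
def route_read(level, ecast_submit, direct_send, msg):
    if level in ("sequential", "snapshot"):
        # Brought into total order with respect to writes.
        return ecast_submit(msg)
    elif level == "basic":
        # Scales linearly, but may see partial effects of writes.
        return direct_send(msg)
    raise ValueError(f"unknown isolation level: {level}")

# Usage with stub transports that record where each read went:
log = []
seq = route_read("sequential",
                 lambda m: log.append(("ecast", m)) or "ordered",
                 None, {"key": 7})
bas = route_read("basic", None,
                 lambda m: log.append(("direct", m)) or "unordered",
                 {"key": 7})
```

The trade-off is visible in the dispatch itself: only reads that pass through E-Cast pay the ordering cost, so basic reads trade consistency for throughput.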
7 Experimental evaluation
7.1 Platform
7.2 Workload and configuration
7.3 Write scalability
7.4 Read scalability
7.4.1 Hash partitioning
7.4.2 Random partitioning
7.5 Range reads and range writes
7.6 Elasticity
7.6.1 Throughput
7.6.2 Transaction duration
7.6.3 Elasticity on EC2
`cc1.4xlarge`. Each of these instances had 23 GB of memory, two quad-core Intel Xeon X5570 2.93 GHz CPUs with hyper-threading, and a 10 Gbit Ethernet interface. One EC2 instance ran the client process, and three EC2 instances each ran one router process. The remaining 16 EC2 instances each ran one storage process.