Path queries over labelled graphs are increasingly common in many applications. These include fraud detection [
27], recommender systems [
9] and social analysis [
2], amongst others. Such a labelled graph has the form
\(G=(V,E, L_V, l )\), where each vertex
v is annotated with a label
\( l (v) \in L_V\) from a predefined set
\(L_V\) of labels (e.g. Purchase, Person, etc.). In this work we address the problem of efficiently and incrementally improving path query performance over
\(k\)-way partitionings of large, heterogeneous, labelled graphs. A
\(k\)-partitioning
\(P_k(G)\) of
G is a family of disjoint sets
\(\{V_1,V_2,\ldots ,V_k\}\), with
\(V_1 \cup \dots \cup V_k = V\). The heterogeneity of
G refers to the diversity in the labels
\(L_V\) associated to the vertices, e.g., a social graph with
\(L_V = \{ Person, Post\}\) is more heterogeneous than a web graph with
\(L_V = \{ Url \}\).
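As a concrete illustration of these definitions, the following sketch encodes a labelled graph \(G=(V,E,L_V,l)\) and checks the disjointness and coverage conditions of a \(k\)-partitioning; the class and function names are ours, chosen for illustration only.

```python
# A minimal sketch of the definitions above; names are illustrative,
# not any system's actual data structures.
from dataclasses import dataclass, field

@dataclass
class LabelledGraph:
    """G = (V, E, L_V, l): vertices, edges, and a label l(v) per vertex."""
    vertices: set = field(default_factory=set)
    edges: set = field(default_factory=set)    # set of (u, v) pairs
    label: dict = field(default_factory=dict)  # l(v), drawn from L_V

def is_k_partitioning(parts, graph):
    """True iff `parts` is a family of disjoint sets whose union is V."""
    union = set().union(*parts)
    disjoint = sum(len(p) for p in parts) == len(union)
    return disjoint and union == graph.vertices

# Example: a tiny social graph with L_V = {"Person", "Post"}.
g = LabelledGraph(
    vertices={1, 2, 3, 4},
    edges={(1, 2), (2, 3), (3, 4)},
    label={1: "Person", 2: "Post", 3: "Person", 4: "Post"},
)
print(is_k_partitioning([{1, 2}, {3, 4}], g))  # True: disjoint, covers V
```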
Partitioning large graphs is a recognised approach to addressing scalability issues in graph data management. However, if these partitionings are of low quality, then the performance of path queries (indeed, of general pattern matching queries) greatly decreases [
18]. Intuitively, any measure of this partitioning quality should correspond to the number of inter-partition traversals, or
\( ipt \) for short, i.e., the number of times that inter-partition edges
\((v_i, v_j) \in E\) with
\(v_i \in V_i, v_j \in V_j, i \ne j\) are traversed during query execution. Current systems for improving graph partition quality either optimise data placement (graph partitioners) [
8,
12,
23,
26], or are based on selective vertex replication [
18,
21,
32]. We will improve on the output of graph partitioners, without considering vertex replication, i.e.
\(V_i \cap V_j = \emptyset \) for
\(i \ne j \).
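The \( ipt \) measure can be sketched as follows: given the sequence of edges a query execution traverses and a vertex-to-partition assignment, count the traversals that cross partitions. This is an illustrative sketch only; the names are ours.

```python
# Hedged sketch of the ipt measure described above.
def ipt(traversed_edges, partition_of):
    """Count inter-partition traversals: edges (v_i, v_j) with
    v_i in V_i, v_j in V_j, i != j, counted once per traversal."""
    return sum(1 for (u, v) in traversed_edges
               if partition_of[u] != partition_of[v])

partition_of = {1: 0, 2: 0, 3: 1, 4: 1}    # a 2-way partitioning of V
walk = [(1, 2), (2, 3), (3, 4), (3, 2)]    # edges traversed by one query
print(ipt(walk, partition_of))  # 2: traversals (2,3) and (3,2) cross
```

Note that \( ipt \) counts traversals, not distinct cut edges: the same inter-partition edge traversed twice contributes twice.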
Existing graph partitioners have two main drawbacks: Firstly, due to their computational complexity [
26], non-streaming methods [
8,
12,
23] are only suitable as offline operations, typically performed ahead of analytical workloads. For online, non-analytical workloads they require complete re-execution after each series of graph updates, which may be impractical [
10]. Simpler methods, such as grouping vertices by some hash of their ids, are efficient [
18] but yield poor
\( ipt \) scores when queried. Methods meant to partition graph streams [
26,
28] lie between these two extremes, both in terms of efficiency and quality. However, they make strong assumptions about the order of graph streams and the availability of neighbourhood information for new vertices. As a result they are also largely confined to offline application.
1.1 The TAPER re-partitioner
In this paper we present
TAPER, a graph re-partitioning system that is sensitive to evolving query workloads. Let
\(\mathcal{Q} = \{(Q_1, n_1) \dots (Q_h,n_h)\}\) denote a query workload, where
\(n_i\) is the relative frequency of
\(Q_i\) in
\(\mathcal{Q}\), and let
\(P_k(G)\) be an existing k-way partitioning of
G. This could be, for instance, a simple hash-based partitioning, or one produced by an established method such as Metis multilevel partitioning [
12] or spectral recursive octasection [
8].
The goal of TAPER is to enhance
\(P_k(G)\) by computing a new partitioning \(P_k'(G,\mathcal{Q})\) from \(P_k(G)\) that takes \(\mathcal{Q}\) into account. The new partitioning is obtained by swapping vertices across the partitions of \(P_k(G)\), using heuristics that attempt to minimise the total probability of the \( ipt \) that occur during execution of any of the queries in \(\mathcal{Q}\), a quantity we denote the total \( extroversion \). As this method only involves moving relatively few vertices from one partition to another, it is much less expensive than a complete re-partitioning, even after many iterations.
Furthermore, by virtue of its incremental nature,
TAPER is able to track changes
\(\mathcal{Q} \rightarrow \mathcal{Q'}\) in the workload, by re-partitioning its own partitioning, i.e.,
$$\begin{aligned} P_k(G,\mathcal{Q}) \xrightarrow {\mathcal{Q}'} P_k'(G,\mathcal{Q}') \end{aligned}$$
(1)
In general, given a possibly workload-agnostic and non-optimal initial partitioning
\(P^0_k(G)\),
TAPER can be used to compute a progression of partitionings:
$$\begin{aligned} P^0_k(G) \xrightarrow {\mathcal{Q}_1} P^1_k(G,\mathcal{Q}_{1}) \xrightarrow {\mathcal{Q}_2} P^2_k(G,\mathcal{Q}_{2}) \dots \end{aligned}$$
(2)
These partitionings have the property that each
\(P^i_k(G,\mathcal{Q}_{i}) \) exhibits lower extroversion than
\(P^0_k(G)\) given
\(\mathcal{Q}_i\): it is approximately optimised for that workload.
TAPER makes use of space-efficient main-memory data structures to encode
\(\mathcal{Q}\) and to associate estimates of traversal probability with the edges in
G. These are then used to calculate the extroversion of each vertex in its partition. A
TAPER re-partitioning step, as in Definition
1, consists of several internal iterations of a vertex-swapping procedure aimed at reducing extroversion one vertex at a time. Each time a new
TAPER invocation is required (
1) we update our data structures for
\(\mathcal{Q'}\) and
\(G'\), recompute vertex extroversion and begin a new round of iterations.
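One such internal iteration might be sketched as below. The per-edge probability values, the greedy move rule, and all names here are illustrative assumptions (and load balance is ignored for brevity); TAPER's actual heuristics and data structures differ and are described later in the paper.

```python
# Sketch of one internal iteration: relocate the most extroverted vertex.
def extroversion(v, edges, prob, partition_of):
    """Sum of traversal-probability estimates over v's inter-partition edges."""
    return sum(prob.get((u, w), 0.0)
               for (u, w) in edges
               if v in (u, w) and partition_of[u] != partition_of[w])

def swap_step(vertices, edges, prob, partition_of, k):
    """Move the most extroverted vertex to whichever partition
    minimises its extroversion (keeping it put if none is better)."""
    v = max(vertices, key=lambda x: extroversion(x, edges, prob, partition_of))
    best_e = extroversion(v, edges, prob, partition_of)
    best_p = partition_of[v]
    for p in range(k):
        trial = {**partition_of, v: p}        # try moving v to partition p
        e = extroversion(v, edges, prob, trial)
        if e < best_e:
            best_e, best_p = e, p
    partition_of[v] = best_p
    return v, best_e

edges = [(1, 2), (2, 3)]
prob = {(1, 2): 0.9, (2, 3): 0.1}   # estimated traversal probabilities
partition_of = {1: 0, 2: 1, 3: 1}
v, e = swap_step([1, 2, 3], edges, prob, partition_of, k=2)
print(v, partition_of[v], e)  # vertex 1 moves to partition 1; extroversion 0.0
```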
1.2 Contributions
Our specific contributions are as follows:
Firstly, from the notion of
stability of a partition [
4] we derive an operational metric of partitioning quality, expressed in terms of extroversion for each vertex;
Secondly, we describe a space-efficient encoding of traversal probabilities for each edge in
G, given
\(\mathcal{Q}\), and show how these probabilities can be updated following the evolution of
\(\mathcal{Q}\);
Thirdly, we show how
TAPER makes use of these structures to iteratively achieve a re-partitioning step (Definition
1).
We present an extensive evaluation of the
TAPER system using both real and synthetic graphs of varying sizes, and compare its performance and scalability against one-off workload-agnostic partitionings obtained using the popular Metis approach, without edge weights. In our experiments we use both a simple hash-based partitioning and a Metis partitioning as the starting point
\(P^0_k(G)\) for one invocation of
TAPER. Our results show that such an invocation converges to a stable quality within 6–7 internal iterations, and that the resulting new partitioning
\(P^{i+1}_k(G,\mathcal{Q})\) exhibits
\(70\%\) quality improvement when a hash-based
\(P^0_k(G)\) starting point is used, and about
\(30\%\) improvement when using a Metis initial partitioning.
Finally, we show experimentally how the quality of a partitioning degrades following successive simulated changes in \(\mathcal{Q}\), and how it is successfully restored by repeatedly invoking TAPER on the current partitioning and the new workload.
Two main strands of prior work are relevant to our study: (1) workload aware replication and data placement in distributed databases; and (2) graph partitioning.
In the context of online applications, distributed database queries that traverse multiple partitions are expensive [
20], incurring high communication cost and, in some implementations, resource contention. In order to achieve good latencies, distributed databases must find a data placement strategy which minimises these transactions, and the overhead they cause, whilst maintaining a balanced load across all machines involved. This is also known as reducing average query
span, i.e. the number of partitions involved in answering a query.
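As a small illustration (the names are ours, not those of any cited system), query span can be computed from the set of data items a query touches:

```python
# Illustrative sketch: span = number of distinct partitions a query involves.
def span(items_touched, partition_of):
    return len({partition_of[x] for x in items_touched})

partition_of = {"a": 0, "b": 0, "c": 1, "d": 2}
print(span(["a", "b", "c"], partition_of))  # 2 partitions involved
```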
For graph data,
balanced graph partitioning has been exhaustively studied in literature since the 1970s [
8,
12,
13,
16,
23,
25,
26,
28,
30], and a number of practical implementations are available [
12,
23]. We do not seek a new graph partitioning algorithm; rather, we propose a workload-driven method for improving partitionings that already exist.
This goal is orthogonal to that of the Loom [
7] system, also proposed by the present authors. Loom uses a variant of the traversal probability encoding presented here (Sect.
4) to produce a workload-aware partitioning of a graph
stream. This may seem to support our goal with
TAPER, however, like other streaming graph partitioners [
26,
28], once Loom has assigned a vertex to a partition it will never reassign it. This has several implications: firstly, Loom assumes that a workload
\(\mathcal{Q}\) is known
a priori with respect to vertex assignment; secondly, Loom’s partitionings are vulnerable to the aforementioned quality degradation given changes to
\(\mathcal{Q}\) over time; and thirdly, Loom may only partition a graph in full, rather than take an existing partitioning as input. This is in contrast with
TAPER, which tracks an ongoing workload
\(\mathcal{Q} \rightarrow \mathcal{Q'}\) and may continuously adapt an existing graph partitioning for good performance given a random
\(Q_i \in \mathcal{Q'}\).
Note that
TAPER’s goal of refining an existing partitioning
is supported by ParMetis [
13], a parallel implementation of the established Metis partitioner [
12]. However, this process is communication and computation intensive and as such is often only used as a “one-off” step, rather than for repeated repartitioning of a graph [
29]. Additionally, like other existing partitioners, ParMetis is agnostic to a changing query workload. Although ParMetis may be provided with weights which correspond to the traversals of individual edges by a query workload, such weights are much too expensive to maintain over time [
32].
Aside from Loom and ParMetis, Curino et al. [
3,
20] have proposed systems to tackle a problem related to our own: that of workload-aware data placement in distributed RDBMS. In particular,
Schism [
3] captures a query workload over a period of time, modelling it as a graph, each edge of which represents tuples involved in the same transaction. This graph is then partitioned using existing “one-off” techniques to achieve a min edge-cut. Mapped back to the original database, this partitioning represents an arrangement of records which causes a minimal number of transactions in the captured workload to be distributed. In
SWORD, Quamar et al. [
22] build upon the ideas presented in
Schism [
3]. They use a compressed representation of the workload graph and perform incremental re-partitioning to improve the partitioning’s scalability and sensitivity to workload changes.
Although the goal of these works and our own is similar, there exist major differences in approach. For instance, in
Schism, edges directly represent the elements accessed by a query, rather than their labels as we do. However, this fine-grained approach produces very large graphs, which are expensive to both partition and store, and may impact scalability [
22]. Furthermore, these works focus on a relational data model, where typical workloads overwhelmingly consist of short, 1–2 “hop” queries. This justifies Quamar et al.’s simplifying decision, when repartitioning a graph, to consider only queries which span a single partition. However, this assumption does not hold for general graph path and pattern matching queries. It is unclear how
SWORD’s approach would perform given a workload containing many successive join operations, equivalent to the traversals required for graph pattern matching.
Other systems, such as LogGP [
30] and CatchW [
25], are explicitly workload-aware graph partitioners, but focus on improving analytical workloads designed for the
bulk synchronous parallel (BSP) model of computation. In the BSP model a graph processing job is performed in a number of supersteps, synchronised between partitions. Both systems keep a historical log of edges traversed in previous supersteps and use this to predict future traversals and move involved vertices so that their edges are internal to partitions, reducing
\( ipt \) in subsequent supersteps.
Yet more partitioning systems [
29,
31] consider other aspects of a graph’s workload, besides edge traversals. Xu et al. [
31] note that, for many BSP workloads, reducing a workload’s total communication time by simply minimising a graph’s inter-partition edges may not be an appropriate optimisation. This is because a cluster of machines may be heterogeneous: machines may have different levels of computing power, or be virtualised across various physical hosts, affording different levels of inter-machine communication. Xu et al. therefore propose modelling the topology of the cluster upon which a graph partitioning will reside as a graph itself. Subsequently, given some basic information about a workload’s communication and computation characteristics, vertices and edges of the data graph are assigned to appropriate machines, minimising workload execution time.
The work of Vaquero et al. [
29] is perhaps the most relevant to our own with TAPER. They propose a system of iterative vertex swapping to adapt to graph changes over time (e.g. vertex additions and removals). This is highly effective at maintaining a good partitioning over time,
w.r.t min edge-cut. However,
because the system optimises for min edge-cut, it does not consider the heterogeneity of a graph, or that of its query workload. Thus the work we present here could complement that of Vaquero et al.: they consider changes to the graph, whilst we consider changes to its workload.
Not only are all these systems [
3,
22,
25,
29‐
31] focused on different types of workload than we are with
TAPER, but in an online setting, workload summaries which directly measure individual edge traversals [
3,
25,
30] are vulnerable to overfitting. Path queries in particular are often only executed over a subset of the graph, e.g. find the
Posts connected to
Person
a. If this subset changes (e.g. to
Person
j) then such a granular workload summary would be inaccurate. By considering instead the vertex labels in a workload’s queries and the graph structure local to a given edge
e,
TAPER is able to predict
e’s traversal probability even when it has not yet been traversed.
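To illustrate the intuition only (this is a loose sketch, not the actual encoding of Sect. 4), an edge can be scored by the workload frequency of queries whose label paths contain the consecutive label pair of its endpoints, so that an edge receives a non-zero estimate before it is ever traversed:

```python
# Loose sketch: label-based scoring of an edge (u, v), given a workload of
# (label_path, relative_frequency) pairs. Names are illustrative.
def edge_score(u, v, label, workload):
    """Sum the frequencies of queries whose label paths contain
    the consecutive pair (l(u), l(v))."""
    pair = (label[u], label[v])
    return sum(freq for path, freq in workload
               if pair in zip(path, path[1:]))

label = {1: "Person", 2: "Post", 3: "Person"}
workload = [(("Person", "Post"), 0.7), (("Post", "Person"), 0.3)]
print(edge_score(1, 2, label, workload))  # 0.7: matched by the first query
```

Because the score depends only on vertex labels and the workload, a never-traversed Person-to-Post edge is scored identically to a frequently traversed one with the same labels, which is what makes the summary robust to the query subset shifting, e.g. from Person a to Person j.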
Further prior work has focused on exploiting statistical properties of a query workload, to efficiently manage graph data through replication [
18,
21,
32]. Whilst [
18,
21] focus on 1-hop queries over social networks, Yang et al. [
32] propose algorithms to efficiently analyse general online query workloads and to dynamically replicate “hotspots” (cross partition clusters of vertices that are being frequently traversed), thereby temporarily dissipating network load. Whilst highly effective at dealing with unbalanced query workloads, Yang et al. focus solely upon the replication of vertices and edges using temporary
secondary partitions. They do not improve upon the initial partitioning, nor do they consider workload characteristics when producing it. This can result in replication mechanisms doing far more work than is necessary over time, adversely affecting the performance of a system. As a result, the enhancement techniques we present here would also complement many workload aware replication approaches, such as that proposed by Yang et al.