Introduction
Domains
Social networks
World Wide Web
Telecommunications
Recommendation systems
Transports, smart cities and IoT
Epidemiology
Laboratory for Web Algorithmics
[42] and by Web Data Commons
[43].
Motivation
Single-machine
Multi-machine
Document roadmap
Graph algorithms: natures and types
Apache Spark
[53] or Apache Flink
[14]), some challenges must be considered. While the field of graph processing develops with the goal of improving how we manipulate and extract value from graph-based data, as its techniques become more refined, other aspects of graph structures gain prominence as new challenges.
Algorithms
Neo4j
have extensions like the Neo4j APOC Library
for languages like Cypher
to start algorithms with global computation from the graph query language [63].
Computational representations
Neo4j
[67], where one may model more than one relation type between the same vertices. Additionally, given a graph \(G = (V, E)\), the set of vertices of \(G\) is written as \(V(G)\) and the set of edges as \(E(G)\). More commonly, we write \(V(G) = V\) and \(E(G) = E\).
WebGraph
[1], which exploits certain properties of web graphs to represent them with increased compression. An important property they exploit is locality: since many links stay within the same domain, if the web graph is ordered lexicographically by URL, most links point to nearby vertices. Another property is similarity: pages that are close in the lexicographical order are likely to have similar sets of neighbours. The study performed with WebGraph
also highlighted, among other facts, the following: similarity was found to be much more concentrated than previously thought, and consecutivity is common in web graphs. Ordering properties (and different techniques to produce orderings) have also been exploited by the same authors to obtain compression of social networks. WebGraph
was used in an extensive analysis of many different data sets, which were made available online by the Laboratory for Web Algorithmics
[1, 42, 70‐72].
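The locality property that WebGraph-style compression exploits can be illustrated with simple gap encoding of sorted adjacency lists (an illustrative sketch; WebGraph's actual codes are considerably more elaborate, and the function names are ours):

```python
# Illustrative sketch of the locality idea behind WebGraph-style compression:
# lexicographic ordering makes a vertex's neighbour IDs cluster, so storing
# gaps (differences) instead of absolute IDs yields small integers that a
# variable-length code can pack into few bits.

def gap_encode(neighbours):
    """Sort the adjacency list, keep the first ID, then successive gaps."""
    gaps, prev = [], None
    for v in sorted(neighbours):
        gaps.append(v if prev is None else v - prev)
        prev = v
    return gaps

def gap_decode(gaps):
    """Invert gap_encode by accumulating the gaps."""
    out, acc = [], 0
    for i, g in enumerate(gaps):
        acc = g if i == 0 else acc + g
        out.append(acc)
    return out

adj = [1004, 1001, 1003, 1000]   # links of one page, close in the ordering
print(gap_encode(adj))           # [1000, 1, 2, 1]
```

The small gaps are then amenable to universal codes (gamma, zeta and similar), which is where the actual bit savings come from.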
RDF
data sets by internally using compressed bit vectors. Conceptually, we recursively subdivide each block of a graph’s adjacency matrix until we reach the level of individual cells of the matrix. The idea is to divide (following an MX-Quadtree
strategy [74, Sec. 1.4.2.1]) the matrix into blocks and then assign 0 to a block if it only contains zeros (no edges) or 1 if it contains at least one edge. We show in Fig. 3 a sample adjacency matrix on the left and the corresponding \(k^2\)-tree representation of the decomposition. This representation of the adjacency matrix is actually a \(k^2\)-tree of height \(h = \lceil \log _{k}{n}\rceil \), where \(n = |V|\) and each node contains a single bit of data: 1 for internal nodes and 0 for leaves, except at the last level, in which all nodes are leaves representing values from the adjacency matrix. It is a data structure that also efficiently matches the sparseness and clustering properties of web graphs.
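A minimal sketch of the recursive subdivision behind the \(k^2\)-tree (illustrative only; real implementations store the bits level by level in compressed bit vectors, whereas this sketch emits them in depth-first order):

```python
# Sketch: building a k^2-tree bit sequence (k = 2) of an adjacency matrix by
# recursive block subdivision. A block emits 1 and is subdivided if it holds
# at least one edge; an all-zero block emits 0 and its subtree is pruned.

def k2_bits(matrix, k=2):
    n = len(matrix)          # n must be a power of k
    bits = []
    step = n // k
    for bi in range(k):
        for bj in range(k):
            block = [row[bj*step:(bj+1)*step]
                     for row in matrix[bi*step:(bi+1)*step]]
            if step == 1:
                bits.append(block[0][0])        # last level: actual cells
            elif any(any(row) for row in block):
                bits.append(1)                  # block holds an edge
                bits.extend(k2_bits(block, k))  # recurse into it
            else:
                bits.append(0)                  # all-zero block: pruned
    return bits

adj = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
]
print(k2_bits(adj))   # [1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1]
```

Only non-empty blocks are expanded, which is why sparse, clustered matrices compress well under this scheme.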
Log(Graph)
[75] is a graph representation that combines high compression ratios with low overhead to enable competitive processing performance while making use of compression. It achieved compression ratios similar to WebGraph
while reaching speedups of more than 2x. The authors achieve results by applying logarithm-based approaches to different graph elements. They describe its application on fine elements of the adjacency array (the basis of Log(Graph)
): vertex IDs, offsets and edge weights. From information theory, the authors note that a simple storage lower bound for an entity is the logarithm of its number of possible instances, i.e. the number of bits required to distinguish them. Using this type of awareness on the different elements that represent an adjacency array and by incorporating bit vectors, the authors present a C++
library for the development, analysis and comparison of graph representations composed of the many schemes described in their work.
WebGraph
framework [1, 2] as one of the most well-known, and more recently the \(k^2\)-tree structure [73, 82‐84], only later was the focus cast on being able to represent big graphs with compression while allowing for updates. Furthermore, if we add the possibility of dynamism of the data (the graph is no longer a static object that one wishes to analyse) to the factors guiding representation choice, then it makes sense to think about how to represent a big graph in common hardware not only for storage purposes but also for efficient access with mutability. Works such as VeilGraph
[85] approach the concepts of efficient representations by, for example, incorporating summary graph structures to reduce the total computations performed in the context of graph updates.
Graph processing: computational units and models
Unit: Vertex-Centric (TLAV)
Pregel
system [8]. An open-source implementation of this model known as Apache Giraph
[12] was then offered to the public. Other example systems that were created using that model are GraphLab
[90], PowerGraph
[60], PowerLyra
[61]. As the unit of computation is the vertex itself, the user algorithm logic is expressed from the perspective of vertices. The idea is that a vertex-local function will receive information from the vertex’s incoming neighbours, perform some computation, potentially update the vertex state and then send messages through the outgoing edges of the vertex. A vertex is the unit of parallelization, and a vertex program receives a directed graph and a vertex function as input. The model was later extended with the concept of vertex scope, which includes the adjacent edges of the vertex. The order of these steps varies depending on the type of vertex-centric model used (scatter-gather, gather-apply-scatter).
Model: Superstep Paradigm
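The superstep organization of vertex-centric computation can be sketched minimally as follows (the executor and function names are ours, not from any specific system; the maximum-value example is the classic illustration):

```python
# Toy superstep (BSP) executor for vertex-centric programs: each superstep,
# every vertex consumes its inbox, updates its value and may emit a message
# on its out-edges; a barrier separates supersteps.

def run_supersteps(out_edges, values, vertex_fn, max_steps=30):
    vertices = list(values)
    inbox = {v: [] for v in vertices}
    for step in range(max_steps):
        outbox = {v: [] for v in vertices}
        sent = False
        for v in vertices:
            values[v], msg = vertex_fn(step, values[v], inbox[v])
            if msg is not None:
                sent = True
                for u in out_edges.get(v, []):
                    outbox[u].append(msg)
        inbox = outbox          # synchronization barrier between supersteps
        if not sent:            # quiescence: no messages in flight
            break
    return values

# Classic example: propagate the maximum value along directed edges.
def max_fn(step, value, messages):
    new = max([value] + messages)
    # send on the first superstep, or whenever the value improved
    return new, (new if step == 0 or new > value else None)

edges = {1: [2], 2: [3], 3: [1], 4: [3]}
# vertex 4 has no incoming edges, so it keeps its initial value
print(run_supersteps(edges, {1: 3, 2: 6, 3: 2, 4: 1}, max_fn))
```

Real systems distribute the vertex loop across workers and overlap message delivery with computation; the barrier semantics, however, are exactly as sketched.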
Model: Scatter-Gather
Model: Gather-Apply-Scatter
PowerGraph
[60] and was aimed at solving the limitations encountered by vertex-centric or scatter-gather models when operating on power-law graphs. The discrepancy between the ratios of high-degree and low-degree vertices leads to imbalanced computational loads during a superstep, with high-degree vertices being more computationally heavy and becoming stragglers. GAS consists of decomposing the vertex program into several phases, such that computation is more evenly distributed across the cluster. This is achieved by parallelizing the computation over the edges of the graph. In the gather phase, a user-defined function is applied to each of the adjacent edges of each vertex in parallel.
Unit: Edge-Centric (TLEV)
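In the edge-centric (TLEV) style, state lives in vertices while each iteration streams sequentially over the edges. A minimal sketch (ours, using Bellman-Ford-style relaxation as the per-edge function; it is an illustration of the access pattern, not any particular system's implementation):

```python
# Sketch: edge-centric processing. The computation is a function applied to a
# stream of edges, scanned sequentially each round; vertex state is updated
# as a side effect. Sequential edge scans are what make this pattern friendly
# to secondary storage.

def edge_centric_sssp(edges, source, n):
    """Single-source shortest paths via repeated sequential edge scans."""
    INF = float("inf")
    dist = [INF] * n
    dist[source] = 0
    changed = True
    while changed:                    # one full edge scan per round
        changed = False
        for u, v, w in edges:         # scatter: try to relax each edge
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
    return dist

edges = [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 1)]
print(edge_centric_sssp(edges, 0, 4))   # [0, 3, 1, 4]
```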
X-Stream
[56] and Chaos
[62], which specify the computation from the point of view of edges. These systems made use of this paradigm to optimize the usage of secondary storage and network communication with cloud-based machines to process large graphs.
Unit: Sub-graph-Centric (TLAG)
Model: MEGA
Tux2
[93] is a system designed for graph computations in machine learning. The model is composed of four functions defined by the user: an exchange function, which is applied to each edge and can change the value of the edge and adjacent vertices; an apply function, to synchronize the value of vertices with their replicas; a global sync function, to perform shared computations and update values shared among partitions; and a mini-batch function, to indicate the execution sequence of the other functions in each round.
Dimension: partitioning
Edge-Cut (EC)
Vertex-Cut (VC)
Hybrid-Cut (HC)
PowerLyra
[61] system, for example, allocates the incoming edges of low-degree vertices in a single worker. It uses edge-cut for low-degree vertices and vertex-cut for high-degree vertices.
Stream-based partitioning
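A streaming partitioner in the spirit of the Linear Deterministic Greedy heuristic discussed in this section can be sketched as follows (a simplified illustration under our own naming, not the published algorithm's exact form):

```python
# Sketch: streaming vertex partitioning in the spirit of Linear Deterministic
# Greedy (LDG). Each arriving vertex goes to the partition holding most of
# its already-placed neighbours, damped by how full that partition is, so the
# graph never needs to be fully resident in memory.

def ldg_partition(stream, num_parts, capacity):
    parts = [set() for _ in range(num_parts)]
    placed = {}                          # vertex -> partition index
    for v, neighbours in stream:         # vertices arrive one at a time
        def score(i):
            common = sum(1 for u in neighbours if placed.get(u) == i)
            # damp by fullness; break ties toward the least-loaded partition
            return (common * (1 - len(parts[i]) / capacity), -len(parts[i]))
        best = max(range(num_parts), key=score)
        parts[best].add(v)
        placed[v] = best
    return placed

stream = [
    (0, []), (1, [0]), (2, [0, 1]),      # one triangle arrives first
    (3, []), (4, [3]), (5, [3, 4]),      # then a second, disjoint triangle
]
print(ldg_partition(stream, num_parts=2, capacity=3))
```

With this input each triangle ends up whole on one of the two partitions, which is the kind of neighbourhood-preserving balance the heuristic aims for.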
Random heuristic
and the Linear Deterministic Greedy
[100], Gemini
which uses chunk-based partitioning assuming an adjacency list model [101], Fennel
[102]), vertex-cut partitioning (e.g. Grid
heuristic [103], PowerGraph
greedy heuristic [60], Graphbuilder
[103] placing the edge in the smallest partition, HDRF
[104] method, which takes vertex degrees into consideration), and these methods differ in how partitioning is achieved under parallel and distributed execution. Stream-based partitioning is also a good choice for loading the graph, as the graph does not have to be fully loaded in memory to be partitioned.
Distributed partitioning
Revolver
, which performs vertex-centric graph partitioning with reinforcement learning, assigns an agent to each vertex, with agents assigning vertices to partitions based on their probability distributions (these are then refined based on feedback [97]). The authors of [50] note that other approaches consider the partitioning problem as a multi-objective and multi-constraint problem, achieving better results compared to one-phase methods [108]. Distributed partitioning systems are well suited to scenarios where partitioning is performed once and calculations are then repeatedly performed.
Dynamic graph partitioning
xDGP
[109] to repartition massive graphs to adapt to structural changes, GPS
[110] which reassigns vertices based on communication patterns, X-Pregel
[98] with reduced message exchanges and dynamic repartitioning). Dynamic partitioning methods can deliver very good load balancing and communication cost reductions because they consider heterogeneous hardware and runtime characteristics.
Partitioning: summary
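To recap the two main cut families concretely, a toy comparison on one small graph (the modulo/index-based placements are hypothetical stand-ins for a real heuristic):

```python
# Sketch: edge-cut vs. vertex-cut on the same tiny star-like graph.
# Edge-cut assigns vertices to workers (edges spanning workers are "cut");
# vertex-cut assigns edges to workers (vertices spanning workers are
# replicated, and replicas must be kept synchronized).

edges = [(0, 1), (0, 2), (0, 3), (2, 3)]
workers = 2

# Edge-cut: vertex -> worker; an edge is cut if its endpoints differ.
v_owner = {v: v % workers for e in edges for v in e}
cut_edges = [e for e in edges if v_owner[e[0]] != v_owner[e[1]]]

# Vertex-cut: edge -> worker; a vertex is replicated on every worker
# that holds at least one of its edges.
e_owner = {e: i % workers for i, e in enumerate(edges)}
replicas = {v: {e_owner[e] for e in edges if v in e} for v in v_owner}

print("cut edges:", cut_edges)
print("replication factor:",
      sum(len(r) for r in replicas.values()) / len(replicas))
```

Edge-cut pays in communication along cut edges; vertex-cut pays in replica synchronization, which tends to be the cheaper currency for high-degree vertices in power-law graphs.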
Dimension: Dynamism
Chaos
[62] and X-Stream
[56] with their edge-centric approach). Considering whether a system targets graphs that change or are immutable (static) is an obvious way to separate graph processing systems when classifying them. However, this dimension is actually a spectrum between the immutable (e.g. stream-based perspectives to process static graphs) and the changing: for example, is the whole graph structure kept in memory (or secondary storage) in a single machine (or across cluster nodes), or is it discarded according to some criteria (such that one simply updates mathematical properties of the graph using only recent information from the stream)? For this spectrum, the authors of [112] cover definitions found in the literature:
Temporal graphs
ImmortalGraph
[114] is a storage and execution engine designed with temporal graphs in mind, having achieved greater efficiency than database solutions for graph queries. ImmortalGraph
schedules common bulk operations in a way that maximizes the benefit of in-memory data locality. It explores the relation between locality, parallelism and incremental computation while enabling mining tasks on temporal graphs. For more information on the topic of temporal graphs, we direct the reader to [115].
Streaming graph algorithms
STINGER
data structure has been used for streaming graphs as well [118].
Sketching and dynamic graph streams
Multi-pass streaming graph algorithms
Dynamic graph algorithms
Ringo
[122] is a single-machine analytics system that supports dynamic graphs.
Dimension: workload
Neo4j
[123] and JanusGraph
[124], among others. These databases offer graph query languages (usually even allowing interchangeability between languages) such as Cypher
or Gremlin
[125]. They are built to store the graph, some with sharding (horizontal scaling) to distribute the graph across the storage/computational infrastructure (some outsource the storage medium to database technologies such as HBase
[126] or Cassandra
[127]), others in a centralized server (but allowing cluster nodes for the specific purpose of redundancy). They employ schemes to store the graph efficiently while offering transaction mechanisms to operate over the graph and to perform queries. The latter type (b) is seen in big (graph) data processing systems like Spark (GraphX library)
[13] and Flink (Gelly library)
[14]). The mentioned names are all distributed processing frameworks that can take advantage of multi-core machines and clusters. These systems and their libraries allow for expressive computation over graphs in few lines of code. Many of the systems come with their own sets of graph algorithms, allowing for the composition of workflows while abstracting away many details from the programmer (regarding distributed computation orchestration and the internal implementation of the graph algorithms).
A relevant distinction here is that between online analytical processing (OLAP
) and online transaction processing (OLTP
). OLAP
is an approach to enable answering multi-dimensional analytical queries quickly. Among its instances we may find tasks such as business reporting for sales, management reporting, business process management [128], financial reporting and others. OLTP
, on the other hand, refers to systems that enable and manage transaction-oriented applications, where a transaction means, in a computational context, an atomic state change that takes place in a database system. OLTP
examples include retail sales and financial transaction systems, and applications of this type tend to be high-throughput and update/insertion-intensive in order to provide availability, speed, recoverability and concurrency [129]. The former type of task (a) may be associated to OLTP
systems, as the goal is to store representations of graphs by quickly ingesting new information, efficiently representing it regarding space consumption and access speed, and being able to execute updates under ACID properties (or a subset of those). For this type of task (a), one may find numerous graph databases to match the description, such as those designed for semantic representations, for property graph models, or for other specific purposes. The latter type of task (b) may be associated to OLAP
, where there is a focus on extracting value from the data and the nature of the task is typically read-only. We include graph processing systems (not databases) in this group of OLAP
-type tasks, even the systems which support mutability in graphs due to supporting dynamism in any form. There is thus an overlap between OLTP
-type tasks and graph databases, and there is also an overlap between OLAP
-type tasks and graph processing systems. While the distinction between OLAP
and OLTP
task types is not a dimension that perfectly divides systems in the graph processing landscape, we note that such a distinction holds value in guiding future taxonomies of the landscape, and for that reason we include it as a dimension.
Single-machine and shared-memory parallel approaches
GraphLab was initially developed (in C++
) for parallel machine learning and later extended to support distributed settings while retaining strong data consistency guarantees [90]. The authors evaluate it on Amazon EC2, outperforming equivalent MapReduce
implementations by over 20X and matching the performance of specifically-crafted MPI
implementations. GraphLab
requires the whole graph and program state to reside in RAM. It uses a chromatic engine, in which no adjacent vertices share the same colour, to enable the efficient use of network bandwidth and processor time. The authors evaluate it for applications such as Netflix movie recommendation, video co-segmentation and named entity recognition. It is open-source [131] under the Apache License 2.0
.
GRACE is an implementation (C++
) of a parallel execution engine for both synchronous and user-specified asynchronous execution policies. GRACE
stores directed graphs, and its computation is expressed and performed in a way similar to Pregel
. It provides additional flexibility by allowing the user to relax synchronization of computation. This is achieved with user-defined functions which update the scheduling priority of vertices that receive messages (the vertex order in which computation takes place within an iteration). GRACE
’s design targets both shared-memory and distributed system scenarios, but the initial prototype focuses on shared-memory. We did not find the source code available.
Ligra is a C++
lightweight graph processing framework targeting shared-memory parallel/multi-core machines, easing the writing of graph traversal algorithms. This framework offers two map primitives to operate a given logic on vertices (VertexMap
) and edges (EdgeMap
) and supports two data types: the traditional graph \(G = (V, E)\) as we described in an earlier section, and another one to represent subsets of vertices. The interface is designed to enable the processing of edges in different orders depending on the situation (as opposed to Pregel
or Giraph
). The code of Ligra
represents in-edges and out-edges as arrays: the in-edges of all vertices are partitioned by their target vertex and store the source vertices, while the out-edges are partitioned by source vertex and store the target vertices. While the authors claim to have achieved good performance results, they mention Ligra
does not support algorithms based on modifying the input graph. It is available [134] under the MIT License
.
Ringo offers a Python
front-end on a scalable parallel C++
back-end, representing the graph as a hash table of nodes. It supports fast execution times with exploratory and interactive use, offering graph algorithms in a high-level language and rich support for transformations of input data into graphs. Ringo
is open-source and available [135] under the BSD License
.
Polymer is licensed under the Apache License 2.0
and implemented in C++
. It innovated by differentially allocating and placing topology data, application-defined data and mutable run-time graph system states according to access patterns to minimize remote accesses. Polymer
also deals with random accesses by converting the random ones into sequential remote accesses using lightweight vertex replication across the NUMA nodes. It was built with a hierarchical barrier for increased parallelism and locality. The design also includes edge-oriented balanced partitioning for skewed graphs and adaptive data structures in function of the fraction of active vertices. It was compared to Ligra
, X-Stream
and Galois
on an 80-core Intel machine (no hyper-threading) and on a 64-core AMD machine. For different algorithms across several data sets, Polymer
almost always achieved the lowest execution time.
GraphMat is a C++ framework
aimed at bridging user-friendly graph analytics and native hand-optimized code. It presents itself as a vertex-centric framework that does not sacrifice performance, as it takes vertex programs and maps them exclusively to sparse matrix high-performance back-end operations. GraphMat
takes graph algorithms expressed as vertex programs and performs generalized sparse matrix-vector multiplication on them. It achieved greater performance than other frameworks, being 5-7X faster than GraphLab
, Galois
and CombBLAS
. It also achieved multi-core scalability, being over 10X faster than a single-threaded implementation on a 24-core machine. It is open-source and available [139] under specific conditions set by Intel.
Mosaic, written in C++,
uses a compact graph representation based on Hilbert-ordered tiles for locality, load balancing and compression, together with a hybrid computation model combining vertex-centric operations (on host processors) and edge-centric operations (on co-processors). Mosaic
is open-source [141] under the MIT License
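Mosaic's tiles are ordered along a Hilbert curve for locality; the classic conversion from a one-dimensional Hilbert index to two-dimensional grid coordinates (the standard textbook algorithm, not Mosaic's code) can be sketched as:

```python
# Sketch: the classic iterative Hilbert-curve conversion. Consecutive
# indices d map to adjacent grid cells, which is why ordering tiles (or
# edges) by Hilbert index preserves two-dimensional locality.

def d2xy(n, d):
    """Convert Hilbert index d into (x, y) on an n x n grid (n a power of 2)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx                      # move into the right quadrant
        y += s * ry
        t //= 4
        s *= 2
    return x, y

print([d2xy(4, d) for d in range(8)])    # first half of a 4x4 traversal
```

Each step of the traversal moves to a neighbouring cell, so tiles that are consecutive on disk cover adjacent regions of the adjacency matrix.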
.High-performance computing
The Parallel Boost Graph Library (PBGL
) [142]. It is an extension (C++
) of Boost’s
graph library. It is a distributed graph computation library, also offering abstractions over the communication medium (e.g. MPI
). The graph is represented as an adjacency list that is distributed across multiple processors. In PBGL
, vertices are divided among the processors, and each vertex’s outgoing edges are stored on the processor storing that vertex. PBGL
was evaluated on a system composed of 128 compute nodes connected via Infiniband. It is available [143] under a custom Boost Software License 1.0
.
CombBLAS is a system written in C++
offering linear algebra primitives based on sparse arrays for graph analytics. This system considers the adjacency matrix of the graph as a sparse matrix data structure. CombBLAS
is edge-based in the sense that each element of the matrix represents an edge and the computation is defined over it. It decouples the parallel logic from the sequential parts of the computation and makes use of MPI
. However, its MPI
implementation does not take advantage of flexible shared-memory operations. Its authors targeted hierarchical parallelism of supercomputers for future work. It is available [145] under a custom license.
HavoqGT is a C++
system with techniques for processing scale-free graphs using distributed memory. To handle the scale-free properties of the graph, it uses edge list partitioning to deal with high-degree vertices (hubs) and dummy vertices to represent them to reduce communication hot spots. HavoqGT
allows algorithm designers to define vertex-centric procedures in a distributed asynchronous visitor queue. This queue is part of an asynchronous visitor pattern designed to tackle load imbalance and memory latencies. HavoqGT
targets supercomputers and clusters with local NVRAM. It is available [147] under the GNU Lesser General Public License 2.1
.
Distributed graph processing systems
These systems mostly target analytical workloads (OLAP
). Their use of different architectures (from using local commodity clusters to cloud-based execution) and greater flexibility of deployment scenarios differentiate them from those of the previous section. The following systems are relevant names in the literature, with Giraph
being the first open-source implementation of the Pregel
approach to graph processing, and Spark
and Flink
being open-source general distributed processing systems with graph processing APIs:
Giraph is a Java
implementation of Pregel
[8], tailor-made for graph algorithms, supporting the GAS model and licensed [148] under the Apache License 2.0
. It was created as an efficient and scalable fault-tolerant implementation on clusters with thousands of commodity hardware, hiding implementation details underneath abstractions. Work has been done to extend Giraph
from the think-like-a-vertex (TLAV) model to think-like-a-graph (TLAG) [57]. It uses Hadoop
’s MapReduce
implementation to process graphs. Giraph
[148] allows for master computation and sharded aggregators (relevant when computing a final result composed of intermediate data from nodes), has edge-oriented input, and also uses out-of-core computation, keeping a limited number of partitions in memory. Partitions are stored in local disks, and for cluster computing settings, the out-of-core partitions are spread out across all disks. Giraph
attempts to keep vertices and edges in memory and uses only the network for the transfer of messages. Improving Giraph
’s performance by optimizing its messaging overhead has also been studied [149]. It is interesting to note that single-machine large-memory systems such as Ringo
highlight the message overhead as one of the major reasons to avoid a distributed processing scheme.
Naiad is an open-source (Apache License 2.0
) dataflow processing system [151] offering different levels of complexity and abstractions to programmers. It allows programmers to implement graph algorithms such as weakly connected components, approximate shortest paths and others while achieving better performance than other systems. Naiad
is implemented in C#
and allows programmers to use common high-level APIs to express algorithm logic and also offers a low-level API for performance. Its concepts are important and other systems could benefit from offering tiered programming abstraction levels as in Naiad
. Its low-level primitives allow for the combination of dataflow primitives (similar to those VeilGraph
uses from Flink
) with finer-grained control on iterative computations. An extension to Flink
’s architecture to offer this detailed control would enrich the abilities that our framework can offer to users.
Originating as Stratosphere
[152], it is a framework which supports built-in iterations [14] (and delta iterations) to efficiently aid in graph processing and machine learning algorithms. It is licensed [153] under the Apache License 2.0
and has a graph processing API called Gelly
, which comes packaged with algorithms such as PageRank, single-source shortest paths and community detection, among others. Flink
supports Java
, Python
and Scala
. It explicitly supports three vertex-based programming models: think-like-a-vertex (TLAV), described as the most generic model, supporting arbitrary computation and messaging for each vertex; Scatter-Gather, which separates the logic of message production from the logic of updating vertex values, typically giving these programs lower memory requirements (concurrent access to the inbox and outbox of a vertex is not required) while potentially leading to non-intuitive computation patterns; and Gather-Sum-Apply-Scatter (GAS), which is similar to Scatter-Gather except that the Gather phase parallelizes the computation over the edges, the messaging phase distributes the computation over the vertices, and vertices work exclusively on their neighbourhood, whereas in the previous two models a vertex can send a message to any vertex provided it knows its identifier. It supports all Hadoop
file systems as well as Amazon S3
and Google Cloud
storage, among others. Delta iterations are also possible with Flink
, which is quite relevant as they take advantage of computational dependencies to improve performance. It also has flexible windowing mechanisms to operate on incoming data (the windowing mechanism can also be based on user-specific logic). Researchers have also looked into extending its DataStream
constructs and its streaming engine to deal with applications where the incoming flow of data is graph-based [154].
Spark offers the GraphX
[13] graph processing library, licensed [156] under the Apache License 2.0
. It is a graph processing framework built on top of Spark
(a framework supporting Java
, Python
and Scala
), enabling low-cost fault-tolerance. The authors target graph processing by expressing graph-specific optimizations as distributed join optimizations and graph views’ maintenance. In GraphX
, the property graph is reduced to a pair of collections. This way, the authors are able to compose graphs with other collections in a distributed dataflow framework. Operations such as adding additional vertex properties are then naturally expressed as joins against the collection of vertex properties. Graph computations and comparisons are thus an exercise in analysing and joining collections.
GraphTau is built on Spark
(Java
, Scala
). It represents computations on time-evolving graphs as a stream of consistent and resilient graph snapshots, together with a small set of operators that manipulate such streams. GraphTau
builds fault-tolerant graph snapshots as each small batch of new data arrives. It is also able to periodically load data from graph databases and reuses many operators from GraphX
and Spark Streaming
. For algorithms (based on label propagation) that are not resilient to graph changes, GraphTau
introduced an online rectification model, where errors caused by underlying graph modifications are corrected in an online fashion with minimal state. Its API frees programmers from having to implement graph snapshot generation, windowing operators and differential computation mechanisms. We did not find its source code available.
Tink is built on Flink
(Java
, Scala
) and focuses on interval graphs, where each edge has an associated starting time and ending time. The author created different graphs with information provided by Facebook and Wikipedia in order to evaluate the framework. Tink
defines a temporal property graph model. It is available online [159], although we did not find information pertaining to licensing.
Flink
and Spark
are the most widely-known distributed processing frameworks (we note GraphTau
, although its code is not available, is built over Spark
) based on dataflow programming. While the use of dataflows grants flexibility to program implementation and execution by decoupling the program logic from how it is translated to the workers of a cluster, the graph libraries of these systems do not allow, in an efficient way, for a graph to be updated using stream-processing semantics while also maintaining the graph structure during computation. It is possible to update graphs using these systems, but they make use of batch processing APIs, for which the dataflow graphs must not become excessively big (or else dataflow plan optimizers may be locked in the phase of exploring the optimization space of the execution plan) and the graph must be periodically written to secondary storage (to avoid having progressively bigger execution plans).
Flink
’s Gelly
library has been used in GRADOOP
, which is an open-source [160] (Apache License 2.0
) distributed graph analytics research framework [161] under active development and providing higher-level operations. GRADOOP
extends Gelly
with additional specialized operators such as a graph pattern matching operator (which abstracts a cost-based query engine) and a graph grouping operator (implemented as a composition of map, filter, group and join transformations on Flink
’s DataSet
). GRADOOP
also adopts the Cypher
query language (typically found in graph databases like Neo4j
) to express logic that is translated to the relational algebra that underlies Flink
’s DataSet
[162].
Spark
has its graph processing library GraphX
which was built over the system’s batch processing API, like Flink
’s Gelly
and suffers from the same previously mentioned limitations. A higher-level API was designed to extend the functionalities of GraphX
while harnessing Spark
’s DataFrame
API. For this, the GraphFrames
open-source [163] (Apache License 2.0
) library was created [164]. A look at its implementation reveals that it has fewer high-level operations than Gelly
. Effectively, without simulating some of Gelly
’s API, equivalent programs in GraphX
lend themselves to more conceptual verbosity due to the lack of syntactic sugar.
We now summarize Flink, Spark
and the graph processing ecosystems built on top of them. Gelly
’s equivalent in Spark
is GraphX
, implemented in Scala
. Vertices and edges are manipulated by using Spark
’s Resilient Distributed Datasets
(RDDs
), which can be viewed as a conceptual precursor to Flink
’s DataSet
. Spark
also offers the DataFrame
API to enable tabular manipulation of data. GraphFrames
is another graph processing library for Spark
. While it has interoperability and a certain overlap with the functionality offered in GraphX
, it integrates the tabular perspective supported by Spark
’s DataFrame
API and also supports performing traversal-like queries of the graph via SparkSQL
. In this way, GraphFrames
provides graph analytics capabilities in Spark
much the same way GRADOOP
does in Flink
.
X-Stream
and Chaos
are grouped together as they brought relevance to the edge-centric (TLAE) model and employed it to explore novel ways to balance network latencies and use of SSDs to increase performance:
X-Stream
is an open-source system written in C++
which introduced the concept of edge-centric graph processing via streaming partitions. X-Stream
exposes an edge-centric scatter-gather programming model that was motivated by the lack of access locality when traversing edges, which makes it difficult to obtain good performance. State is maintained in vertices. This tool uses streaming partitions, which work well with RAM and secondary (SSD and magnetic disk) storage types. It does not provide any way to iterate over the edges or updates of a vertex. Sequential access to vertices leads to random access of edges, which decreases performance. X-Stream
is innovative in the sense that it enforces sequential processing of edges (edge-centric) in order to improve performance. It is available [165] under the Apache License 2.0
.
Chaos is a system written in C++
which had its foundations on X-Stream
. On top of the secondary storage studies performed in the past, graph processing in Chaos
achieves scalability with multiple machines in a cluster computing system. It is based on different functionalities: load balancing, randomized work stealing, sequential access to storage and an adaptation of X-Stream
’s streaming partitions to enable parallel execution. Chaos
is composed of a storage sub-system and a computation sub-system. The former exists concretely as a storage engine in each machine. Its concern is that of providing edges, vertices and updates to the computation sub-system. Previous work on X-Stream
highlighted that the primary resource bottleneck is the storage device bandwidth. In Chaos
, the storage and computation engines’ communication is designed in a way that storage devices are busy all the time, thus optimizing for the bandwidth bottleneck. It was released [166] under the Apache License 2.0.
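Randomized work stealing can be illustrated with a toy single-process simulation (illustrative only; Chaos implements stealing across machines over its streaming partitions): idle workers take work from randomly chosen busy workers, balancing load without central coordination.

```python
import random

# Toy simulation of randomized work stealing (not Chaos's engine):
# each worker drains its own queue; when empty, it steals the oldest
# task from a randomly chosen non-empty victim.
def run(tasks_per_worker, steal_seed=0):
    rng = random.Random(steal_seed)
    queues = [list(t) for t in tasks_per_worker]
    done = [[] for _ in queues]
    while any(queues):
        for w, q in enumerate(queues):
            if q:
                done[w].append(q.pop())  # work from own queue
            else:
                victims = [v for v in range(len(queues)) if queues[v]]
                if victims:
                    v = rng.choice(victims)
                    done[w].append(queues[v].pop(0))  # steal oldest task
    return done

done = run([["a1", "a2", "a3", "a4"], []])
# All tasks complete, and the idle worker stole part of worker 0's load.
assert sorted(sum(done, [])) == ["a1", "a2", "a3", "a4"]
assert len(done[1]) > 0
```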
PowerLyra is a system written in C++
which adopts different partitioning and computing strategies depending on vertex types. The authors note that most systems use a one-size-fits-all approach. They note that Pregel
and GraphLab
focus on hiding latency by evenly distributing vertices to machines, making resources locally accessible. This may result in imbalanced computation and communication for vertices with higher degrees (frequent in scale-free graphs). Another design example provided is that of PowerGraph
and GraphX
which focus on evenly parallelizing the computation by partitioning edges among machines, incurring communication costs on vertices, even those with low degrees. PowerLyra
was released under the Apache License 2.0
[167].

Kineograph
performs computation on static snapshots, simplifying algorithm design. We did not find its source code online.

Tornado is built on Apache Storm
and provides an asynchronous bounded iteration model, offering fine-grained updates while ensuring correctness. It is based on two observations: 1) loops starting from good enough guesses usually converge quickly; 2) for many iterative methods, the running time is closely related to the approximation error. From this, an execution model was built where a main loop continuously gathers incoming data and instantly approximates the results. Whenever a result request is received, the model creates a branch loop from the main loop. This branch loop uses the most recent approximations as a guess for the algorithm. We did not find its source code online.

KickStarter
deals with the problem of vertex values being invalidated by edge deletions: it identifies the impacted values and corrects them before the following computation, achieving good results on real-world use cases. By focusing on monotonic graph algorithms, however, its scope is narrowed to selection-based algorithms, in which updating a vertex value implies choosing a neighbour under some criterion. KickStarter
is now known as GraphBolt
, a recent work [171, 172] licensed under the MIT License
[171], which offers a generalized incremental programming model enabling the development of incremental versions of complex aggregations. Both were written in C++.
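The selection-based setting can be sketched as follows (hypothetical names and structure, not KickStarter's API): with SSSP as the monotonic algorithm, each vertex value is selected from one in-neighbour, so deleting an edge requires identifying the values that transitively depended on it, resetting them, and recomputing only those before the next computation.

```python
# Hedged sketch of selection-based incremental correction after an
# edge deletion (illustrative only). Vertex values are minima over
# in-neighbours; deletion conservatively invalidates everything
# reachable from the deletion point.

INF = float("inf")

def fixed_point(n, in_edges, dist, active):
    """Re-run selection updates (min over in-neighbours) on `active`."""
    for _ in range(n):
        for v in active:
            for u, w in in_edges.get(v, []):
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w

def sssp(n, in_edges, source):
    dist = [INF] * n
    dist[source] = 0.0
    fixed_point(n, in_edges, dist, range(n))
    return dist

def delete_edge(n, in_edges, dist, source, u, v, w):
    in_edges[v].remove((u, w))
    # Conservatively mark v and everything reachable from v as impacted.
    out = {}
    for y, lst in in_edges.items():
        for x, _ in lst:
            out.setdefault(x, []).append(y)
    impacted, stack = {v}, [v]
    while stack:
        for y in out.get(stack.pop(), []):
            if y not in impacted:
                impacted.add(y)
                stack.append(y)
    # Reset impacted values, then recompute only those.
    for x in impacted:
        dist[x] = 0.0 if x == source else INF
    fixed_point(n, in_edges, dist, impacted)
    return dist

in_edges = {1: [(0, 1.0)], 2: [(1, 1.0), (0, 5.0)]}
dist = sssp(3, in_edges, 0)                       # [0.0, 1.0, 2.0]
dist = delete_edge(3, in_edges, dist, 0, 1, 2, 1.0)
print(dist)                                       # [0.0, 1.0, 5.0]
```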
Pixie chooses in real time the pins that are most related to the query, out of billions of candidates. With this system, Pinterest was able to execute a custom algorithm (Pixie Random Walk) on an object graph of 3 billion vertices and 17 billion edges. On a single server, they were able to serve around 1200 recommendation requests per second with 60 ms latency. The authors note that the deployment of Pixie
benefited from large-RAM machines, using a cluster of Amazon AWS r3.8xlarge machines with 244 GB RAM. They fitted the pruned Pinterest graph (3 billion vertices, 17 billion edges) in around 120 GB of RAM, a setup that yielded the following advantages: the random walk is not forced to cross machines, which increases performance; multiple walks can be executed on the graph in parallel; and the system can be scaled by adding more machines to the cluster. This system is a relevant case study of applying graph theory to recommendation systems at scale: it runs a biased random walk algorithm (with user-specific preferences) on large graphs, while using graph pruning to disregard large boards that are too diverse and would diffuse the random walk (the non-pruned graph has 7 billion vertices and 100 billion edges). We did not find the source code available online.

GPS is an open-source (BSD License
) scalable graph processing system written in Java
and allowing fault-tolerant and easy-to-program algorithm execution on very large graphs. It adopts Pregel
’s vertex-centric API and extends it with: features to make global computations easier to express and more efficient; dynamic repartitioning scheme to reassign vertices to different workers during computation based on messaging patterns; distribution of high-degree vertex adjacency lists across all computer nodes to improve performance (something that PowerGraph
and PowerLyra
later adopted). It was designed to run on a cluster of machines such as Amazon EC2, over which the authors tested their work. GPS
’s initial version was run on up to 100 Amazon EC2 large instances and on graphs of up to 250 million vertices and 10 billion edges. It is open-source and available under the BSD License
[175].

GoFFish is a framework written in Java. GoFFish states that two sub-graphs may not share the same vertex, but they can have remote edges that connect their vertices, provided that the sub-graphs are on different partitions. If two sub-graphs in the same partition share an edge, they are by definition merged into a single sub-graph. It was evaluated on a cluster of 12 nodes, each with 8-core Intel Xeon CPUs, 16 GB RAM and 1 TB SATA HDD. The authors compare the execution of GoFFish
states that two sub-graphs many not share the same vertex, but they can have remote edges that connect their vertices, provided that the sub-graphs are on different partitions. If two sub-graphs in the same partition share an edge, by definition they are merged into a single-sub-graph. It was evaluated with a cluster of 12 nodes each with 8-core Intel Xeon CPUs, 16 GB RAM and 1 TB SATA HDD. The authors compare the execution of GoFFish
(one worker per node) with Giraph
(default two workers per node), achieving faster execution times for algorithms such as PageRank, connected components and single-source shortest-paths. While its source code is available [177], we did not find information regarding licensing.

FBSGraph
relies on the observation that state can be propagated faster by processing vertices sequentially along the graph path in each round. They achieve greater execution speed when analysing several graph algorithms across a set of datasets, comparing against systems such as PowerGraph
and GraphLab
. We did not find its source available.

GrapH is a system written in Java
that uses vertex-cut graph partitioning, taking into consideration the diversity of vertex traffic and the heterogeneous costs of the network. It relies on a strategy of adaptive edge migration to reduce the frequency of communication across expensive network links. For this work, the authors focused on vertex-cut as it has better partitioning properties for real-world graphs that have power-law degree distributions. GrapH
has two partitioning techniques, H-load which is used for the initial partitioning of the graph so that each cluster worker can load it into local memory, and H-adapt, a distributed edge migration algorithm to address the dynamic heterogeneity-aware partitioning problem. In evaluation, GrapH
outperformed PowerGraph
’s vertex-cut partitioning algorithm with respect to communication costs. While its source code is available [180], we found no information on licensing.

Julienne extends Ligra
(C++
) and provides an interface to maintain a collection of buckets under vertex insertions and bucket deletions. It was evaluated on bucketing algorithms such as weighted breadth-first search, k-core and approximate set cover. The authors describe as bucketing-based those algorithms that maintain vertices in a set of unordered buckets; in each round, the algorithm extracts the vertices contained in the lowest (or highest) bucket to perform computation on them, and it can then update the buckets containing the extracted vertices or their neighbours. For example, weighted breadth-first search processes vertices level by level, where level i contains all vertices at distance i from the source vertex. The system was tested on a multi-core machine with 72 cores (4 CPUs at 2.4 GHz) and 1 TB of main memory, achieving performance improvements on several data sets when compared to systems such as Galois
and base Ligra. We did not find its source code available.

GraphD is a distributed out-of-core system inspired by Pregel
and targeting efficient big graph processing using a small cluster of commodity machines connected by Gigabit Ethernet, contrasting with other out-of-core works that focus on specialized hardware. The authors focus on a setting where vertex-centric programs are data-intensive, as the CPU cost of computing a message is small when compared to the network transmission cost. GraphD
masks disk I/O overhead with message transmission through parallelism of computation and communication. It eliminates the need for expensive external-memory join or group-by operations, which are required in other systems such as Chaos
. It was evaluated on PageRank, single-source shortest-paths and connected components. GraphD
was evaluated against distributed out-of-core systems Pregelix
, HaLoop
and Chaos
, against single-machine systems GraphChi
and X-Stream
and representative in-memory systems Pregel
and Giraph
, achieving better performance in some scenarios. We did not find its source available.

TurboGraph++
has the goal of balancing the workloads across machines, which requires balancing the number of edges and the number of high-degree and low-degree vertices among machines. It also focuses on balancing the number of vertices on each machine so that each one requires the same memory budget. We did not find its source code available online.

GraphIn
proposes an adaptation of gather-apply-scatter (GAS) called I-GAS which enables the implementation of incremental graph processing algorithms across multiple CPU cores. It also introduces an optimization heuristic to choose between static or dynamic execution based on built-in and user-defined graph properties. Native and benchmarking code were implemented in C++
and for experimental evaluation it was compared to GraphMat
and STINGER
. The heuristic-based computation made GraphIn
faster than systems using fixed strategies. We did not find its source code available.

MapGraph is a GPU-based framework: users write C++
code to use the framework. The early MapGraph
work was first available as an open-source project [186] licensed under the Apache License 2.0
, but it has been integrated in the former line of products of Blazegraph
, also available online [187].

CuSha is a GPU graph processing framework written in C++
which was motivated by the negative impact that irregular memory accesses have on the compressed sparse row graph (CSR) representation. CuSha
overcomes this by: 1) organizing the graph into autonomous sets of ordered edges called shards (a representation they call G-Shards) unto which GPU hardware resources are mapped for fully coalesced memory accesses; 2) accounting for input graph properties such as sparsity (the sparser the graph, the smaller the computation windows) to avoid GPU under-utilization (Concatenated Windows, or CW). This framework allows users to define vertex-centric algorithms to process large graphs on GPU. It is open-source [189] and available under the MIT License
.

Gunrock is an open-source (Apache License 2.0
) CUDA library for graph processing targeting the GPU and written in C
. It implements a data-centric abstraction focused on operations on a vertex or edge frontier. For different graph algorithms, it achieved at least an order of magnitude speedup over PowerGraph
and better performance than any other high-level GPU graph library at the time. Its operations are bulk-synchronous and manipulate a frontier, which is a subset of the edges or vertices within the graph that is relevant at a given moment in the computation. Gunrock
couples high-performance GPU computing primitives and optimization strategies with a high-level programming model to quickly develop new graph primitives. It was evaluated using breadth-first search, depth-first search, single-source shortest paths, connected components and PageRank.

Lux is a multi-GPU system written in C++
for fast graph processing by exploiting aggregate memory bandwidth of multiple GPUs and the locality of the memory hierarchy of multi-GPU clusters. It proposes a dynamic graph repartitioning strategy to enable well-balanced distribution of workload with minimal overhead (improving performance by up to 50%), as well as a performance model providing insight on how to choose the optimal number of nodes and GPUs to optimize performance. Lux
is aimed at graph programs that can be written with iterative computations. Vertex properties are read-only in each iteration, with updates becoming visible at the end of an iteration. It offers two execution models: pull, which optimizes run-time performance on GPUs (enabling optimizations like caching and locally aggregating updates in GPU shared memory); and push, which optimizes algorithmic efficiency (maintaining a frontier queue and only performing computation over the out-edges of vertices in the frontier). Its source code is available [194] under the Apache License 2.0.
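The pull/push distinction can be illustrated with a toy reachability computation (plain Python; illustrative of the two models, not of Lux's CUDA implementation): pull has every vertex read its in-neighbours' values from the previous iteration, while push only performs work on the out-edges of a frontier.

```python
# Toy contrast of pull and push execution models (illustrative only):
# both compute which vertices are reachable from a source, with
# updates only becoming visible at the end of each iteration.

def pull_reach(n, in_nbrs, source):
    reached = [v == source for v in range(n)]
    changed = True
    while changed:
        # Pull: every vertex reads its in-neighbours' (old) values.
        new = [reached[v] or any(reached[u] for u in in_nbrs.get(v, []))
               for v in range(n)]
        changed = new != reached
        reached = new
    return reached

def push_reach(n, out_nbrs, source):
    reached = [v == source for v in range(n)]
    frontier = [source]
    while frontier:
        # Push: only the out-edges of frontier vertices do work.
        nxt = []
        for u in frontier:
            for v in out_nbrs.get(u, []):
                if not reached[v]:
                    reached[v] = True
                    nxt.append(v)
        frontier = nxt
    return reached

out_nbrs = {0: [1], 1: [2], 3: [0]}
in_nbrs = {1: [0], 2: [1], 0: [3]}
assert pull_reach(4, in_nbrs, 0) == push_reach(4, out_nbrs, 0) \
    == [True, True, True, False]
```

Pull performs regular, easily cached reads over all vertices; push touches fewer edges but must maintain an irregular frontier queue, which is the trade-off the paragraph above describes.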
Frog is a GPU-oriented graph processing system written in C
. The authors note that common colouring algorithms may suffer from low parallelism due to a large number of colours being needed to process large graphs with billions of vertices. Frog
separates vertex processing based on colour distribution. They propose an efficient hybrid graph colouring algorithm, relying on a relaxed pre-partition method to solve vertex classification with a lower number of colours, without forcing all adjacent vertices to be assigned different colours. The execution engine of Frog
scans the graph by colour, and all vertices under the same colour are updated in parallel on the GPU. For large graphs, when processing each partition, the data transfers are overlapped with GPU kernel function executions, minimizing PCIe data transfer overhead. It is open-source [196] and licensed under the GNU General Public License 2.0.
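The colour-based scheduling idea can be sketched as follows (a plain greedy colouring for illustration; Frog's relaxed hybrid colouring and GPU engine are not reproduced here): since no two adjacent vertices share a colour, all vertices of one colour form an independent set and can be updated in parallel without read-write conflicts on their neighbours.

```python
# Sketch of colour-based scheduling (illustrative only): greedy
# colouring guarantees adjacent vertices get different colours, so
# each colour class is an independent set, safe for parallel updates.

def greedy_colouring(n, nbrs):
    colour = [None] * n
    for v in range(n):
        used = {colour[u] for u in nbrs.get(v, []) if colour[u] is not None}
        c = 0
        while c in used:
            c += 1
        colour[v] = c
    return colour

def colour_batches(n, nbrs):
    colour = greedy_colouring(n, nbrs)
    batches = {}
    for v, c in enumerate(colour):
        batches.setdefault(c, []).append(v)
    # Each batch is an independent set: process its vertices in parallel.
    return [batches[c] for c in sorted(batches)]

nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}  # triangle plus a tail
print(colour_batches(4, nbrs))  # [[0, 3], [1], [2]]
```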
Aspen extends the Ligra
interface, supporting graph updates. To support this, the authors developed the C-tree data structure, which achieves good cache locality, lowers space usage and has theoretically efficient operations. It applies a chunking scheme over the tree, storing multiple elements per tree node: the scheme takes the ordered set of elements to be represented, promotes some elements to tree nodes, and associates the remaining ones with the tails of the tree nodes. It employs compression and supports parallelism. The authors evaluate it on the largest publicly-available graph, with more than two hundred billion edges, on a multi-core server with 1 TB of memory. Source code is available online [198], albeit no license information was provided.

With Gluon
, programmers implement applications in shared-memory programming systems of their choice and then interface the applications with Gluon
to enable execution on heterogeneous clusters. Gluon
optimizes communication by taking advantage of temporal and structural invariants of graph partitioning policies. It runs on shared-memory NUMA platforms and NVIDIA GPUs. Its programming model offers a small number of programming patterns implemented in C++
; its library offers concurrent data structures, schedulers and memory allocators; and its runtime executes programs in parallel, using parallelization strategies such as optimistic and round-based execution. Gluon
is available [200] under the 3-Clause BSD License.
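The communication pattern such systems optimize can be sketched as follows (hypothetical structure and names; Gluon additionally exploits structural and temporal invariants of the partitioning policy to elide messages): under vertex-cut partitioning a vertex has copies on several hosts, so after local computation the copies' values are reduced into one combined value and broadcast back to all copies.

```python
# Hedged sketch of copy synchronization under vertex-cut partitioning
# (illustrative only, not Gluon's API): reduce all local values of a
# vertex into one combined value, then broadcast it back.

def sync(local_values, reduce_op=min):
    """local_values[host][v] -> locally computed value of v on host."""
    # Reduce: combine every copy's value into one per vertex.
    reduced = {}
    for host, vals in local_values.items():
        for v, x in vals.items():
            reduced[v] = x if v not in reduced else reduce_op(reduced[v], x)
    # Broadcast: every host receives the combined value for its copies.
    return {host: {v: reduced[v] for v in vals}
            for host, vals in local_values.items()}

local = {"host0": {"a": 3, "b": 7}, "host1": {"a": 5}}
print(sync(local))  # {'host0': {'a': 3, 'b': 7}, 'host1': {'a': 3}}
```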
Hornet
is available [202] under the 3-Clause BSD License
.

faimGraph
was benchmarked against Hornet
on an NVIDIA GeForce GTX Titan Xp GPU using algorithms such as PageRank and triangle counting. Source code is available online [204] without a specified license.

GraphCage
applies a cache-centric optimization scheme to both the push and pull directions and coordinates it with load-balancing strategies by considering the sparsity of sub-graphs. This technique is applied to traversal-based algorithms by considering the benefit and overhead in different iterations with working sets of different sizes. In its evaluation, GraphCage
achieved on average lower execution times for one PageRank iteration compared to both Gunrock
and CuSha
. We did not find its source code available.

FlashGraph is a graph processing engine written in C++
over a user-space SSD file system designed for high IOPS and very high levels of parallelism. Vertex state is stored in memory while edge lists are kept on SSDs. Latency is hidden by overlapping computation with I/O, a concept similar to X-Stream
and Chaos
, and edge lists are only accessed from SSDs when requested by applications. FlashGraph
has a vertex-centric (TLAV) interface and is designed to reduce CPU overhead and increase throughput by conservatively merging I/O requests. The authors demonstrate that FlashGraph
in semi-external memory executes many algorithms at up to 80% of the performance of the in-memory implementation, while also outperforming PowerGraph
. It is open-source [208] under the Apache License 2.0.
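Conservative I/O request merging can be illustrated with a toy sketch (illustrative of the idea, not FlashGraph's implementation): sorting outstanding page requests and coalescing adjacent or overlapping ranges issues fewer, larger reads to the SSD.

```python
# Toy version of conservative I/O request merging (illustrative only):
# adjacent or overlapping page ranges are coalesced so fewer, larger
# requests reach the storage device.

def merge_requests(requests, max_gap=0):
    """requests: list of (start_page, n_pages); returns merged ranges."""
    merged = []
    for start, n in sorted(requests):
        if merged and start <= merged[-1][1] + max_gap:
            # Extend the previous request instead of issuing a new one.
            merged[-1][1] = max(merged[-1][1], start + n)
        else:
            merged.append([start, start + n])
    return [(s, e - s) for s, e in merged]

reqs = [(10, 2), (4, 2), (12, 3), (6, 1)]
print(merge_requests(reqs))  # [(4, 3), (10, 5)]
```

Raising `max_gap` trades a little wasted bandwidth (reading pages nobody asked for) for fewer requests, which is the kind of conservative merging decision the text refers to.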
GraphSSD
innovates by considering a vertex-to-page mapping scheme and uses advanced knowledge of flash properties to reduce page accesses. It offers a simple API to ease the development of applications accessing graphs as native data, and its evaluation showed average performance gains for basic graph data fetch functions on breadth-first search, connected components, random walk, maximal independent set and PageRank. We did not find its source available.

The last group (PBGL, CombBLAS and HavoqGT) contains systems which use multiple machines for computation, but not in the typical cluster scenario: they are characterized by the use of specific machines for high-performance computing.

System | Multi-core | GPU | Cluster | Languages | License | Notes |
---|---|---|---|---|---|---|
\(\cdot \) | \(\cdot \) | C++ | AL 2.0 | N/A | ||
GRACE [132] | \(\cdot \) | \(\cdot \) | C++ | Unavailable | N/A | |
\(\cdot \) | C++ | MIT | N/A | |||
\(\cdot \) | C++ , Python | BSD | N/A | |||
\(\cdot \) | C++ | AL 2.0 | N/A | |||
\(\cdot \) | C++ | Custom | N/A | |||
\(\cdot \) | C++ | MIT | Fast storage | |||
\(\cdot \) | C++ | Custom | Hardware | |||
\(\cdot \) | C++ | Custom | Hardware | |||
\(\cdot \) | C++ | GNU LGPL 2.1 | Hardware | |||
\(\cdot \) | \(\cdot \) | Java | AL 2.0 | N/A | ||
\(\cdot \) | C# | AL 2.0 | N/A | |||
\(\cdot \) | \(\cdot \) | Java , Python , Scala | AL 2.0 | N/A | ||
\(\cdot \) | \(\cdot \) | Java , Python , Scala | AL 2.0 | N/A | ||
GraphTau [157] | \(\cdot \) | \(\cdot \) | Java , Scala | Unavailable | N/A | |
\(\cdot \) | \(\cdot \) | Java , Scala | AL 2.0 | N/A | ||
\(\cdot \) | C++ | AL 2.0 | N/A | |||
\(\cdot \) | \(\cdot \) | C++ | AL 2.0 | N/A | ||
\(\cdot \) | \(\cdot \) | C++ | AL 2.0 | N/A | ||
\(\cdot \) | \(\cdot \) | Unknown | Unavailable | N/A | ||
Tornado [169] | \(\cdot \) | \(\cdot \) | Unknown | Unavailable | N/A | |
KickStarter [170] | \(\cdot \) | \(\cdot \) | C++ | MIT | N/A | |
Pixie [173] | \(\cdot \) | \(\cdot \) | Unknown | Unavailable | N/A | |
FlowGraph [174] | \(\cdot \) | \(\cdot \) | Unknown | Unavailable | N/A | |
\(\cdot \) | \(\cdot \) | Java | BSD | N/A | ||
\(\cdot \) | \(\cdot \) | Java | Unknown | Copyright | ||
FBSGraph [178] | \(\cdot \) | \(\cdot \) | Unknown | Unavailable | N/A | |
\(\cdot \) | \(\cdot \) | Java | Unknown | Copyright | ||
Julienne [181] | \(\cdot \) | C++ | Unavailable | N/A | ||
GraphD [182] | \(\cdot \) | \(\cdot \) | Unknown | Unavailable | N/A | |
TurboGraph++ [183] | \(\cdot \) | \(\cdot \) | Unknown | Unavailable | N/A | |
GraphIn [184] | \(\cdot \) | C++ | Unavailable | N/A | ||
\(\cdot \) | C++ | AL 2.0 | Discontinued | |||
\(\cdot \) | C++ | MIT | N/A | |||
\(\cdot \) | C | AL 2.0 | N/A | |||
\(\cdot \) | \(\cdot \) | \(\cdot \) | C++ | AL 2.0 | N/A | |
\(\cdot \) | C | GPL 2.0 | N/A | |||
\(\cdot \) | \(\cdot \) | C++ | 3C BSD | N/A | ||
GraphCage [205] | \(\cdot \) | Unknown | Unavailable | N/A | ||
C++ | AL 2.0 | SSDs | ||||
GraphSSD [209] | Unknown | Unavailable | SSDs |