Skip to main content
main-content

Über dieses Buch

This book introduces readers to a workload-aware methodology for large-scale graph algorithm optimization in graph-computing systems, and proposes several optimization techniques that can enable these systems to handle advanced graph algorithms efficiently. More concretely, it proposes a workload-aware cost model to guide the development of high-performance algorithms. On the basis of the cost model, the book subsequently presents a system-level optimization resulting in a partition-aware graph-computing engine, PAGE. In addition, it presents three efficient and scalable advanced graph algorithms – the subgraph enumeration, cohesive subgraph detection, and graph extraction algorithms.

This book offers a valuable reference guide for junior researchers, covering the latest advances in large-scale graph analysis; and for senior researchers, sharing state-of-the-art solutions based on advanced graph algorithms. In addition, all readers will find a workload-aware methodology for designing efficient large-scale graph algorithms.

Inhaltsverzeichnis

Frontmatter

Chapter 1. Introduction

Abstract
With the rapid development of Internet of Things (IoT), mobile devices, and social networks, our world has become more connected than ever before, resulting in ubiquitous linked data, more generally, graphs. To discover the knowledge from the connective world, graph analysis is the de facto technique. In the consensus study report (National Research Council, Frontiers in massive data analysis. The National Academies Press, Washington, DC, 2013), National Research Council of US National Academies points out that graph analysis is one of the seven major computational methods of massive data analysis. A wide array of applications such as social network analysis, recommendations, semantic web, bioinformatics, intelligence surveillance, and image processing utilize graph analysis techniques to discover helpful insights. However, unlike decades ago, nowadays graphs are large, sparse, and highly dynamic. The classical methods of graph analysis become inefficient or even infeasible. Large-scale graph analysis becomes a problem that both industry and academia trying to solve. In this chapter, we first introduce the background of large-scale graph analysis and briefly review existing solutions; then we introduce three advanced graph analysis tasks which are popular and fundamental but not yet have efficient solutions on large graphs; third, we summarize the research issues of the large-scale graph analysis, especially for the advanced graph analysis tasks. Finally, we present an overview of this book.
Yingxia Shao, Bin Cui, Lei Chen

Chapter 2. Graph Computing Systems for Large-Scale Graph Analysis

Abstract
Since Google introduced the first distributed graph computing system Pregel, many similar systems are proposed. The distributed graph computing systems become a standard platform for large-scale graph analysis. Compared to the previous graph processing libraries, the new systems have the advantages of scalability, usability, and flexibility. In this chapter, we briefly review the basic concepts of the distributed graph computing systems, including the architecture, execution flow, and programming abstraction and computation models (e.g., vertex-centric, edge-centric, subgraph-centric, etc.). In this book, we concentrate on the vertex-centric computation model and then describe two excellent and popular programming abstractions—vertex programming abstraction and gather–apply–scatter (GAS) programming abstraction. Finally, we introduce the workload-aware cost model which classifies the factors influencing the performance into two types—workload source and workload distribution. The model helps to estimate the workload for a distributed graph computing system and guides us to optimize the systems and algorithms smartly.
Yingxia Shao, Bin Cui, Lei Chen

Chapter 3. Partition-Aware Graph Computing System

Abstract
Graph partition quality affects the overall performance of distributed graph computing systems. The quality of a graph partition is measured by the balance factor and edge cut ratio. A balanced graph partition with small edge cut ratio is generally preferred since it reduces the high network communication cost. However, through an empirical study on Giraph, we find that the performance over well partitioned graph might be even two times worse than simple random partitions. The reason is that the systems only optimize for the simple partition strategies and cannot efficiently handle the increasing workload of local message processing when a high-quality graph partition is used. In this chapter, we introduce a novel partition-aware graph computing system named PAGE, which equips a new message processor and a dynamic concurrency control model. The new message processor concurrently processes local and remote messages in a unified way. The dynamic model adaptively adjusts the concurrency of the processor based on the online statistics. The experimental studies demonstrate the superiority of PAGE over the graph partitions with various qualities.
Yingxia Shao, Bin Cui, Lei Chen

Chapter 4. Efficient Parallel Subgraph Enumeration

Abstract
In this chapter, we introduce a novel parallel subgraph enumeration framework, named PSgL, which is built on top of Pregel-like graph computing systems. The PSgL iteratively enumerates subgraph instances and solves the subgraph enumeration in a divide-and-conquer fashion. The framework completely relies on the graph traversal operation instead of the explicit join operation. To achieve the high efficiency of the framework, we propose several algorithm-specific optimization techniques for balancing the workload and reducing the size of intermediate results. In respect to the workload balance, we theoretically prove the problem of partial subgraph instance distribution is NP-hard, and carefully design heuristic strategies. To reduce the massive intermediate results, we develop three mechanisms, which are automorphism breaking of the pattern graph, initial pattern vertex selection based on a cost model, and a pruning method based on a light-weight index. We implemented the prototype of PSgL, and conducted comprehensive experiments of various graph enumeration operations on real-world large graphs. The experimental results clearly demonstrate that PSgL is robust and can achieve performance gain over the existing considerable solutions up to 90%.
Yingxia Shao, Bin Cui, Lei Chen

Chapter 5. Efficient Parallel Graph Extraction

Abstract
In this chapter, we introduce the homogeneous graph extraction task, which extracts homogeneous graphs from the heterogeneous graphs. In an extracted homogeneous graph, the relation is defined by a line pattern on the heterogeneous graph and the new attribute values of the relation are calculated by user-defined aggregate functions. When facing large-scale heterogeneous graphs, the key challenges of the extraction problem are how to efficiently enumerate paths matched by the line pattern and aggregate values for each pair of vertices from the matched paths. To address the above two challenges, we propose a parallel graph extraction framework. The framework compiles the line pattern into a path concatenation plan, which is selected by a cost model. To guarantee the performance of computing aggregate functions, we first classify the aggregate functions into distributive aggregation, algebraic aggregation, and holistic aggregation; then we speed up the distributive and algebraic aggregations by computing partial aggregate values during the path enumeration. The experimental results demonstrate the effectiveness of the proposed graph extraction.
Yingxia Shao, Bin Cui, Lei Chen

Chapter 6. Efficient Parallel Cohesive Subgraph Detection

Abstract
Community detection is a fundamental graph analytic task. However, due to the high computation complexity, many community detection algorithms cannot handle large graphs. In this chapter, we investigate a special community detection problem, that is, cohesive subgraph detection. Here the target cohesive subgraph is k-truss, which is motivated by a natural observation of social cohesion. We propose a novel parallel and efficient truss detection algorithm, called PeTa. PeTa produces a triangle complete subgraph (TC-subgraph) for every computing node. Based on the TC-subgraphs, it can detect the local k-truss in parallel within a few iterations. We theoretically prove, within this new paradigm, the communication cost of PeTa is bounded by three times of the number of triangles, the total computation complexity of PeTa is the same order as the best known serial algorithm, and the number of iterations for a given partition scheme is minimized as well. Furthermore, we present a subgraph-oriented model to efficiently express PeTa in parallel graph computing systems. The results of comprehensive experiments demonstrate, compared with the existing solutions, PeTa saves 2× to 19× in communication cost, reduces 80% to 95% number of iterations, and improves the overall performance by 80% across various real-world graphs.
Yingxia Shao, Bin Cui, Lei Chen

Chapter 7. Conclusions

Abstract
Large-scale graph analysis is a critical task for big data applications. The distributed graph computing system is a successful paradigm for the large-scale graph analysis. It not only helps analysts achieve high scalability and efficiency, but also enables analysts to focus on the logic of analysis tasks through transparenting the tedious distributed communication protocols. In this book, we chose Pregel-like systems as a basic platform, and studied the deficiency of existing systems.
Yingxia Shao, Bin Cui, Lei Chen
Weitere Informationen

Premium Partner

    Bildnachweise