research-article

PGX.D: a fast distributed graph processing engine

Authors:
Sungpack Hong

Oracle Labs

Oracle Labs
View Profile

,
Siegfried Depner

Oracle Labs

Oracle Labs
View Profile

,
Thomas Manhardt

Oracle Labs

Oracle Labs
View Profile

,
Jan Van Der Lugt

Google

Google
View Profile

,
Merijn Verstraaten

Univ. Amsterdam, Netherlands

Univ. Amsterdam, Netherlands
View Profile

,
Hassan Chafi

Oracle Labs

Oracle Labs
View Profile

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and AnalysisNovember 2015Article No.: 58Pages 1–12https://doi.org/10.1145/2807591.2807620

Published:15 November 2015Publication History

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Pages 1–12

ABSTRACT

Graph analysis is a powerful method in data analysis. Although several frameworks have been proposed for processing large graph instances in distributed environments, their performance is much lower than using efficient single-machine implementations provided with enough memory. In this paper, we present a fast distributed graph processing system, namely PGX.D. We show that PGX.D outperforms other distributed graph systems like GraphLab significantly (3x -- 90x). Furthermore, PGX.D on 4 to 16 machines is also faster than an implementation optimized for single-machine execution. Using a fast cooperative context-switching mechanism, we implement PGX.D as a low-overhead, bandwidth-efficient communication framework that supports remote data-pulling patterns. Moreover, PGX.D achieves large traffic reduction and good workload balance by applying selective ghost nodes, edge partitioning, and edge chunking transparently to the user. Our analysis confirms that each of these features is indeed crucial for overall performance of certain kinds of graph algorithms. Finally, we advocate the use of balanced beefy clusters where the sustained random DRAM-access bandwidth in aggregate is matched with the bandwidth of the underlying interconnection fabric.

References

Apache Giraph Project. http://giraph.apache.org.Google Scholar
Koblenz Network Collection. http://konect.uni-koblenz.de.Google Scholar
Neo4j graph database. http://www.neo4j.org/.Google Scholar
NetworkX. https://networkx.github.io.Google Scholar
SNAP. http://snap.stanford.edu/data/.Google Scholar
Yahoo! Labs Datasets. http://webscope.sandbox.yahoo.com/.Google Scholar
Atul Adya, Jon Howell, Marvin Theimer, William J Bolosky, and John R Douceur. Cooperative task management without manual stack management. In USENIX Annual Technical Conference (ATEC), pages 289--302, 2002. Google ScholarDigital Library
Sutanay Choudhury, Lawrence Holder, George Chin, Khushbu Agarwal, and John Feo. A selectivity based approach to continuous pattern detection in streaming graphs. arXiv preprint arXiv:1503.00849, 2015.Google Scholar
David Ediger, Robert McColl, Jason Riedy, and David A Bader. Stinger: High performance data structure for streaming graphs. In High Performance Extreme Computing (HPEC), pages 1--5, 2012.Google ScholarCross Ref
Jing Fan, Adalbert Gerald Soosai Raj, and Jinnesh M. Patel. A case against specialized graph analytics engines. In 7th Biennial Conference on Innovative Data Systems Research (CIDR), 2015.Google Scholar
Joseph E Gonzalez, Reynold S Xin, Ankur Dave, Daniel Crankshaw, Michael J Franklin, and Ion Stoica. Graphx: Graph processing in a distributed dataflow framework. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014. Google ScholarDigital Library
Sairam Gurajada, Stephan Seufert, Iris Miliaraki, and Martin Theobald. Triad: a distributed shared-nothing rdf engine based on asynchronous message passing. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 289--300. ACM, 2014. Google ScholarDigital Library
Apache Hadoop. http://hadoop.apache.org/.Google Scholar
Wook-Shin Han, Jinsoo Lee, and Jeong-Hoon Lee. Turbo iso: towards ultrafast and robust subgraph isomorphism search in large graph databases. In Proc. of the 2013 ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
Sungpack Hong, Hassan Chafi, Edic Sedlar, and Kunle Olukotun. Green-Marl: A DSL for Easy and Efficient Graph Analysis. In ASPLOS. ACM, 2012. Google ScholarDigital Library
Sungpack Hong, Semih Salihoglu, Jennifer Widom, and Kunle Olukotun. Simplifying scalable graph processing with a domain-specific language. In IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 208--218, 2014. Google ScholarDigital Library
Jin Huang, Rui Zhang, and Jeffrey Xu Yu. Technical report: Hyperx a framework for scalable hypergraph learning. 2015.Google Scholar
U Kang, Charalampos E Tsourakakis, and Christos Faloutsos. Pegasus: A peta-scale graph mining system implementation and observations. In IEEE International Conference on Data Mining (ICDM), pages 229--238, 2009. Google ScholarDigital Library
Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twitter, a social network or a news media? In WWW '10: Proceedings of the 19th international conference on World wide web, pages 591--600. ACM, 2010. Google ScholarDigital Library
Willis Lang, Stavros Harizopoulos, Jignesh M Patel, Mehul A Shah, and Dimitris Tsirogiannis. Towards energy-efficient database cluster design. Proceedings of the VLDB Endowment, 5(11):1684--1695, 2012. Google ScholarDigital Library
Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M Hellerstein. Distributed graphlab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716--727, 2012. Google ScholarDigital Library
Grzegorz Malewicz, Matthew H. Austern, Aart J. C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A System for Large-scale Graph Processing. In SIGMOD '10, pages 135--146. ACM. Google ScholarDigital Library
Robert Campbell McColl, David Ediger, Jason Poovey, Dan Campbell, and David A Bader. A performance evaluation of open source graph databases. In Proceedings of the first workshop on PPAA. ACM, 2014. Google ScholarDigital Library
Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. Grappa: A latency-tolerant runtime for large-scale irregular applications. Technical report, Technical report, University of Washington, 2014. URL http://sampa. cs. washington. edu/papers/grappa-tr-2014-02. pdf. 4.1, 2014.Google Scholar
Donald Nguyen, Andrew Lenharth, and Keshav Pingali. A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 456--471. ACM, 2013. Google ScholarDigital Library
Roger Pearce, Maya Gokhale, and Nancy M Amato. Faster parallel traversal of scale free graphs at extreme scale with vertex delegates. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 549--559, 2014. Google ScholarDigital Library
Raghavan Raman, Oskar van Rest, Sungpack Hong, Zhe Wu, Hassan Chafi, and Jay Banerjee. Pgx. iso: Parallel and efficient in-memory engine for subgraph isomorphism. In Proceedings of Workshop on GRAph Data management Experiences and Systems. Google ScholarDigital Library
Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, M. Amber Hassaan, Shubho Sengupta, Zhaoming Yin, and Pradeep Dubey. Navigating the maze of graph analytics frameworks using massive graph datasets. In ACM SIGMOD International Conference on Management of Data, pages 979--990, 2014. Google ScholarDigital Library
Jiwon Seo, Jongsoo Park, Jaeho Shin, and Monica S. Lam. Distributed socialite: A datalog-based language for large-scale graph analysis. Proc. VLDB Endow., 6(14):1906--1917, September 2013. Google ScholarDigital Library
Adam Welc, Raghavan Raman, Zhe Wu, Sungpack Hong, Hassan Chafi, and Jay Banerjee. Graph analysis: do we have to reinvent the wheel? In First International Workshop on Graph Data Management Experiences and Systems, page 7. ACM, 2013. Google ScholarDigital Library
Jeremiah James Willcock, Torsten Hoefler, Nicholas Gerard Edmonds, and Andrew Lumsdaine. Active pebbles: A programming model for highly parallel fine-grained data-driven computations. In ACM SIGPLAN Notices, volume 46, pages 305--306. ACM, 2011. Google ScholarDigital Library
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud'10, pages 10--10, Berkeley, CA, USA, 2010. USENIX Association. Google ScholarDigital Library
Kai Zeng, Jiacheng Yang, Haixun Wang, Bin Shao, and Zhongyuan Wang. A distributed graph engine for web scale rdf data. Proceedings of the VLDB Endowment, 6(4):265--276, 2013. Google ScholarDigital Library

Index Terms

PGX.D: a fast distributed graph processing engine

Recommendations

PGX.D/Async: A Scalable Distributed Graph Pattern Matching Engine
GRADES'17: Proceedings of the Fifth International Workshop on Graph Data-management Experiences & Systems

Graph querying and pattern matching is becoming an important feature of graph processing as it allows data analysts to easily collect and understand information about their graphs in a way similar to SQL for databases. One of the key challenges in graph ...
Read More
On the Multichromatic Number of s-Stable Kneser Graphs

For positive integers n and s, a subset Sï [n] is s-stable if sï |i-j|ï n-s for distinct i,j∈S . The s-stable r-uniform Kneser hypergraph KGrn,ks-stable is the r-uniform hypergraph that has the collection of all s-stable k-element subsets of [n] as ...
Read More
Adjacent vertex-distinguishing edge and total chromatic numbers of hypercubes

An adjacent vertex-distinguishing edge coloring of a simple graph G is a proper edge coloring of G such that incident edge sets of any two adjacent vertices are assigned different sets of colors. A total coloring of a graph G is a coloring of both the ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015
985 pages
ISBN:9781450337236
DOI:10.1145/2807591
General Chair:
Jackie Kern
University of Illinois at Urbana-Champaign, Urbana, Illinois
,
Program Chair:
Jeffrey S. Vetter
Oak Ridge National Laboratory and Georgia Institute of Technology, Oak Ridge, Tennessee
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 15 November 2015
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- research-article
Conference

Acceptance Rates
SC '15 Paper Acceptance Rate79of358submissions,22%Overall Acceptance Rate1,516of6,373submissions,24%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 74
  Total Citations
  View Citations
- 938
  Total Downloads
- Downloads (Last 12 months)40
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

PGX.D: a fast distributed graph processing engine

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

ABSTRACT

References

Cited By

Index Terms

Recommendations

PGX.D/Async: A Scalable Distributed Graph Pattern Matching Engine

On the Multichromatic Number of s-Stable Kneser Graphs

Adjacent vertex-distinguishing edge and total chromatic numbers of hypercubes

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

PGX.D: a fast distributed graph processing engine

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

ABSTRACT

References

Cited By

Index Terms

Recommendations

PGX.D/Async: A Scalable Distributed Graph Pattern Matching Engine

On the Multichromatic Number of s-Stable Kneser Graphs

Adjacent vertex-distinguishing edge and total chromatic numbers of hypercubes

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media