research-article

Scalable GPU graph traversal

Authors:
Duane Merrill

Unversity of Virginia, Charlottesville, VA, USA

Unversity of Virginia, Charlottesville, VA, USA
View Profile

,
Michael Garland

NVIDIA Corporation, Santa Clara, CA, USA

NVIDIA Corporation, Santa Clara, CA, USA
View Profile

,
Andrew Grimshaw

University of Virginia, Charlottesville, VA, USA

University of Virginia, Charlottesville, VA, USA
View Profile

Authors Info & Claims

ACM SIGPLAN Notices Volume 47 Issue 8August 2012pp 117–128https://doi.org/10.1145/2370036.2145832

Published:25 February 2012Publication History

ACM SIGPLAN Notices

Abstract

Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter.

We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(|V|+|E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations both CPU and GPU platforms.

References

10th DIMACS Implementation Challenge: http://www.cc.gatech.edu/dimacs10/index.shtml. Accessed: 2011-07-11.Google Scholar
9th DIMACS Implementation Challenge: http://www.dis.uniroma1.it/~challenge9/download.shtml. Accessed: 2011-07-11.Google Scholar
Agarwal, V. et al. 2010. Scalable Graph Exploration on Multicore Processors. 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (New Orleans, LA, USA, Nov. 2010), 1--11. Google ScholarDigital Library
Bader, D.A. and Madduri, K. Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2. 2006 International Conference on Parallel Processing (ICPP'06) (Columbus, OH, USA), 523--530. Google ScholarDigital Library
Bader, D.A. et al. On the Architectural Requirements for Efficient Execution of Graph Algorithms. 2005 International Conference on Parallel Processing (ICPP'05) (Oslo, Norway), 547--556. Google ScholarDigital Library
Bell, N. and Garland, M. 2009. Implementing sparse matrix-vector multiplication on throughput-oriented processors. Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (New York, NY, USA, 2009), 18:1--18:11. Google ScholarDigital Library
Blelloch, G.E. 1990. Prefix Sums and Their Applications. Synthesis of Parallel Algorithms.Google Scholar
Blelloch, G.E. 1989. Scans as primitive parallel operations. IEEE Transactions on Computers. 38, 11 (Nov. 1989), 1526--1538. Google ScholarDigital Library
Chatterjee, S. et al. 1990. Scan primitives for vector computers. Proceedings of the 1990 ACM/IEEE conference on Supercomputing (Los Alamitos, CA, USA, 1990), 666--675. Google ScholarDigital Library
Che, S. et al. 2009. Rodinia: A benchmark suite for heterogeneous computing. 2009 IEEE International Symposium on Workload Characterization (IISWC) (Austin, TX, USA, Oct. 2009), 44--54. Google ScholarDigital Library
Cormen, T.H. et al. 2001. Introduction to Algorithms. MIT Press. Google ScholarDigital Library
Deng, Y. (Steve) et al. 2009. Taming irregular EDA applications on GPUs. Proceedings of the 2009 International Conference on Computer-Aided Design (New York, NY, USA, 2009), 539--546. Google ScholarDigital Library
Dotsenko, Y. et al. 2008. Fast scan algorithms on graphics processors. Proceedings of the 22nd annual international conference on Supercomputing (New York, NY, USA, 2008), 205--213. Google ScholarDigital Library
Garland, M. 2008. Sparse matrix computations on manycore GPU's. Proceedings of the 45th annual Design Automation Conference (New York, NY, USA, 2008), 2--6. Google ScholarDigital Library
GTgraph: A suite of synthetic random graph generators: https://sdm.lbl.gov/~kamesh/software/GTgraph/. Accessed: 2011-07-11.Google Scholar
Harish, P. and Narayanan, P.J. 2007. Accelerating large graph algorithms on the GPU using CUDA. Proceedings of the 14th international conference on High performance computing (Berlin, Heidelberg, 2007), 197--208. Google ScholarDigital Library
Hillis, W.D. and Steele, G.L. 1986. Data parallel algorithms. Communications of the ACM. 29, 12 (Dec. 1986), 1170--1183. Google ScholarDigital Library
Hong, S. et al. 2011. Accelerating CUDA graph algorithms at maximum warp. Proceedings of the 16th ACM symposium on Principles and practice of parallel programming (New York, NY, USA, 2011), 267--276. Google ScholarDigital Library
Hong, S. et al. 2011. Efficient Parallel Graph Exploration for Multi-Core CPU and GPU. (New York, NY, USA, 2011), to appear.Google Scholar
Hussein, M. et al. 2007. On Implementing Graph Cuts on CUDA. First Workshop on General Purpose Processing on Graphics Processing Units (Boston, MA, Oct. 2007).Google Scholar
Leiserson, C.E. and Schardl, T.B. 2010. A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers). Proceedings of the 22nd ACM symposium on Parallelism in algorithms and architectures (New York, NY, USA, 2010), 303--314. Google ScholarDigital Library
Luo, L. et al. 2010. An effective GPU implementation of breadth-first search. Proceedings of the 47th Design Automation Conference (New York, NY, USA, 2010), 52--55. Google ScholarDigital Library
Merrill, D. and Grimshaw, A. 2011. High Performance and Scalable Radix Sorting: A case study of implementing dynamic parallelism for GPU computing. Parallel Processing Letters. 21, 02 (2011), 245--272.Google ScholarCross Ref
Merrill, D. and Grimshaw, A. 2009. Parallel Scan for Stream Architectures. Technical Report #CS2009--14. Department of Computer Science, University of Virginia.Google Scholar
Merrill, D. et al. 2011. High Performance and Scalable GPU Graph Traversal. Technical Report #CS2011-05. Department of Computer Science, University of Virginia.Google Scholar
Parboil Benchmark suite: http://impact.crhc.illinois.edu/parboil.php. Accessed: 2011-07-11.Google Scholar
Scarpazza, D.P. et al. 2008. Efficient Breadth-First Search on the Cell/BE Processor. IEEE Transactions on Parallel and Distributed Systems. 19, 10 (Oct. 2008), 1381--1395. Google ScholarDigital Library
Sengupta, S. et al. 2008. Efficient parallel scan algorithms for GPUs. Technical Report #NVR-2008-003. NVIDIA.Google Scholar
The Graph 500 List: http://www.graph500.org/. Accessed: 2011-07-11.Google Scholar
Ullman, J. and Yannakakis, M. 1990. High-probability parallel transitive closure algorithms. Proceedings of the second annual ACM symposium on Parallel algorithms and architectures - SPAA '90 (Island of Crete, Greece, 1990), 200--209. Google ScholarDigital Library
University of Florida Sparse Matrix Collection: http://www.cise.ufl.edu/research/sparse/matrices/. Accessed: 2011-07-11.Google Scholar
Xia, Y. and Prasanna, V.K. 2009. Topologically Adaptive Parallel Breadth-first Search on Multicore Processors. 21st International Conference on Parallel and Distributed Computing and Systems (PDCS'09) (Nov. 2009).Google Scholar
Yoo, A. et al. A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. ACM/IEEE SC 2005 Conference (SC'05) (Seattle, WA, USA), 25--25. Google ScholarDigital Library

Index Terms

Scalable GPU graph traversal

Recommendations

High-Performance and Scalable GPU Graph Traversal
Special Issue on PPOPP 2012

Breadth-First Search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular ...
Read More
Scalable GPU graph traversal
PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming

Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular ...
Read More
Using the Intel Many Integrated Core to accelerate graph traversal

Data-intensive applications have drawn more and more attention in the last few years. The basic graph traversal algorithm, the breadth-first search (BFS), a typical data-intensive application, is widely used and the Graph 500 benchmark uses it to rank ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM SIGPLAN Notices Volume 47, Issue 8
PPOPP '12
August 2012
334 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2370036
Issue’s Table of Contents
PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
February 2012
352 pages
ISBN:9781450311601
DOI:10.1145/2145816
General Chair:
J. Ramanujam
Louisiana State University, USA
,
Program Chair:
P. Sadayappan
The Ohio State University, USA
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 25 February 2012
Check for updates
Author Tags
GPU
breadth-first search
cooperative allocation
graph algorithms
graph traversal
parallel algorithms
prefix sum
sparse graph
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 453
  Total Citations
  View Citations
- 3,805
  Total Downloads
- Downloads (Last 12 months)166
- Downloads (Last 6 weeks)28
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Scalable GPU graph traversal

ACM SIGPLAN Notices

Abstract

References

Cited By

Index Terms

Recommendations

High-Performance and Scalable GPU Graph Traversal

Scalable GPU graph traversal

Using the Intel Many Integrated Core to accelerate graph traversal