skip to main content
research-article

Scalable GPU graph traversal

Published:25 February 2012Publication History
Skip Abstract Section

Abstract

Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter.

We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(|V|+|E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations both CPU and GPU platforms.

References

  1. 10th DIMACS Implementation Challenge: http://www.cc.gatech.edu/dimacs10/index.shtml. Accessed: 2011-07-11.Google ScholarGoogle Scholar
  2. 9th DIMACS Implementation Challenge: http://www.dis.uniroma1.it/~challenge9/download.shtml. Accessed: 2011-07-11.Google ScholarGoogle Scholar
  3. Agarwal, V. et al. 2010. Scalable Graph Exploration on Multicore Processors. 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (New Orleans, LA, USA, Nov. 2010), 1--11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Bader, D.A. and Madduri, K. Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2. 2006 International Conference on Parallel Processing (ICPP'06) (Columbus, OH, USA), 523--530. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Bader, D.A. et al. On the Architectural Requirements for Efficient Execution of Graph Algorithms. 2005 International Conference on Parallel Processing (ICPP'05) (Oslo, Norway), 547--556. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Bell, N. and Garland, M. 2009. Implementing sparse matrix-vector multiplication on throughput-oriented processors. Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (New York, NY, USA, 2009), 18:1--18:11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Blelloch, G.E. 1990. Prefix Sums and Their Applications. Synthesis of Parallel Algorithms.Google ScholarGoogle Scholar
  8. Blelloch, G.E. 1989. Scans as primitive parallel operations. IEEE Transactions on Computers. 38, 11 (Nov. 1989), 1526--1538. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Chatterjee, S. et al. 1990. Scan primitives for vector computers. Proceedings of the 1990 ACM/IEEE conference on Supercomputing (Los Alamitos, CA, USA, 1990), 666--675. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Che, S. et al. 2009. Rodinia: A benchmark suite for heterogeneous computing. 2009 IEEE International Symposium on Workload Characterization (IISWC) (Austin, TX, USA, Oct. 2009), 44--54. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Cormen, T.H. et al. 2001. Introduction to Algorithms. MIT Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Deng, Y. (Steve) et al. 2009. Taming irregular EDA applications on GPUs. Proceedings of the 2009 International Conference on Computer-Aided Design (New York, NY, USA, 2009), 539--546. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Dotsenko, Y. et al. 2008. Fast scan algorithms on graphics processors. Proceedings of the 22nd annual international conference on Supercomputing (New York, NY, USA, 2008), 205--213. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Garland, M. 2008. Sparse matrix computations on manycore GPU's. Proceedings of the 45th annual Design Automation Conference (New York, NY, USA, 2008), 2--6. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. GTgraph: A suite of synthetic random graph generators: https://sdm.lbl.gov/~kamesh/software/GTgraph/. Accessed: 2011-07-11.Google ScholarGoogle Scholar
  16. Harish, P. and Narayanan, P.J. 2007. Accelerating large graph algorithms on the GPU using CUDA. Proceedings of the 14th international conference on High performance computing (Berlin, Heidelberg, 2007), 197--208. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Hillis, W.D. and Steele, G.L. 1986. Data parallel algorithms. Communications of the ACM. 29, 12 (Dec. 1986), 1170--1183. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Hong, S. et al. 2011. Accelerating CUDA graph algorithms at maximum warp. Proceedings of the 16th ACM symposium on Principles and practice of parallel programming (New York, NY, USA, 2011), 267--276. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Hong, S. et al. 2011. Efficient Parallel Graph Exploration for Multi-Core CPU and GPU. (New York, NY, USA, 2011), to appear.Google ScholarGoogle Scholar
  20. Hussein, M. et al. 2007. On Implementing Graph Cuts on CUDA. First Workshop on General Purpose Processing on Graphics Processing Units (Boston, MA, Oct. 2007).Google ScholarGoogle Scholar
  21. Leiserson, C.E. and Schardl, T.B. 2010. A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers). Proceedings of the 22nd ACM symposium on Parallelism in algorithms and architectures (New York, NY, USA, 2010), 303--314. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Luo, L. et al. 2010. An effective GPU implementation of breadth-first search. Proceedings of the 47th Design Automation Conference (New York, NY, USA, 2010), 52--55. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Merrill, D. and Grimshaw, A. 2011. High Performance and Scalable Radix Sorting: A case study of implementing dynamic parallelism for GPU computing. Parallel Processing Letters. 21, 02 (2011), 245--272.Google ScholarGoogle ScholarCross RefCross Ref
  24. Merrill, D. and Grimshaw, A. 2009. Parallel Scan for Stream Architectures. Technical Report #CS2009--14. Department of Computer Science, University of Virginia.Google ScholarGoogle Scholar
  25. Merrill, D. et al. 2011. High Performance and Scalable GPU Graph Traversal. Technical Report #CS2011-05. Department of Computer Science, University of Virginia.Google ScholarGoogle Scholar
  26. Parboil Benchmark suite: http://impact.crhc.illinois.edu/parboil.php. Accessed: 2011-07-11.Google ScholarGoogle Scholar
  27. Scarpazza, D.P. et al. 2008. Efficient Breadth-First Search on the Cell/BE Processor. IEEE Transactions on Parallel and Distributed Systems. 19, 10 (Oct. 2008), 1381--1395. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Sengupta, S. et al. 2008. Efficient parallel scan algorithms for GPUs. Technical Report #NVR-2008-003. NVIDIA.Google ScholarGoogle Scholar
  29. The Graph 500 List: http://www.graph500.org/. Accessed: 2011-07-11.Google ScholarGoogle Scholar
  30. Ullman, J. and Yannakakis, M. 1990. High-probability parallel transitive closure algorithms. Proceedings of the second annual ACM symposium on Parallel algorithms and architectures - SPAA '90 (Island of Crete, Greece, 1990), 200--209. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. University of Florida Sparse Matrix Collection: http://www.cise.ufl.edu/research/sparse/matrices/. Accessed: 2011-07-11.Google ScholarGoogle Scholar
  32. Xia, Y. and Prasanna, V.K. 2009. Topologically Adaptive Parallel Breadth-first Search on Multicore Processors. 21st International Conference on Parallel and Distributed Computing and Systems (PDCS'09) (Nov. 2009).Google ScholarGoogle Scholar
  33. Yoo, A. et al. A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. ACM/IEEE SC 2005 Conference (SC'05) (Seattle, WA, USA), 25--25. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Scalable GPU graph traversal

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image ACM SIGPLAN Notices
          ACM SIGPLAN Notices  Volume 47, Issue 8
          PPOPP '12
          August 2012
          334 pages
          ISSN:0362-1340
          EISSN:1558-1160
          DOI:10.1145/2370036
          Issue’s Table of Contents
          • cover image ACM Conferences
            PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
            February 2012
            352 pages
            ISBN:9781450311601
            DOI:10.1145/2145816

          Copyright © 2012 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 25 February 2012

          Check for updates

          Qualifiers

          • research-article

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader