skip to main content
research-article

A GPU implementation of inclusion-based points-to analysis

Published:25 February 2012Publication History
Skip Abstract Section

Abstract

Graphics Processing Units (GPUs) have emerged as powerful accelerators for many regular algorithms that operate on dense arrays and matrices. In contrast, we know relatively little about using GPUs to accelerate highly irregular algorithms that operate on pointer-based data structures such as graphs. For the most part, research has focused on GPU implementations of graph analysis algorithms that do not modify the structure of the graph, such as algorithms for breadth-first search and strongly-connected components.

In this paper, we describe a high-performance GPU implementation of an important graph algorithm used in compilers such as gcc and LLVM: Andersen-style inclusion-based points-to analysis. This algorithm is challenging to parallelize effectively on GPUs because it makes extensive modifications to the structure of the underlying graph and performs relatively little computation. In spite of this, our program, when executed on a 14 Streaming Multiprocessor GPU, achieves an average speedup of 7x compared to a sequential CPU implementation and outperforms a parallel implementation of the same algorithm running on 16 CPU cores.

Our implementation provides general insights into how to produce high-performance GPU implementations of graph algorithms, and it highlights key differences between optimizing parallel programs for multicore CPUs and for GPUs.

References

  1. NVIDIA's Next Generation CUDA Compute Architecture: Fermi. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf, 2010.Google ScholarGoogle Scholar
  2. CUDA C Programming Guide 4.0. NVIDIA, 2011.Google ScholarGoogle Scholar
  3. L. O. Andersen. Program Analysis and Specialization for the C Programming Language. PhD thesis, DIKU, University of Copenhagen, May 1994. (DIKU report 94/19).Google ScholarGoogle Scholar
  4. David A. Bader and Kamesh Madduri. Designing multithreaded algorithms for breadth-first search and st-connectivity on the cray mta-2. In Proceedings of the 2006 International Conference on Parallel Processing, ICPP '06, pages 523--530, Washington, DC, USA, 2006. IEEE Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. J. Barnat, P. Bauch, L. Brim, and M. Ceska. Computing Strongly Connected Components in Parallel on CUDA. In Proceedings of the 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS'11), pages 541--552. IEEE Computer Society, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Marc Berndl, Ondrej Lhoták, Feng Qian, Laurie Hendren, and Navindra Umanee. Points-to analysis using BDDs. In Proc. Conf. on Programming Language Design and Implementation (PLDI), pages 103--114, New York, NY, USA, 2003. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Ulrik Brandes and Thomas Erlebach, editors. Network Analysis: Methodological Foundations. Springer-Verlag, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Randal E. Bryant. Graph-based algorithms for boolean function manipulation. IEEE Transactions on Computers, 35:677--691, 1986. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Martin Burtscher and Keshav Pingali. An efficient CUDA implementation of the tree-based barnes hut n-body algorithm. In GPU Computing Gems Emerald Edition, pages 75--92. Morgan Kaufmann, 2011.Google ScholarGoogle ScholarCross RefCross Ref
  10. Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, and Kevin Skadron. A performance study of general-purpose applications on graphics processors using cuda. J. Parallel Distrib. Comput., 68:1370--1380, October 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. L. Paul Chew. Guaranteed-quality mesh generation for curved surfaces. In Proc. Symp. on Computational Geometry (SCG), 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Manuel Fahndrich, Jeffrey S. Foster, Zhendong Su, and Alexander Aiken. Partial online cycle elimination in inclusion constraint graphs. In Proc. Conf. on Programming Language Design and Implementation (PLDI), pages 85--96, New York, NY, USA, 1998. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Ben Hardekopf and Calvin Lin. The ant and the grasshopper: fast and accurate pointer analysis for millions of lines of code. In Proc. Conf. on Programming Language Design and Implementation (PLDI), 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Pawan Harish and P. J. Narayanan. Accelerating large graph algorithms on the gpu using cuda. In HiPC'07: Proceedings of the 14th international conference on High performance computing, pages 197--208, Berlin, Heidelberg, 2007. Springer-Verlag. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Nevin Heintze and Olivier Tardieu. Ultra-fast aliasing analysis using cla: a million lines of c code in a second. SIGPLAN Not., 36(5):254--263, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Fritz Henglein. Type inference and semi-unification. In Proceedings of the 1988 ACM conference on LISP and functional programming, LFP '88, pages 184--197, New York, NY, USA, 1988. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Michael Hind. Pointer analysis: haven't we solved this problem yet? In PASTE '01: Proceedings of the 2001 ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering, pages 54--61, New York, NY, USA, 2001. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Sungpack Hong, Sang Kyun Kim, Tayo Oguntebi, and Kunle Olukotun. Accelerating cuda graph algorithms at maximum warp. In Proceedings of the 16th ACM symposium on Principles and practice of parallel programming, PPoPP '11, pages 267--276, New York, NY, USA, 2011. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun. Efficient parallel graph exploration on multi-core cpu and gpu. In 20th International Conference on Parallel Architectures and Compilation Techniques, PACT'11, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Song Huang, Shucai Xiao, and Wu chun Feng. On the energy efficiency of graphics processing units for scientific computing. In IPDPS, pages 1--8, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Milind Kulkarni, Keshav Pingali, Bruce Walter, Ganesh Ramanarayanan, Kavita Bala, and L. Paul Chew. Optimistic parallelism requires abstractions. SIGPLAN Not. (Proceedings of PLDI), 42(6):211--222, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey. Debunking the 100x gpu vs. cpu myth: an evaluation of throughput computing on cpu and gpu. In Proceedings of the 37th annual international symposium on Computer architecture, ISCA '10, pages 451--460, New York, NY, USA, 2010. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Ondrej Lhoták and Laurie Hendren. Scaling Java points-to analysis using Spark. In G. Hedin, editor, Compiler Construction, 12th International Conference, volume 2622 of LNCS, pages 153--169, Warsaw, Poland, April 2003. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Lijuan Luo, Martin Wong, and Wen-mei Hwu. An effective gpu implementation of breadth-first search. In Proceedings of the 47th Design Automation Conference, DAC '10, pages 52--55, New York, NY, USA, 2010. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Mario Méndez-Lojo, Augustine Mathew, and Keshav Pingali. Parallel inclusion-based points-to analysis. In Proceedings of the 24th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA'10), October 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Duane G. Merrill, Michael Garland, and Andrew S. Grimshaw. Scalable gpu graph traversal. In 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'12, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Donald Nguyen and Keshav Pingali. Synthesizing concurrent schedulers for irregular algorithms. In ASPLOS '11: Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. NVIDIA. Thrust library version 1.4.0. http://code.google.com/p/thrust/.Google ScholarGoogle Scholar
  29. Keshav Pingali, Donald Nguyen, Milind Kulkarni, Martin Burtscher, M. Amber Hassaan, Rashid Kaleem, Tsung-Hsien Lee, Andrew Lenharth, Roman Manevich, Mario Méndez-Lojo, Dimitrios Prountzos, and Xin Sui. The tao of parallelism in algorithms. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, PLDI '11, pages 12--25, New York, NY, USA, 2011. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Tarun Prabhu, Shreyas Ramalingam, Matthew Might, and Mary Hall. Eigencfa: accelerating flow analysis with gpus. In Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL '11, pages 511--522, New York, NY, USA, 2011. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Thomas W. Reps. Program analysis via graph reachability. Technical Report Technical Report Number 1386, University of Wisconsin, 1998.Google ScholarGoogle ScholarCross RefCross Ref
  32. Atanas Rountev and Satish Chandra. Off-line variable substitution for scaling points-to analysis. In Proc. Conf. on Programming Language Design and Implementation (PLDI), pages 47--56, New York, NY, USA, 2000. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Bjarne Steensgaard. Points-to analysis in almost linear time. In POPL '96: Proceedings of the 23rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 32--41, New York, NY, USA, 1996. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Richard Vuduc, Aparna Chandramowlishwaran, Jee Choi, Murat Guney, and Aashay Shringarpure. On the limits of gpu acceleration. In Proceedings of the 2nd USENIX conference on Hot topics in parallelism, HotPar'10, pages 13--13, Berkeley, CA, USA, 2010. USENIX Association. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. John Whaley and Monica S. Lam. Cloning-based context-sensitive pointer alias analysis using binary decision diagrams. In Proc. Conf. on Programming Language Design and Implementation (PLDI), pages 131--144, New York, NY, USA, 2004. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, and Umit Catalyurek. A scalable distributed parallel breadth-first search algorithm on bluegene/l. In Proceedings of the 2005 ACM/IEEE conference on Supercomputing, SC '05, pages 25--, Washington, DC, USA, 2005. IEEE Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. A GPU implementation of inclusion-based points-to analysis

      Recommendations

      Reviews

      Henk Sips

      Modern computing devices like graphical processing units (GPUs) require high-performance algorithms to adhere to a few principles like regular computations on aligned data structures and the careful use of synchronization constructs. This is challenging for many current algorithms and especially for graph-based algorithms. Graphs generally exhibit an irregular connection structure that may even change during computation. This paper deals with the implementation on GPUs of a graph algorithm necessary for inclusion-based points-to analysis, an algorithm that is used within compilers to determine which variables each pointer in a program points to. It is a challenging algorithm because it is highly irregular, requires relatively little computation, and makes extensive modification of the underlying graph during the processing of that graph. The solution described in the paper is to modify the representation of the graph, modify the algorithm, and add GPU-based optimizations. Classical representation of points-to graphs may lead to excessive memory space since these graphs are very sparse. Alternative compressed sparse schemes add new edges that cannot be statically predicted, leading to efficiency problems on GPUs. The authors use wide sparse bit vectors, chosen because a good alignment exists with the resources of the GPU (such as bus and memory banks). As such, data is aligned, avoiding shared memory bank conflicts and thread divergence. The price, however, is the need for more memory to accommodate the representation, which seems to be a fair trade-off in this case. The modification of the graph algorithm relates to the parallel rule application in this algorithm. Parallel execution of the graph rewrite rules requires many fine-grained synchronizations in the graph data structure, which may lead to severe performance problems when done on a GPU. To solve these problems, the authors propose modified rewrite rules that require lower synchronization overhead by storing the reversed copy edge instead of the original copy edge in the graph. Finally, the authors describe a number of GPU-specific optimizations for the processing of these types of points-to graphs. In the section on experimental validation, the authors compare their GPU approach to an earlier derived multicore central processing unit (CPU) solution, and to a classical sequential solution. The results are that, on average, the GPU solution is slightly faster (seven times) than the multicore solution (six times), but requires 35 percent more programming effort, according to the authors (assuming that the intellectual work of rethinking the algorithm is not included). The main contribution of this paper is not so much the absolute speed that can be obtained from using a GPU with this specific algorithm, but rather that seemingly GPU-unfriendly algorithms like graph algorithms can be recast into GPU-friendly ones by careful reconsideration and redesign of the shape and representation of the data structures and the associated algorithms. Online Computing Reviews Service

      Access critical reviews of Computing literature here

      Become a reviewer for Computing Reviews.

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM SIGPLAN Notices
        ACM SIGPLAN Notices  Volume 47, Issue 8
        PPOPP '12
        August 2012
        334 pages
        ISSN:0362-1340
        EISSN:1558-1160
        DOI:10.1145/2370036
        Issue’s Table of Contents
        • cover image ACM Conferences
          PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
          February 2012
          352 pages
          ISBN:9781450311601
          DOI:10.1145/2145816

        Copyright © 2012 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 25 February 2012

        Check for updates

        Qualifiers

        • research-article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader