research-article

A GPU implementation of inclusion-based points-to analysis

Authors:
Mario Mendez-Lojo

University of Texas, Austin, TX, USA

University of Texas, Austin, TX, USA
View Profile

,
Martin Burtscher

Texas State University, San Marcos, TX, USA

Texas State University, San Marcos, TX, USA
View Profile

,
Keshav Pingali

University of Texas, Austin, TX, USA

University of Texas, Austin, TX, USA
View Profile

Authors Info & Claims

ACM SIGPLAN Notices Volume 47 Issue 8August 2012pp 107–116https://doi.org/10.1145/2370036.2145831

Published:25 February 2012Publication History

ACM SIGPLAN Notices

Abstract

Graphics Processing Units (GPUs) have emerged as powerful accelerators for many regular algorithms that operate on dense arrays and matrices. In contrast, we know relatively little about using GPUs to accelerate highly irregular algorithms that operate on pointer-based data structures such as graphs. For the most part, research has focused on GPU implementations of graph analysis algorithms that do not modify the structure of the graph, such as algorithms for breadth-first search and strongly-connected components.

In this paper, we describe a high-performance GPU implementation of an important graph algorithm used in compilers such as gcc and LLVM: Andersen-style inclusion-based points-to analysis. This algorithm is challenging to parallelize effectively on GPUs because it makes extensive modifications to the structure of the underlying graph and performs relatively little computation. In spite of this, our program, when executed on a 14 Streaming Multiprocessor GPU, achieves an average speedup of 7x compared to a sequential CPU implementation and outperforms a parallel implementation of the same algorithm running on 16 CPU cores.

Our implementation provides general insights into how to produce high-performance GPU implementations of graph algorithms, and it highlights key differences between optimizing parallel programs for multicore CPUs and for GPUs.

References

NVIDIA's Next Generation CUDA Compute Architecture: Fermi. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf, 2010.Google Scholar
CUDA C Programming Guide 4.0. NVIDIA, 2011.Google Scholar
L. O. Andersen. Program Analysis and Specialization for the C Programming Language. PhD thesis, DIKU, University of Copenhagen, May 1994. (DIKU report 94/19).Google Scholar
David A. Bader and Kamesh Madduri. Designing multithreaded algorithms for breadth-first search and st-connectivity on the cray mta-2. In Proceedings of the 2006 International Conference on Parallel Processing, ICPP '06, pages 523--530, Washington, DC, USA, 2006. IEEE Computer Society. Google ScholarDigital Library
J. Barnat, P. Bauch, L. Brim, and M. Ceska. Computing Strongly Connected Components in Parallel on CUDA. In Proceedings of the 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS'11), pages 541--552. IEEE Computer Society, 2011. Google ScholarDigital Library
Marc Berndl, Ondrej Lhoták, Feng Qian, Laurie Hendren, and Navindra Umanee. Points-to analysis using BDDs. In Proc. Conf. on Programming Language Design and Implementation (PLDI), pages 103--114, New York, NY, USA, 2003. ACM. Google ScholarDigital Library
Ulrik Brandes and Thomas Erlebach, editors. Network Analysis: Methodological Foundations. Springer-Verlag, 2005. Google ScholarDigital Library
Randal E. Bryant. Graph-based algorithms for boolean function manipulation. IEEE Transactions on Computers, 35:677--691, 1986. Google ScholarDigital Library
Martin Burtscher and Keshav Pingali. An efficient CUDA implementation of the tree-based barnes hut n-body algorithm. In GPU Computing Gems Emerald Edition, pages 75--92. Morgan Kaufmann, 2011.Google ScholarCross Ref
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, and Kevin Skadron. A performance study of general-purpose applications on graphics processors using cuda. J. Parallel Distrib. Comput., 68:1370--1380, October 2008. Google ScholarDigital Library
L. Paul Chew. Guaranteed-quality mesh generation for curved surfaces. In Proc. Symp. on Computational Geometry (SCG), 1993. Google ScholarDigital Library
Manuel Fahndrich, Jeffrey S. Foster, Zhendong Su, and Alexander Aiken. Partial online cycle elimination in inclusion constraint graphs. In Proc. Conf. on Programming Language Design and Implementation (PLDI), pages 85--96, New York, NY, USA, 1998. ACM. Google ScholarDigital Library
Ben Hardekopf and Calvin Lin. The ant and the grasshopper: fast and accurate pointer analysis for millions of lines of code. In Proc. Conf. on Programming Language Design and Implementation (PLDI), 2007. Google ScholarDigital Library
Pawan Harish and P. J. Narayanan. Accelerating large graph algorithms on the gpu using cuda. In HiPC'07: Proceedings of the 14th international conference on High performance computing, pages 197--208, Berlin, Heidelberg, 2007. Springer-Verlag. Google ScholarDigital Library
Nevin Heintze and Olivier Tardieu. Ultra-fast aliasing analysis using cla: a million lines of c code in a second. SIGPLAN Not., 36(5):254--263, 2001. Google ScholarDigital Library
Fritz Henglein. Type inference and semi-unification. In Proceedings of the 1988 ACM conference on LISP and functional programming, LFP '88, pages 184--197, New York, NY, USA, 1988. ACM. Google ScholarDigital Library
Michael Hind. Pointer analysis: haven't we solved this problem yet? In PASTE '01: Proceedings of the 2001 ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering, pages 54--61, New York, NY, USA, 2001. ACM. Google ScholarDigital Library
Sungpack Hong, Sang Kyun Kim, Tayo Oguntebi, and Kunle Olukotun. Accelerating cuda graph algorithms at maximum warp. In Proceedings of the 16th ACM symposium on Principles and practice of parallel programming, PPoPP '11, pages 267--276, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun. Efficient parallel graph exploration on multi-core cpu and gpu. In 20th International Conference on Parallel Architectures and Compilation Techniques, PACT'11, 2011. Google ScholarDigital Library
Song Huang, Shucai Xiao, and Wu chun Feng. On the energy efficiency of graphics processing units for scientific computing. In IPDPS, pages 1--8, 2009. Google ScholarDigital Library
Milind Kulkarni, Keshav Pingali, Bruce Walter, Ganesh Ramanarayanan, Kavita Bala, and L. Paul Chew. Optimistic parallelism requires abstractions. SIGPLAN Not. (Proceedings of PLDI), 42(6):211--222, 2007. Google ScholarDigital Library
Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey. Debunking the 100x gpu vs. cpu myth: an evaluation of throughput computing on cpu and gpu. In Proceedings of the 37th annual international symposium on Computer architecture, ISCA '10, pages 451--460, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
Ondrej Lhoták and Laurie Hendren. Scaling Java points-to analysis using Spark. In G. Hedin, editor, Compiler Construction, 12th International Conference, volume 2622 of LNCS, pages 153--169, Warsaw, Poland, April 2003. Springer. Google ScholarDigital Library
Lijuan Luo, Martin Wong, and Wen-mei Hwu. An effective gpu implementation of breadth-first search. In Proceedings of the 47th Design Automation Conference, DAC '10, pages 52--55, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
Mario Méndez-Lojo, Augustine Mathew, and Keshav Pingali. Parallel inclusion-based points-to analysis. In Proceedings of the 24th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA'10), October 2010. Google ScholarDigital Library
Duane G. Merrill, Michael Garland, and Andrew S. Grimshaw. Scalable gpu graph traversal. In 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'12, 2012. Google ScholarDigital Library
Donald Nguyen and Keshav Pingali. Synthesizing concurrent schedulers for irregular algorithms. In ASPLOS '11: Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems, 2011. Google ScholarDigital Library
NVIDIA. Thrust library version 1.4.0. http://code.google.com/p/thrust/.Google Scholar
Keshav Pingali, Donald Nguyen, Milind Kulkarni, Martin Burtscher, M. Amber Hassaan, Rashid Kaleem, Tsung-Hsien Lee, Andrew Lenharth, Roman Manevich, Mario Méndez-Lojo, Dimitrios Prountzos, and Xin Sui. The tao of parallelism in algorithms. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, PLDI '11, pages 12--25, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
Tarun Prabhu, Shreyas Ramalingam, Matthew Might, and Mary Hall. Eigencfa: accelerating flow analysis with gpus. In Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL '11, pages 511--522, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
Thomas W. Reps. Program analysis via graph reachability. Technical Report Technical Report Number 1386, University of Wisconsin, 1998.Google ScholarCross Ref
Atanas Rountev and Satish Chandra. Off-line variable substitution for scaling points-to analysis. In Proc. Conf. on Programming Language Design and Implementation (PLDI), pages 47--56, New York, NY, USA, 2000. ACM. Google ScholarDigital Library
Bjarne Steensgaard. Points-to analysis in almost linear time. In POPL '96: Proceedings of the 23rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 32--41, New York, NY, USA, 1996. ACM. Google ScholarDigital Library
Richard Vuduc, Aparna Chandramowlishwaran, Jee Choi, Murat Guney, and Aashay Shringarpure. On the limits of gpu acceleration. In Proceedings of the 2nd USENIX conference on Hot topics in parallelism, HotPar'10, pages 13--13, Berkeley, CA, USA, 2010. USENIX Association. Google ScholarDigital Library
John Whaley and Monica S. Lam. Cloning-based context-sensitive pointer alias analysis using binary decision diagrams. In Proc. Conf. on Programming Language Design and Implementation (PLDI), pages 131--144, New York, NY, USA, 2004. ACM. Google ScholarDigital Library
Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, and Umit Catalyurek. A scalable distributed parallel breadth-first search algorithm on bluegene/l. In Proceedings of the 2005 ACM/IEEE conference on Supercomputing, SC '05, pages 25--, Washington, DC, USA, 2005. IEEE Computer Society. Google ScholarDigital Library

Index Terms

A GPU implementation of inclusion-based points-to analysis
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages

Recommendations

A GPU implementation of inclusion-based points-to analysis
PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming

Graphics Processing Units (GPUs) have emerged as powerful accelerators for many regular algorithms that operate on dense arrays and matrices. In contrast, we know relatively little about using GPUs to accelerate highly irregular algorithms that operate ...
Read More
Out-of-core implementation for accelerator kernels on heterogeneous clouds

Cloud environments today are increasingly featuring hybrid nodes containing multicore CPU processors and a diverse mix of accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors, and Field-Programmable Gate Arrays (FPGAs) to ...
Read More
Architecture-Aware Mapping and Optimization on a 1600-Core GPU
ICPADS '11: Proceedings of the 2011 IEEE 17th International Conference on Parallel and Distributed Systems

The graphics processing unit (GPU) continues to make in-roads as a computational accelerator for high-performance computing (HPC). However, despite its increasing popularity, mapping and optimizing GPU code remains a difficult task, it is a multi-...
Read More

Reviews

Reviewer: Henk Sips

Modern computing devices like graphical processing units (GPUs) require high-performance algorithms to adhere to a few principles like regular computations on aligned data structures and the careful use of synchronization constructs. This is challenging for many current algorithms and especially for graph-based algorithms. Graphs generally exhibit an irregular connection structure that may even change during computation. This paper deals with the implementation on GPUs of a graph algorithm necessary for inclusion-based points-to analysis, an algorithm that is used within compilers to determine which variables each pointer in a program points to. It is a challenging algorithm because it is highly irregular, requires relatively little computation, and makes extensive modification of the underlying graph during the processing of that graph. The solution described in the paper is to modify the representation of the graph, modify the algorithm, and add GPU-based optimizations. Classical representation of points-to graphs may lead to excessive memory space since these graphs are very sparse. Alternative compressed sparse schemes add new edges that cannot be statically predicted, leading to efficiency problems on GPUs. The authors use wide sparse bit vectors, chosen because a good alignment exists with the resources of the GPU (such as bus and memory banks). As such, data is aligned, avoiding shared memory bank conflicts and thread divergence. The price, however, is the need for more memory to accommodate the representation, which seems to be a fair trade-off in this case. The modification of the graph algorithm relates to the parallel rule application in this algorithm. Parallel execution of the graph rewrite rules requires many fine-grained synchronizations in the graph data structure, which may lead to severe performance problems when done on a GPU. To solve these problems, the authors propose modified rewrite rules that require lower synchronization overhead by storing the reversed copy edge instead of the original copy edge in the graph. Finally, the authors describe a number of GPU-specific optimizations for the processing of these types of points-to graphs. In the section on experimental validation, the authors compare their GPU approach to an earlier derived multicore central processing unit (CPU) solution, and to a classical sequential solution. The results are that, on average, the GPU solution is slightly faster (seven times) than the multicore solution (six times), but requires 35 percent more programming effort, according to the authors (assuming that the intellectual work of rethinking the algorithm is not included). The main contribution of this paper is not so much the absolute speed that can be obtained from using a GPU with this specific algorithm, but rather that seemingly GPU-unfriendly algorithms like graph algorithms can be recast into GPU-friendly ones by careful reconsideration and redesign of the shape and representation of the data structures and the associated algorithms. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM SIGPLAN Notices Volume 47, Issue 8
PPOPP '12
August 2012
334 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2370036
Issue’s Table of Contents
PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
February 2012
352 pages
ISBN:9781450311601
DOI:10.1145/2145816
General Chair:
J. Ramanujam
Louisiana State University, USA
,
Program Chair:
P. Sadayappan
The Ohio State University, USA
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 25 February 2012
Check for updates
Author Tags
CUDA
GPU
graph algorithms
inclusion-based points-to analysis
irregular programs
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 98
  Total Citations
  View Citations
- 998
  Total Downloads
- Downloads (Last 12 months)49
- Downloads (Last 6 weeks)8
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

A GPU implementation of inclusion-based points-to analysis

ACM SIGPLAN Notices

Abstract

References

Cited By

Index Terms

Recommendations

A GPU implementation of inclusion-based points-to analysis

Out-of-core implementation for accelerator kernels on heterogeneous clouds

Architecture-Aware Mapping and Optimization on a 1600-Core GPU

Reviews

Access critical reviews of Computing literature here