ABSTRACT
GPUs are a class of specialized parallel architectures with tremendous computational power. The Compute Unified Device Architecture (CUDA) programming model from NVIDIA facilitates the programming of general-purpose applications on its GPUs. However, manual development of high-performance parallel code for GPUs remains very challenging. This paper addresses a number of issues toward the goal of developing a compiler framework for automatic parallelization and performance optimization of affine loop nests on GPGPUs: 1) an approach to program transformation for efficient data access from GPU global memory, using a polyhedral compiler model of data dependence abstraction and program transformation; 2) determination of optimal padding factors for conflict-minimal data access from GPU shared memory; and 3) model-driven empirical search to determine optimal parameters for unrolling and tiling. Experimental results on a number of kernels demonstrate the effectiveness of the compiler optimization approaches developed.
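To make the first two optimizations concrete, the sketch below shows a hand-written tiled matrix transpose in CUDA. This is an illustration of the target code patterns, not code generated by the paper's framework; the kernel name, the tile size TILE, and the padding factor of 1 are assumptions chosen for exposition, whereas the framework derives such transformations and padding factors automatically.

```cuda
// Illustrative sketch (assumed names and parameters, not the paper's
// generated code): a tiled matrix transpose combining coalesced global
// memory access with a padded shared-memory tile.
#include <cuda_runtime.h>

#define TILE 16  // assumed tile size; the paper tunes such parameters

__global__ void transpose(const float *in, float *out, int n)
{
    // +1 padding column: without it, the column-wise reads below would
    // map the threads of a half-warp to the same shared-memory bank,
    // serializing the accesses.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Coalesced global load: consecutive threadIdx.x values read
    // consecutive global addresses.
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    // Swap block coordinates so the global store is also coalesced;
    // the tile is read column-wise, which the padding keeps conflict-free.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}
```

A typical launch would use TILE x TILE thread blocks over an (n/TILE) x (n/TILE) grid, e.g. transpose<<<dim3((n + TILE - 1) / TILE, (n + TILE - 1) / TILE), dim3(TILE, TILE)>>>(d_in, d_out, n).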