Abstract
This article addresses the compilation of a sequential program for parallel execution on a modern GPU. To this end, we present a novel source-to-source compiler called PPCG. PPCG stands out for its ability to accelerate computations from any static control loop nest, generating multiple CUDA kernels when necessary. We introduce a multilevel tiling strategy and a code generation scheme for the parallelization and locality optimization of imperfectly nested loops, managing memory and exposing concurrency according to the constraints of modern GPUs. We evaluate our algorithms and tool on the entire PolyBench suite.
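As a rough illustration of the kind of loop tiling such a compiler performs, the sketch below shows a single level of tiling applied to a simple two-dimensional loop nest in plain C. This is an invented minimal example, not actual PPCG output: the function names `untiled` and `tiled`, the elementwise-addition body, and the tile size are all assumptions made for illustration. On a GPU, the outer tile loops would correspond to the CUDA block grid and the intra-tile loops to the threads of a block.

```c
#include <assert.h>
#include <string.h>

#define N 64      /* problem size (hypothetical) */
#define TILE 16   /* tile size; here TILE divides N evenly */

/* Original loop nest: a simple elementwise update, C[i][j] = A[i][j] + B[i][j]. */
static void untiled(int A[N][N], int B[N][N], int C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = A[i][j] + B[i][j];
}

/* Tiled version: the outer (ti, tj) loops enumerate tiles -- on a GPU these
   would map to the CUDA block grid -- while the inner (i, j) loops scan one
   tile, corresponding to the threads of a block. */
static void tiled(int A[N][N], int B[N][N], int C[N][N]) {
    for (int ti = 0; ti < N; ti += TILE)
        for (int tj = 0; tj < N; tj += TILE)
            for (int i = ti; i < ti + TILE; i++)
                for (int j = tj; j < tj + TILE; j++)
                    C[i][j] = A[i][j] + B[i][j];
}
```

Because the statement instances are independent, both versions compute the same result; the tiled form merely reorders the iterations so that each tile touches a small, reusable working set.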
Index Terms: Polyhedral parallel code generation for CUDA