Abstract
This article addresses the compilation of a sequential program for parallel execution on a modern GPU. To this end, we present a novel source-to-source compiler called PPCG. PPCG stands out for its ability to accelerate computations from any static control loop nest, generating multiple CUDA kernels when necessary. We introduce a multilevel tiling strategy and a code generation scheme for the parallelization and locality optimization of imperfectly nested loops, managing memory and exposing concurrency according to the constraints of modern GPUs. We evaluate our algorithms and tool on the entire PolyBench suite.
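As a rough illustration of the kind of loop tiling such a compiler performs, the sketch below shows a single level of tiling applied to a simple two-dimensional loop nest in plain C. This is an invented minimal example, not actual PPCG output: the function names `untiled` and `tiled`, the elementwise-addition body, and the tile size are all assumptions made for illustration. On a GPU, the outer tile loops would correspond to the CUDA block grid and the intra-tile loops to the threads of a block.

```c
#include <assert.h>
#include <string.h>

#define N 64      /* problem size (hypothetical) */
#define TILE 16   /* tile size; here TILE divides N evenly */

/* Original loop nest: a simple elementwise update, C[i][j] = A[i][j] + B[i][j]. */
static void untiled(int A[N][N], int B[N][N], int C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = A[i][j] + B[i][j];
}

/* Tiled version: the outer (ti, tj) loops enumerate tiles -- on a GPU these
   would map to the CUDA block grid -- while the inner (i, j) loops scan one
   tile, corresponding to the threads of a block. */
static void tiled(int A[N][N], int B[N][N], int C[N][N]) {
    for (int ti = 0; ti < N; ti += TILE)
        for (int tj = 0; tj < N; tj += TILE)
            for (int i = ti; i < ti + TILE; i++)
                for (int j = tj; j < tj + TILE; j++)
                    C[i][j] = A[i][j] + B[i][j];
}
```

Because the statement instances are independent, both versions compute the same result; the tiled form merely reorders the iterations so that each tile touches a small, reusable working set.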
Index Terms: Polyhedral parallel code generation for CUDA