skip to main content
research-article
Free Access

Polyhedral parallel code generation for CUDA

Published:20 January 2013Publication History
Skip Abstract Section

Abstract

This article addresses the compilation of a sequential program for parallel execution on a modern GPU. To this end, we present a novel source-to-source compiler called PPCG. PPCG singles out for its ability to accelerate computations from any static control loop nest, generating multiple CUDA kernels when necessary. We introduce a multilevel tiling strategy and a code generation scheme for the parallelization and locality optimization of imperfectly nested loops, managing memory and exposing concurrency according to the constraints of modern GPUs. We evaluate our algorithms and tool on the entire PolyBench suite.

References

  1. Allen, R. and Kennedy, K. 1987. Automatic translation of fortran programs to vector form. ACM Trans. Program. Lang. Syst. 9, 4, 491--542. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Allen, R. and Kennedy, K. 2001. Optimizing Compilers for Modern Architectures. Morgan Kaufmann Publishers. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Amini, M., Coelho, F., Irigoin, F., and Keryell, R. 2011. Static compilation analysis for host-accelerator communication optimization. In Workshop on Languages and Compilers for Parallel Computing (LCPC'11). Lecture Notes in Computer Science. Springer.Google ScholarGoogle Scholar
  4. AMP 2011. C++ accelerated massive parallelism. http://msdn.microsoft.com/en-us/library/hh265137Google ScholarGoogle Scholar
  5. Baghdadi, S., Grösslinger, A., and Cohen, A. 2010. Putting automatic polyhedral compilation for GPGPU to work. In Proceedings of the International Workshop Compilers for Parallel Computer (CPC).Google ScholarGoogle Scholar
  6. Baskaran, M., Ramanujam, J., and Sadayappan, P. 2010. Automatic C-to-CUDA code generation for affine programs. In Compiler Construction (CC 10), Held as Part of the Joint European Conferences on Theory and Practice of Software, (ETAPS 10), Lecture Notes in Computer Science, vol. 6011. Springer, 244--263. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Bastoul, C. 2004. Code generation in the polyhedral model is easier than you think. In Proceedings of the International Conference on parallel Architectures and Compilation Techniques (PACT'04). IEEE Computer Society, Washington, DC, 7--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Benabderrahmane, M.-W., Pouchet, L.-N., Cohen, A., and Bastoul, C. 2010. The polyhedral model is more widely applicable than you think. In Proceedings of the International Conference on Compiler Construction (CC'10). Lecture Notes in Computer Science, vol. 6011. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Bik, A. J. C. 2004. The Software Vectorization Handbook. Applying Multimedia Extensions for Maximum Performance. Intel Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Bondhugula, U. 2012. PLuTo: An automatic parallelizer and locality optimizer for multicores, version 0.7. http://pluto-compiler.sourceforge.net/Google ScholarGoogle Scholar
  11. Bondhugula, U., Hartono, A., Ramanujam, J., and Sadayappan, P. 2008a. A practical automatic polyhedral parallelizer and locality optimizer. SIGPLAN Not. 43, 6, 101--113. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Bondhugula, U., Ramanujam, J., 2008b. PLuTo: A practical and fully automatic polyhedral program optimization system. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Boulet, P., Darte, A., Silber, G.-A., and Vivien, F. 1998. Loop parallelization algorithms: From parallelism extraction to code generation. Parallel Comput. 24, 421--444. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Chen, C., Chame, J., and Hall, M. 2008. A framework for composing high-level loop transformations. Tech. rep., USC Computer Science.Google ScholarGoogle Scholar
  15. Cooper, P., Dolinsky, U., Donaldson, A. F., Richards, A., Riley, C., and Russell, G. 2010. Offload - Automating code migration to heterogeneous multicore systems. In Proceedings of International Conference on High-Performance Embedded Architectures and Compilers (HIPEAC). 337--352. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Feautrier, P. 1991. Dataflow analysis of array and scalar references. Int. J. Parallel Program. 20, 1, 23--53.Google ScholarGoogle ScholarCross RefCross Ref
  17. Feautrier, P. 1992a. Some efficient solutions to the affine scheduling problem. Part I. One-Dimensional time. Int. J. Parallel Program. 21, 313--347. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Feautrier, P. 1992b. Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time. Int. J. Parallel Program. 21, 389--420. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Ferrer, R., Beltran, V., González, M., Martorell, X., and Ayguadé, E. 2010. Analysis of task offloading for accelerators. In Proceedings of the International Conference on High-Performance Embedded Architectures and Compilers (HiPEAC). 322--336. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Grosser, T., Zheng, H., A, R., Simbürger, A., Grösslinger, A., and Pouchet, L.-N. 2011. Polly: Polyhedral optimization in llvm. In 1st Interantional Workshop on Polyhedral Compilation Techniques (IMPACT'11).Google ScholarGoogle Scholar
  21. Grösslinger, A. 2009. Precise management of scratchpad memories for localising array accesses in scientific codes. In CC'09. Springer, 236--250. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. HMPP 2010. HMPP workbench: directive-based multi-language and multi-target hybrid programming model. http://www.caps-entreprise.com/hmpp.htmlGoogle ScholarGoogle Scholar
  23. HPC Project. 2012. Par4All automatic parallelization version 1.3. http://www.par4all.orgGoogle ScholarGoogle Scholar
  24. Lee, S. and Eigenmann, R. 2010. OpenMPC: Extended openmp programming and tuning for GPUs. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10). IEEE Computer Society, Washington, DC, 1--11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Lee, S., Min, S.-J., and Eigenmann, R. 2009. OpenMP to GPGPU: A compiler framework for automatic translation and optimization. In Proceedings of the Symposium on Principles and Practice of Parallel Programming. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Leung, A., Vasilache, N., Meister, B., Baskaran, M., Wohlford, D., Bastoul, C., and Lethin, R. 2010. A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU'10). ACM, New York, 51--61. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Lim, A. 2001. Improving parallelism and data locality with affine partitioning. M.S. thesis, Stanford University. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Liu, Y., Zhang, E. Z., and Shen, X. 2009. A cross-input adaptive framework for gpu programs optimization. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Nuzman, D., Rosen, I., and Zaks, A. 2006. Auto-vectorization of interleaved data for SIMD. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI'06). Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Nuzman, D. and Zaks, A. 2008. Outer-loop vectorization - revisited for short SIMD architectures. In International Conference on Parallel Architecture and Compilation Techniques (PACT'08). Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. NVIDIA Corporation 2011. NVIDIA CUDA Programming guide 4.0. NVIDIA Corporation.Google ScholarGoogle Scholar
  32. OpenACC 2011. OpenACC: Directives for accelerators. http://www.openacc-standard.orgGoogle ScholarGoogle Scholar
  33. PoCC 2012. PoCC: the polyhedral compiler collection version 1.1. http://www.cse.ohio-state.edu/~pouchet/software/pocc/Google ScholarGoogle Scholar
  34. Ravi, N., Yang, Y., Bao, T., and Chakradhar, S. 2012. Apricot: An optimizing compiler and productivity tool for x86-compatible many-core coprocessors. In International Conference on Supercomputing (ICS'12). Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Renganarayanan, L., Kim, D., Rajopadhye, S., and Strout, M. 2007. Parameterized tiled loops for free. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Rudy, G., Khan, M. M., Hall, M., Chen, C., and Jacqueline, C. 2011. A programming language interface to describe transformations and code generation. In Proceedings of the 23rd international conference on Languages and Compilers for Parallel Computing (LCPC'10). Springer, 136--150. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Ryoo, S., Rodrigues, C. I., Baghsorkhi, S. S., Stone, S. S., Kirk, D. B., and Hwu, W.-M. 2008a. Optimization principles and application perform- ance evaluation of a multithreaded GPU using CUDA. In Proceedings of the Symposium on Principles and Practice of Parallel Programming (PPoPP'08). Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Ryoo, S., Rodrigues, C. I., Stone, S. S., Baghsorkhi, S. S., S. Ueng, J. A. S., and Hwu, W.-M. 2008b. Optimization space pruning for a multithreaded GPU. In Proceedings of the International Symposium on Code Generation and Optimization (CGO'08). Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Shin, J., Chame, J., and Hall, M. W. 2002. Compiler-controlled caching in superword register files for multimedia extension architectures. In International Conference on Parallel Architecture and Compilation Techniques (PACT'02). Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Shin, J., Hall, M., and Chame, J. 2005. Superword-level parallelism in the presence of control flow. In Proceedings of the Interanational Symposium on Code Generation and Optimization (CGO'05). Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. The Portland Group 2010. PGI Accelerator Programming Model for Fortran & C v1.3 ed. The Portland Group.Google ScholarGoogle Scholar
  42. Trifunović, K., Nuzman, D., Cohen, A., Zaks, A., and Rosen, I. 2009. Polyhedral-model guided loop-nest auto-vectorization. In International Conference on Parallel Architecture and Compilation Techniques (PACT'09). Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Ueng, S., Lathara, M., Baghsorkhi, S. S., and Hwu, W.-M. 2008. CUDA-lite: Reducing GPU programming complexity. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing (LCPC'08). Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Unkule, S., Shaltz, C., and Qasem, A. 2012. Automatic restructuring of gpu kernels for exploiting inter-thread data locality. In International Conference on Compiler Construction (CC'12). Lecture Notes in Computer Science, vol. 7210. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Vasilache, N., Meister, B., Baskaran, M., and Lethin, R. 2012. Joint scheduling and layout optimization to enable multi-level vectorization. In Proceedings of the International workshop on Polyhedral Compilation Techniques (IMPACT'12).Google ScholarGoogle Scholar
  46. Verdoolaege, S. 2010. isl: An integer set library for the polyhedral model. In International Conference on Mathematical Software (ICMS'10), K. Fukuda, J. Hoeven, M. Joswig, and N. Takayama, Eds. Lecture Notes in Computer Science Series, vol. 6327. Springer, 299--302. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Verdoolaege, S. and Grosser, T. 2012. Polyhedral extraction tool. In Proceedings of the International Workshop on Polyhedral Compilation Techniques (IMPACT'12).Google ScholarGoogle Scholar
  48. Whaley, R. C. and Petitet, A. 2005. Minimizing development and maintenance costs in supporting persistently optimized BLAS. Softw. Pract. Exper. 35, 2, 101--121. http://www.cs.utsa.edu/~whaley/papers/spercw04.ps+. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Wolfe, M. 1996. High Performance Compilers for Parallel Computing. Addison Wesley. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Wu, P., Eichenberger, A. E., Wang, A., and Zhao, P. 2005. An integrated Simdization framework using virtual vectors. In International Conference on Supercomputing (ICS'05). Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Polyhedral parallel code generation for CUDA

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Transactions on Architecture and Code Optimization
      ACM Transactions on Architecture and Code Optimization  Volume 9, Issue 4
      Special Issue on High-Performance Embedded Architectures and Compilers
      January 2013
      876 pages
      ISSN:1544-3566
      EISSN:1544-3973
      DOI:10.1145/2400682
      Issue’s Table of Contents

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 20 January 2013
      • Accepted: 1 October 2012
      • Revised: 1 September 2012
      • Received: 1 June 2012
      Published in taco Volume 9, Issue 4

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article
      • Research
      • Refereed

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader