skip to main content
10.1145/1375527.1375562acmconferencesArticle/Chapter ViewAbstractPublication PagesicsConference Proceedingsconference-collections
research-article

A compiler framework for optimization of affine loop nests for gpgpus

Published:07 June 2008Publication History

ABSTRACT

GPUs are a class of specialized parallel architectures with tremendous computational power. The new Compute Unified Device Architecture (CUDA) programming model from NVIDIA facilitates programming of general purpose applications on their GPUs. However, manual development of high-performance parallel code for GPUs is still very challenging. In this paper, a number of issues are addressed towards the goal of developing a compiler framework for automatic parallelization and performance optimization of affine loop nests on GPGPUs: 1) approach to program transformation for efficient data access from GPU global memory, using a polyhedral compiler model of data dependence abstraction and program transformation; 2) determination of optimal padding factors for conflict-minimal data access from GPU shared memory; and 3) model-driven empirical search to determine optimal parameters for unrolling and tiling. Experimental results on a number of kernels demonstrate the effectiveness of the compiler optimization approaches developed.

References

  1. C. Ancourt and F. Irigoin. Scanning polyhedra with do loops. In PPoPP'91, pages 39--50, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. Automatic data movement and computation mapping for multilevel parallel architectures with explicitly managed memories. In ACM SIGPLAN PPoPP 2008, Feb. 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. C. Bastoul. Code generation in the polyhedral model is easier than you think. In PACT'04, pages 7--16, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. U. Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. Automatic transformations for communicationminimized parallelization and locality optimization in the polyhedral model. In International Conference on Compiler Construction (ETAPS CC), Apr. 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. U. Bondhugula, A. Hartono, J. Ramanujan, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In ACM SIGPLAN Programming Languages Design and Implementation (PLDI '08), 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: stream computing on graphics hardware. In SIGGRAPH '04, pages 777--786, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. CLooG: The Chunky Loop Generator. http://www.cloog.org.Google ScholarGoogle Scholar
  8. K. Fatahalian, J. Sugerman, and P. Hanrahan. Understanding the efficiency of GPU algorithms for matrixmatrix multiplication. In ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 133--137, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. P. Feautrier. Dataflow analysis of array and scalar references. IJPP, 20(1):23--53, 1991.Google ScholarGoogle ScholarCross RefCross Ref
  10. P. Feautrier. Some efficient solutions to the affine scheduling problem, part I: onedimensional time. IJPP, 21(5):313--348, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. P. Feautrier. Some efficient solutions to the affine scheduling problem, part II: multidimensional time. IJPP, 21(6):389--420, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. N. K. Govindaraju, S. Larsen, J. Gray, and D. Manocha. A memory model for scientific algorithms on graphics processors. In SC '06, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. GeneralPurpose Computation Using Graphics Hardware. http://www.gpgpu.org/.Google ScholarGoogle Scholar
  14. M. Griebl. Automatic Parallelization of Loop Programs for Distributed Memory Architectures. FMI, University of Passau, 2004. Habilitation Thesis.Google ScholarGoogle Scholar
  15. S.W. Liao, Z. Du, G. Wu, and G.Y. Lueh. Data and computation transformations for Brook streaming applications on multiprocessors. In CGO '06, pages 196--207, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. A. Lim. Improving Parallelism And Data Locality With Affine Partitioning. PhD thesis, Stanford University, Aug. 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. A. W. Lim and M. S. Lam. Maximizing parallelism and minimizing synchronization with affine transforms. In POPL, pages 201--214, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. NVIDIA CUDA. http://developer.nvidia.com/object/cuda.html.Google ScholarGoogle Scholar
  19. NVIDIA GeForce 8800. http://www.nvidia.com/page/geforce_8800.html.Google ScholarGoogle Scholar
  20. PLuTo: A polyhedral automatic parallelizer and locality optimizer for multicores. http://plutocompiler.sourceforge.net.Google ScholarGoogle Scholar
  21. L.N. Pouchet, C. Bastoul, A. Cohen, and N. Vasilache. Iterative optimization in the polyhedral model: Part I, onedimensional time. In CGO '07, pages 144--156, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. W. Pugh. The Omega test: a fast and practical integer programming algorithm for dependence analysis. Communications of the ACM, 8:102--114, Aug. 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. F. Quilleré, S. V. Rajopadhye, and D. Wilde. Generation of efficient nested loops from polyhedra. IJPP, 28(5):469--498, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. S. Ryoo, C. Rodrigues, S. Baghsorkhi, S. Stone, D. Kirk, and W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In ACM SIGPLAN PPoPP 2008, Feb. 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S. Ueng, and W. Hwu. Program optimization study on a 128core GPU. In The First Workshop on General Purpose Processing on Graphics Processing Units, October 2007.Google ScholarGoogle Scholar
  26. S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S. Ueng, J. Stratton, and W. Hwu. Program optimization space pruning for a multithreaded GPU. In CGO, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. D. Tarditi, S. Puri, and J. Oglesby. Accelerator: using data parallelism to program GPUs for generalpurpose uses. In ASPLOSXII, pages 325--335, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. N. Vasilache, C. Bastoul, and A. Cohen. Polyhedral code generation in the real world. In International Conference on Compiler Construction (ETAPS CC'06), pages 185--201, Mar. 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. A compiler framework for optimization of affine loop nests for gpgpus

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      ICS '08: Proceedings of the 22nd annual international conference on Supercomputing
      June 2008
      390 pages
      ISBN:9781605581583
      DOI:10.1145/1375527

      Copyright © 2008 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 7 June 2008

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate629of2,180submissions,29%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader