DOI: 10.1145/2884045.2884046

research-article

Performance portable GPU code generation for matrix multiplication

Published: 12 March 2016

ABSTRACT

Parallel accelerators such as GPUs are notoriously hard to program; exploiting their full performance potential is a job best left for ninja programmers. High-level programming languages coupled with optimizing compilers have been proposed to address this issue. However, they rely on device-specific heuristics or hard-coded library implementations to achieve good performance, resulting in non-portable solutions that need to be re-optimized for every new device.

Achieving performance portability is the holy grail of high-performance computing and has so far remained an open problem, even for well-studied applications like matrix multiplication. We argue that what is needed is a way to describe applications at a high level without committing to particular implementations. To this end, in a previous paper we developed a functional data-parallel language which allows applications to be expressed in a device-neutral way. We use a set of well-defined rewrite rules to automatically transform programs into semantically equivalent device-specific forms, from which OpenCL code is generated.
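To illustrate the rewrite-rule idea, the sketch below shows a classic semantics-preserving rule on a toy list-based representation: map(f) can be rewritten as join ∘ map(map(f)) ∘ split(n), exposing nested parallelism of the kind a compiler can later map onto OpenCL work-groups and work-items. The names split/join follow common data-parallel vocabulary; this is a hypothetical mini-example in Python, not the paper's actual language or primitives.

```python
# Toy illustration of a semantics-preserving rewrite rule (split-join),
# assuming a simple list-based representation -- not the paper's real IR.

def map_f(f, xs):
    # Plain data-parallel map.
    return [f(x) for x in xs]

def split(n, xs):
    # Partition xs into chunks of length n (assumes len(xs) divisible by n).
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def join(xss):
    # Flatten one level of nesting; the inverse of split.
    return [x for chunk in xss for x in chunk]

def map_rewritten(f, xs, n):
    # split-join rule: map(f) == join . map(map(f)) . split(n)
    # The outer map can become work-groups, the inner map work-items.
    return join(map_f(lambda chunk: map_f(f, chunk), split(n, xs)))

xs = list(range(8))
f = lambda x: x * x
assert map_f(f, xs) == map_rewritten(f, xs, 4)  # semantically equivalent
```

Because both sides compute the same result by construction, a compiler is free to apply such rules mechanically, which is what makes the derived implementations provably correct.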

In this paper, we demonstrate how this approach produces high-performance OpenCL code for GPUs with a well-studied, well-understood application: matrix multiplication. Starting from a single high-level program, our compiler automatically generates highly optimized and specialized implementations. We group simple rewrite rules into more complex macro-rules, each describing a well-known optimization such as tiling or register blocking, in a composable way. Using an exploration strategy, our compiler automatically generates 50,000 OpenCL kernels, each providing a differently optimized -- but provably correct -- implementation of matrix multiplication. The automatically generated code offers competitive performance compared to the manually tuned MAGMA library implementations of matrix multiplication on Nvidia GPUs and even outperforms AMD's clBLAS library.
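The tiling optimization mentioned above can be sketched as follows: C = A x B is computed one tile at a time, with a per-element accumulator kept in a local variable, mirroring how a generated OpenCL kernel would stage tiles in local memory and block values in registers. This is a plain-Python sketch of the general technique, not the paper's generated code.

```python
# Sketch of tiled matrix multiplication with a register-blocked accumulator.
# Pure Python for clarity; a real OpenCL kernel would stage each tile in
# local memory and keep `acc` in a register.

def matmul_naive(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matmul_tiled(A, B, tile=2):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile over rows of C
        for j0 in range(0, m, tile):      # tile over columns of C
            for p0 in range(0, k, tile):  # tile over the reduction dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = C[i][j]     # accumulator, kept in a "register"
                        for p in range(p0, min(p0 + tile, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C

# Non-square inputs exercise the tile-boundary clamping above.
A = [[float(i * 3 + j) for j in range(3)] for i in range(4)]   # 4 x 3
B = [[float(i * 5 + j) for j in range(5)] for i in range(3)]   # 3 x 5
assert matmul_tiled(A, B) == matmul_naive(A, B)
```

Because tiling only reorders the independent additions of the dot products, it is exactly the kind of transformation that can be expressed as a composable, provably correct macro-rule.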


Published in

GPGPU '16: Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit
March 2016, 107 pages
ISBN: 9781450341950
DOI: 10.1145/2884045

Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

GPGPU '16 paper acceptance rate: 9 of 23 submissions (39%). Overall acceptance rate: 57 of 129 submissions (44%).
