ABSTRACT
Parallel accelerators such as GPUs are notoriously hard to program; exploiting their full performance potential is a job best left for ninja programmers. High-level programming languages coupled with optimizing compilers have been proposed to address this issue. However, they rely on device-specific heuristics or hard-coded library implementations to achieve good performance, resulting in non-portable solutions that need to be re-optimized for every new device.
Achieving performance portability is the holy grail of high-performance computing and has so far remained an open problem, even for well-studied applications like matrix multiplication. We argue that what is needed is a way to describe applications at a high level without committing to particular implementations. To this end, in a previous paper we developed a functional data-parallel language which allows applications to be expressed in a device-neutral way. We use a set of well-defined rewrite rules to automatically transform programs into semantically equivalent device-specific forms, from which OpenCL code is generated.
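As a plain-Python sketch (not the paper's actual language or compiler), the kind of semantics-preserving rewrite rule referred to here can be illustrated by the split-join rule, which rewrites a flat map into nested maps over chunks — the shape that can later be assigned to OpenCL work-groups. The names `split`, `join`, and `rewrite_map` below are illustrative, not taken from the paper's implementation.

```python
# Split-join rewrite rule (sketch):
#   map(f, xs)  ==  join(map(lambda chunk: map(f, chunk), split(n, xs)))
# Both sides compute the same result, but the right-hand side exposes
# a two-level structure that maps naturally onto work-groups.

def split(n, xs):
    """Partition xs into consecutive chunks of length (at most) n."""
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def join(chunks):
    """Flatten one level of nesting -- the inverse of split."""
    return [x for chunk in chunks for x in chunk]

def rewrite_map(f, xs, n):
    """map f, rewritten via split-join; semantically identical to map f."""
    return join([[f(x) for x in chunk] for chunk in split(n, xs)])

xs = list(range(8))
square = lambda x: x * x
assert rewrite_map(square, xs, 4) == [square(x) for x in xs]
```

The assertion is exactly the correctness guarantee a rewrite rule must provide: both forms are interchangeable, so the compiler is free to pick whichever maps best onto the device.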
In this paper, we demonstrate how this approach produces high-performance OpenCL code for GPUs with a well-studied, well-understood application: matrix multiplication. Starting from a single high-level program, our compiler automatically generates highly optimized and specialized implementations. We group simple rewrite rules into more complex macro-rules, each describing a well-known optimization, such as tiling and register blocking, in a composable way. Using an exploration strategy, our compiler automatically generates 50,000 OpenCL kernels, each providing a differently optimized -- but provably correct -- implementation of matrix multiplication. The automatically generated code offers competitive performance compared to the manually tuned MAGMA library implementations of matrix multiplication on Nvidia GPUs, and even outperforms AMD's clBLAS library.
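The tiling optimization that one such macro-rule encodes can be sketched in plain Python. The paper's generated code is OpenCL; this sequential CPU version, with an illustrative tile size `T`, shows only the loop restructuring that lets each tile of A and B be reused from fast (e.g. local) memory instead of being re-fetched for every output element.

```python
# Tiled matrix multiplication C = A * B (sketch, square matrices).
# The three outer loops walk over tiles; the three inner loops work
# entirely within one tile, so a tile of A and B is reused T times.

def matmul_tiled(A, B, T=2):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, T):          # tile row of C
        for jj in range(0, n, T):      # tile column of C
            for kk in range(0, n, T):  # accumulate over tiles of A and B
                for i in range(ii, min(ii + T, n)):
                    for j in range(jj, min(jj + T, n)):
                        acc = C[i][j]
                        for k in range(kk, min(kk + T, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

assert matmul_tiled([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]
```

In the generated OpenCL kernels, the tile loops become work-group and work-item indices and the per-tile data lives in local memory; composing this macro-rule with others (e.g. register blocking) yields the space of 50,000 kernel variants explored in the paper.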
- AMD. APP OpenCL programming guide.
- J. Ansel, C. Chan, Y. L. Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe. PetaBricks: A language and compiler for algorithmic choice. PLDI. ACM, 2009.
- R. Baghdadi, U. Beaugnon, A. Cohen, T. Grosser, M. Kruse, C. Reddy, S. Verdoolaege, J. Absar, S. van Haastregt, A. Kravets, et al. PENCIL: A platform-neutral compute intermediate language for accelerator programming. PACT. ACM, 2015.
- M. M. Baskaran, J. Ramanujam, and P. Sadayappan. Automatic C-to-CUDA code generation for affine programs. CC/ETAPS. Springer-Verlag, 2010.
- C. Cao, J. Dongarra, P. Du, M. Gates, P. Luszczek, and S. Tomov. clMAGMA: High performance dense linear algebra with OpenCL. IWOCL. ACM, 2014.
- C. Dubach, P. Cheng, R. Rabbah, D. F. Bacon, and S. J. Fink. Compiling a high-level language for GPUs: (via language support for architectures and compilers). PLDI. ACM, 2012.
- R. Karrenberg and S. Hack. Whole-function vectorization. CGO. IEEE, 2011.
- J. Kurzak, S. Tomov, and J. Dongarra. Autotuning GEMM kernels for the Fermi GPU. IEEE TPDS, 23(11), 2012.
- J. Lai and A. Seznec. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. CGO. IEEE, 2013.
- H. Lee, K. J. Brown, A. K. Sujeeth, T. Rompf, and K. Olukotun. Locality-aware mapping of nested parallel patterns on GPUs. MICRO. IEEE, 2014.
- R. Leißa, M. Köster, and S. Hack. A graph-based higher-order intermediate representation. CGO. IEEE, 2015.
- K. Matsumoto, N. Nakasato, and S. G. Sedukhin. Performance tuning of matrix multiplication in OpenCL on different GPUs and CPUs. SCC. IEEE, 2012.
- T. L. McDonell, M. M. Chakravarty, G. Keller, and B. Lippmeier. Optimising purely functional GPU programs. ICFP. ACM, 2013.
- A. C. McKellar and E. G. Coffman, Jr. Organizing matrices and matrix operations for paged memory systems. Commun. ACM, 12(3), 1969.
- C. Nugteren and H. Corporaal. Bones: An automatic skeleton-based C-to-CUDA compiler for GPUs. ACM TACO, 11(4), 2014.
- P. M. Phothilimthana, J. Ansel, J. Ragan-Kelley, and S. Amarasinghe. Portable performance on heterogeneous architectures. ASPLOS. ACM, 2013.
- M. Steuwer, P. Kegel, and S. Gorlatch. SkelCL - A portable skeleton library for high-level GPU programming. IPDPSW. IEEE, 2011.
- M. Steuwer, C. Fensch, S. Lindley, and C. Dubach. Generating performance portable code using rewrite rules: From high-level functional expressions to high-performance OpenCL code. ICFP. ACM, 2015.
- A. K. Sujeeth, K. J. Brown, H. Lee, T. Rompf, H. Chafi, M. Odersky, and K. Olukotun. Delite: A compiler architecture for performance-oriented embedded domain-specific languages. ACM TECS, 13(4s), 2014.
- S. Verdoolaege, J. C. Juega, A. Cohen, J. I. Gómez, C. Tenllado, and F. Catthoor. Polyhedral parallel code generation for CUDA. ACM TACO, 9(4), 2013.