CLBlast: A Tuned OpenCL BLAS Library

ABSTRACT
This work introduces CLBlast, an open-source BLAS library providing optimized OpenCL routines to accelerate dense linear algebra on a wide variety of devices. It targets machine learning and HPC applications and thus provides a fast matrix-multiplication routine (GEMM), the computational core of many workloads (e.g. deep learning, iterative solvers, astrophysics, computational fluid dynamics, quantum chemistry). CLBlast has five main advantages over other OpenCL BLAS libraries: 1) it is optimized for and tested on a large variety of OpenCL devices, including less commonly used devices such as embedded and low-power GPUs; 2) it can be explicitly tuned for specific problem sizes on specific hardware platforms; 3) it can perform operations in half-precision floating-point (FP16), saving bandwidth, time, and energy; 4) it has an optional CUDA back-end; and 5) it can combine multiple operations in a single batched routine, accelerating smaller problems significantly. This paper describes the library and demonstrates the advantages of CLBlast experimentally for different use-cases on a wide variety of OpenCL hardware.