CLBlast: A Tuned OpenCL BLAS Library

ABSTRACT
This work introduces CLBlast, an open-source BLAS library providing optimized OpenCL routines to accelerate dense linear algebra on a wide variety of devices. It targets machine learning and HPC applications and thus provides a fast matrix-multiplication routine (GEMM), the computational core of many workloads (e.g. deep learning, iterative solvers, astrophysics, computational fluid dynamics, quantum chemistry). CLBlast has five main advantages over other OpenCL BLAS libraries: 1) it is optimized for and tested on a large variety of OpenCL devices, including less commonly used devices such as embedded and low-power GPUs; 2) it can be explicitly tuned for specific problem sizes on specific hardware platforms; 3) it can perform operations in half-precision floating-point (FP16), saving bandwidth, time, and energy; 4) it has an optional CUDA back-end; and 5) it can combine multiple operations in a single batched routine, accelerating smaller problems significantly. This paper describes the library and demonstrates the advantages of CLBlast experimentally for different use-cases on a wide variety of OpenCL hardware.