DOI: 10.1145/3204919.3204924

CLBlast: A Tuned OpenCL BLAS Library

Published: 14 May 2018

ABSTRACT

This work introduces CLBlast, an open-source BLAS library providing optimized OpenCL routines to accelerate dense linear algebra for a wide variety of devices. It is targeted at machine learning and HPC applications and thus provides a fast matrix-multiplication routine (GEMM) to accelerate the core of many applications (e.g. deep learning, iterative solvers, astrophysics, computational fluid dynamics, quantum chemistry). CLBlast has five main advantages over other OpenCL BLAS libraries: 1) it is optimized for and tested on a large variety of OpenCL devices, including less commonly used devices such as embedded and low-power GPUs, 2) it can be explicitly tuned for specific problem sizes on specific hardware platforms, 3) it can perform operations in half-precision floating-point (FP16), saving bandwidth, time and energy, 4) it has an optional CUDA back-end, and 5) it can combine multiple operations in a single batched routine, significantly accelerating smaller problems. This paper describes the library and demonstrates the advantages of CLBlast experimentally for different use cases on a wide variety of OpenCL hardware.
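
As an illustration of how the library is used from host code, below is a minimal sketch of a single-precision GEMM call through CLBlast's C API (clblast_c.h). The matrix sizes, fill values and the choice of the first available OpenCL device are arbitrary examples, error handling is abbreviated, and the exact enum and parameter names follow the public header and may differ between library versions; the sketch is an illustration, not an excerpt from the paper.

```c
/* Sketch: single-precision GEMM (C = alpha*A*B + beta*C) via CLBlast's C API.
 * Assumes clblast_c.h and a standard OpenCL platform; error checks omitted. */
#include <clblast_c.h>
#include <CL/cl.h>
#include <stdio.h>

int main(void) {
  const size_t m = 128, n = 64, k = 256;
  static float a[128 * 256], b[256 * 64], c[128 * 64];   /* row-major host matrices */
  for (size_t i = 0; i < m * k; ++i) a[i] = 1.0f;
  for (size_t i = 0; i < k * n; ++i) b[i] = 2.0f;

  /* Standard OpenCL boilerplate: first device of the first platform, one queue */
  cl_platform_id platform; clGetPlatformIDs(1, &platform, NULL);
  cl_device_id device; clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
  cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
  cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

  /* Device buffers for A, B and C */
  cl_mem a_mem = clCreateBuffer(context, CL_MEM_READ_WRITE, m * k * sizeof(float), NULL, NULL);
  cl_mem b_mem = clCreateBuffer(context, CL_MEM_READ_WRITE, k * n * sizeof(float), NULL, NULL);
  cl_mem c_mem = clCreateBuffer(context, CL_MEM_READ_WRITE, m * n * sizeof(float), NULL, NULL);
  clEnqueueWriteBuffer(queue, a_mem, CL_TRUE, 0, m * k * sizeof(float), a, 0, NULL, NULL);
  clEnqueueWriteBuffer(queue, b_mem, CL_TRUE, 0, k * n * sizeof(float), b, 0, NULL, NULL);

  /* The GEMM call: leading dimensions are k, n and n for row-major A, B and C */
  cl_event event = NULL;
  CLBlastStatusCode status = CLBlastSgemm(CLBlastLayoutRowMajor,
                                          CLBlastTransposeNo, CLBlastTransposeNo,
                                          m, n, k,
                                          1.0f, a_mem, 0, k, b_mem, 0, n,
                                          0.0f, c_mem, 0, n,
                                          &queue, &event);
  if (status == CLBlastSuccess) {
    clWaitForEvents(1, &event);
    clEnqueueReadBuffer(queue, c_mem, CL_TRUE, 0, m * n * sizeof(float), c, 0, NULL, NULL);
    printf("c[0] = %.1f\n", c[0]);   /* expect 512.0: k * 1.0 * 2.0 */
    clReleaseEvent(event);
  }

  clReleaseMemObject(a_mem); clReleaseMemObject(b_mem); clReleaseMemObject(c_mem);
  clReleaseCommandQueue(queue); clReleaseContext(context);
  return 0;
}
```

Note that this is a plain BLAS-style call: the per-device and per-problem-size tuning described above happens inside the library and does not change the API.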


Published in

IWOCL '18: Proceedings of the International Workshop on OpenCL
May 2018, 108 pages
ISBN: 9781450364393
DOI: 10.1145/3204919
Copyright © 2018 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States


            Acceptance Rates

IWOCL '18 paper acceptance rate: 16 of 33 submissions (48%). Overall acceptance rate: 84 of 152 submissions (55%).
