DOI: 10.1145/3456669.3456694
research-article

Toward Performance Portability of Highly Parametrizable TRSM Algorithm Using SYCL

Published: 27 April 2021

ABSTRACT

Introduced in 1979, BLAS remains the de facto standard for low-level linear algebra routines. It provides essential routines used in domains such as numerical and scientific computing, weather simulation, computational fluid dynamics, and machine learning, and it has been adopted on a broad range of hardware, from HPC systems to embedded systems and AI-specialized accelerators.

BLAS routines were originally implemented for CPUs; with the emergence of GPGPU, they had to be rewritten to exploit the extensive computational power these devices provide. Machine learning is rapidly changing this landscape again by incentivizing the development of specialized hardware that can perform certain operations more efficiently. A wide range of hardware is now available, and devices differ in memory hierarchy, cache line size, the memory access patterns required for performance, number of registers, and type of memory connection. Achieving performance portability of BLAS routines across these platforms, while avoiding rewrites of existing code, is a major challenge in heterogeneous programming.

Written in SYCL, SYCL-BLAS is an open-source BLAS library that provides performance portability across various SYCL-enabled platforms. This paper presents the implementation of a parametric tile-based TRSM routine for SYCL-BLAS, using a formulation that leverages the highly optimized GEMM routine already provided in SYCL-BLAS. Our results show that, by tuning the tile size per device without reimplementing the kernel, we can achieve up to 2.6x speedup on Intel GPU, 7x on AMD GPU, and up to 3.4x speedup on ARM GPU compared with the highly optimized CLBlast and clBLAS libraries.
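The abstract does not spell out the formulation, so the following is only a minimal sketch of one common way a tiled TRSM can hand most of its work to GEMM. It solves L X = B for a lower-triangular L. Each diagonal tile is solved by forward substitution, and the remaining rows are then updated with a matrix multiply. The function names (`gemm_sub`, `trsm_small`, `blocked_trsm`) and the `tile` parameter are hypothetical, chosen for illustration; this is plain Python, not the SYCL-BLAS kernel or its API.

```python
def gemm_sub(C, A, B):
    """In-place C -= A @ B on lists of rows (stand-in for a tuned GEMM)."""
    for i in range(len(A)):
        for j in range(len(B[0])):
            C[i][j] -= sum(A[i][p] * B[p][j] for p in range(len(B)))


def trsm_small(L, B):
    """Solve L X = B in place by forward substitution (L is one small tile)."""
    for i in range(len(L)):
        for j in range(len(B[0])):
            partial = sum(L[i][p] * B[p][j] for p in range(i))
            B[i][j] = (B[i][j] - partial) / L[i][i]


def blocked_trsm(L, B, tile):
    """Return X with L X = B, where L is lower triangular.

    `tile` is the tunable tile size (hypothetical parameter name).
    """
    n = len(L)
    X = [row[:] for row in B]  # work on a copy of the right-hand side
    for k in range(0, n, tile):
        k_end = min(k + tile, n)
        # Step 1: small triangular solve on the diagonal tile.
        L_kk = [row[k:k_end] for row in L[k:k_end]]
        X_k = X[k:k_end]  # shares row objects with X, so updates land in X
        trsm_small(L_kk, X_k)
        # Step 2: GEMM update of all rows below the tile.
        if k_end < n:
            L_ik = [row[k:k_end] for row in L[k_end:]]
            gemm_sub(X[k_end:], L_ik, X_k)
    return X
```

In this sketch only the tile size changes between runs; the arithmetic is identical, which is the sense in which a tile parameter can be tuned per device without rewriting the routine. With larger matrices, most of the floating-point work falls in the `gemm_sub` step.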

Published in

IWOCL '21: Proceedings of the 9th International Workshop on OpenCL
April 2021
112 pages
ISBN: 9781450390330
DOI: 10.1145/3456669

Copyright © 2021 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

Overall Acceptance Rate: 84 of 152 submissions, 55%
