ABSTRACT
Presented in 1979, BLAS remains, to this day, the de-facto standard for low-level linear algebra routines. BLAS provides essential linear algebra routines used in domains such as numerical and scientific computing, weather simulation, computational fluid dynamics, and machine learning, and it has been adopted across a broad range of hardware, from HPC systems to embedded devices and specialized AI accelerators.
While BLAS routines were originally implemented for CPUs, the emergence of GPGPU computing required them to be rewritten to exploit the extensive computational power these devices provide. Machine learning is rapidly changing this landscape again by incentivizing the development of specialized hardware that can perform certain operations more efficiently. With a wide range of hardware available, each with its own memory hierarchy, cache line size, number of registers, type of memory connection, and the memory access patterns required for performance, achieving performance portability of BLAS routines across platforms while avoiding rewrites of existing code is a major challenge of the heterogeneous programming world.
Written in SYCL, SYCL-BLAS is an open-source BLAS library that provides performance portability across various SYCL-enabled platforms. This paper presents the implementation of a parametric tile-based TRSM routine for SYCL-BLAS, employing a formulation that leverages the highly optimized GEMM routine already provided in SYCL-BLAS. Our results show that, by tuning the tile size per device without reimplementing the kernel, we can achieve speedups of up to 2.6x on an Intel GPU, 7x on an AMD GPU, and up to 3.4x on an ARM GPU compared with the highly optimized CLBlast and clBLAS libraries.
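The idea of a GEMM-based, tile-parametrized TRSM can be illustrated with a small sketch in plain Python. This is not the SYCL-BLAS implementation (which is written in SYCL C++ and dispatches to a tuned GEMM kernel); the function names and the tile-size parameter `nb` below are illustrative assumptions only. The sketch solves L X = B for a lower-triangular L: each off-diagonal update is expressed as a matrix multiplication, so that most of the work falls into GEMM-shaped operations whose tile size can be tuned per device.

```python
def matmul_update(C, A, Bb):
    """C -= A @ Bb, done naively here; in the library this role is
    played by the tuned GEMM kernel. C is mutated in place."""
    for i in range(len(C)):
        for j in range(len(C[0])):
            acc = 0.0
            for k in range(len(A[0])):
                acc += A[i][k] * Bb[k][j]
            C[i][j] -= acc

def trsm_blocked(L, B, nb):
    """Solve L X = B for X, with L lower triangular, using tiles of
    size nb. Matrices are lists of row lists; B is not modified."""
    n, m = len(L), len(B[0])
    X = [row[:] for row in B]          # work on a copy of B
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        if k0 > 0:
            # GEMM update: X[k0:k1] -= L[k0:k1, 0:k0] @ X[0:k0]
            A = [row[0:k0] for row in L[k0:k1]]
            matmul_update([X[i] for i in range(k0, k1)], A, X[0:k0])
        # Small triangular solve on the diagonal block
        # (plain forward substitution within the tile).
        for i in range(k0, k1):
            for j in range(m):
                s = X[i][j]
                for t in range(k0, i):
                    s -= L[i][t] * X[t][j]
                X[i][j] = s / L[i][i]
    return X
```

Because the bulk of the arithmetic lives in the `matmul_update` calls, tuning in this formulation reduces to picking `nb` (and the GEMM kernel's own tile parameters) per device, rather than rewriting the solver itself.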
Toward Performance Portability of Highly Parametrizable TRSM Algorithm Using SYCL