ABSTRACT
Presented in 1979, BLAS remains, to this day, the de-facto standard for low-level linear algebra routines. BLAS provides essential linear algebra routines used in domains such as numerical and scientific computing, weather simulation, computational fluid dynamics, and machine learning, and it has been adopted across a broad range of hardware, from HPC systems to embedded devices and specialized AI accelerators.
While BLAS routines were originally implemented for CPUs, the emergence of GPGPU computing required them to be rewritten to exploit the extensive computational power these devices provide. Machine learning is rapidly changing this landscape again by incentivizing the development of specialized hardware that can perform certain operations more efficiently. With a wide range of hardware available, each with its own memory hierarchy, cache line size, number of registers, type of memory connection, and the memory access patterns required for performance, achieving performance portability of BLAS routines across platforms while avoiding rewrites of existing code is a major challenge of the heterogeneous programming world.
Written in SYCL, SYCL-BLAS is an open-source BLAS library that provides performance portability across various SYCL-enabled platforms. This paper presents the implementation of a parametric tile-based TRSM routine for SYCL-BLAS, employing a formulation that leverages the highly optimized GEMM routine already provided in SYCL-BLAS. Our results show that, by tuning the tile size per device without reimplementing the kernel, we can achieve speedups of up to 2.6x on an Intel GPU, 7x on an AMD GPU, and up to 3.4x on an ARM GPU compared with the highly optimized CLBlast and clBLAS libraries.
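The idea of a GEMM-based, tile-parametrized TRSM can be illustrated with a small sketch in plain Python. This is not the SYCL-BLAS implementation (which is written in SYCL C++ and dispatches to a tuned GEMM kernel); the function names and the tile-size parameter `nb` below are illustrative assumptions only. The sketch solves L X = B for a lower-triangular L: each off-diagonal update is expressed as a matrix multiplication, so that most of the work falls into GEMM-shaped operations whose tile size can be tuned per device.

```python
def matmul_update(C, A, Bb):
    """C -= A @ Bb, done naively here; in the library this role is
    played by the tuned GEMM kernel. C is mutated in place."""
    for i in range(len(C)):
        for j in range(len(C[0])):
            acc = 0.0
            for k in range(len(A[0])):
                acc += A[i][k] * Bb[k][j]
            C[i][j] -= acc

def trsm_blocked(L, B, nb):
    """Solve L X = B for X, with L lower triangular, using tiles of
    size nb. Matrices are lists of row lists; B is not modified."""
    n, m = len(L), len(B[0])
    X = [row[:] for row in B]          # work on a copy of B
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        if k0 > 0:
            # GEMM update: X[k0:k1] -= L[k0:k1, 0:k0] @ X[0:k0]
            A = [row[0:k0] for row in L[k0:k1]]
            matmul_update([X[i] for i in range(k0, k1)], A, X[0:k0])
        # Small triangular solve on the diagonal block
        # (plain forward substitution within the tile).
        for i in range(k0, k1):
            for j in range(m):
                s = X[i][j]
                for t in range(k0, i):
                    s -= L[i][t] * X[t][j]
                X[i][j] = s / L[i][i]
    return X
```

Because the bulk of the arithmetic lives in the `matmul_update` calls, tuning in this formulation reduces to picking `nb` (and the GEMM kernel's own tile parameters) per device, rather than rewriting the solver itself.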
Toward Performance Portability of Highly Parametrizable TRSM Algorithm Using SYCL