ABSTRACT
In this paper, we evaluate the portability of the SYCL programming model on some of the latest CPUs and GPUs from a wide range of vendors, using the two main compilers: DPC++ and hipSYCL/OpenSYCL. Both compilers currently support GPUs from all three major vendors; we evaluate performance on the Intel(R) Data Center GPU Max 1100, the NVIDIA A100 GPU, and the AMD MI250X GPU. CPU support is currently less established: DPC++ supports only x86 CPUs, through OpenCL, whereas OpenSYCL has an OpenMP backend capable of targeting all modern CPUs. We benchmark the Intel Xeon Platinum 8360Y Processor (Ice Lake), the AMD EPYC 9V33X (Genoa-X), and the Ampere Altra platforms. We study a range of primarily bandwidth-bound applications implemented using the OPS and OP2 DSLs, evaluate different formulations in SYCL, and contrast their performance with “native” programming approaches where available (CUDA/HIP/OpenMP). On GPU architectures, SYCL on average even slightly outperforms native approaches, while on CPUs it falls behind, highlighting a continued need for improving CPU performance. While SYCL does not solve all the challenges of performance portability (e.g. needing different algorithms on different hardware), it does provide a single programming model and ecosystem to target most current HPC architectures productively.
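To make the term "bandwidth-bound" concrete, the following is a minimal sketch in plain C++ of a STREAM-style triad kernel and of how an effective bandwidth figure can be derived from its runtime. It is an illustration only and is not taken from the paper's OPS/OP2 applications; the function names `triad`, `triad_bytes`, and `effective_bandwidth_gbs` are hypothetical, and the byte count assumes exactly two loads and one store per element (no write-allocate traffic).

```cpp
#include <cstddef>
#include <vector>

// Triad kernel: a[i] = b[i] + scalar * c[i].
// Each element needs two loads, one store, and only two floating-point
// operations, so on most hardware the runtime is set by memory traffic.
// Assumes a, b, and c all have the same length.
void triad(std::vector<double>& a, const std::vector<double>& b,
           const std::vector<double>& c, double scalar) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = b[i] + scalar * c[i];
    }
}

// Bytes moved by one triad pass over n elements.
// Assumption: 3 arrays touched once each, write-allocate traffic ignored.
double triad_bytes(std::size_t n) {
    return 3.0 * static_cast<double>(n) * sizeof(double);
}

// Effective bandwidth in GB/s for a pass over n elements that took
// `seconds` of wall-clock time (measured by the caller).
double effective_bandwidth_gbs(std::size_t n, double seconds) {
    return triad_bytes(n) / seconds / 1.0e9;
}
```

Comparing a figure obtained this way against a platform's peak memory bandwidth is one common way to judge how close a bandwidth-bound kernel runs to the hardware limit, independent of the programming model used to write it.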
Index Terms
- Evaluating the performance portability of SYCL across CPUs and GPUs on bandwidth-bound applications