
2020 | OriginalPaper | Chapter

Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices

Authors: Fazlay Rabbi, Christopher S. Daley, Hasan Metin Aktulga, Nicholas J. Wright

Published in: Accelerator Programming Using Directives

Publisher: Springer International Publishing


Abstract

Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive-based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem to small kernels (e.g., SpMM, SpMV). Our LOBPCG GPU implementation achieves a 2.8×–4.3× speedup over an optimized CPU implementation when tested with four different input matrices. The evaluated configuration compared one Skylake CPU against one Skylake CPU paired with one NVIDIA V100 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations gave nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels in LOBPCG (the inner product and SpMM kernels) and then evaluate their performance under two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves a 2.9× and 48.2× speedup over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU-to-GPU interconnects, respectively.
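To make the tiling approach concrete, the following is a minimal sketch, not the paper's kernel: a C/OpenMP target-offload SpMM (Y = A·X) that streams one row tile of a CSR-format sparse matrix to the GPU at a time, so the device working set stays bounded while the dense block vectors X remain resident on the device. All names (spmm_tiled, tile_rows, rowptr, colidx, val) and the CSR/row-major layout are assumptions made for illustration.

/* Hedged sketch of out-of-core tiled SpMM: Y = A * X with A in CSR format,
 * streamed to the GPU one row tile at a time. Not the paper's code. */
#include <stddef.h>

void spmm_tiled(int nrows, int ncols, int nvec, int tile_rows,
                const int *rowptr, const int *colidx, const double *val,
                const double *X,   /* ncols x nvec block vectors, row-major */
                double *Y)         /* nrows x nvec result, row-major */
{
    /* Keep the dense block vectors X on the device for the whole sweep. */
    #pragma omp target data map(to: X[0:(size_t)ncols*nvec])
    for (int r0 = 0; r0 < nrows; r0 += tile_rows) {
        int r1 = (r0 + tile_rows < nrows) ? r0 + tile_rows : nrows;
        size_t nz0 = rowptr[r0], nz1 = rowptr[r1];

        /* Copy only this tile's CSR arrays and output rows to the GPU. */
        #pragma omp target teams distribute parallel for \
            map(to: rowptr[r0:r1 - r0 + 1], colidx[nz0:nz1 - nz0], val[nz0:nz1 - nz0]) \
            map(from: Y[(size_t)r0*nvec:(size_t)(r1 - r0)*nvec])
        for (int i = r0; i < r1; ++i) {
            for (int k = 0; k < nvec; ++k) {
                double sum = 0.0;
                for (size_t j = rowptr[i]; j < rowptr[i + 1]; ++j)
                    sum += val[j] * X[(size_t)colidx[j] * nvec + k];
                Y[(size_t)i * nvec + k] = sum;
            }
        }
    }
}

With this structure, only the current tile's CSR arrays and the corresponding rows of Y occupy device memory at any moment, which is the essential idea behind tiling a problem that exceeds GPU memory capacity; the Unified Memory alternative would instead allocate A, X, and Y with cudaMallocManaged and let the driver migrate pages on demand.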


Footnotes
1. Alternatively, we could have copied the data to the device using OpenMP/OpenACC and then passed the device pointer to the CUDA library functions using OpenMP’s use_device_ptr clause or OpenACC’s use_device clause. We did not use this approach because we wanted the option to use cudaMallocManaged to allocate data in managed memory.
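As an illustration of the alternative path described in this footnote (and explicitly not the approach taken in the paper), the sketch below maps the operands to the device with OpenMP and then hands the device pointers to a CUDA library routine through use_device_ptr. The helper name dense_gemm_omp_interop and the choice of cublasDgemm are assumptions made only for this example.

/* Hedged sketch: OpenMP maps the arrays to the GPU, and use_device_ptr exposes
 * the device addresses so a CUDA library routine can operate on them directly. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Computes C = A * B for n-by-n column-major matrices (illustrative helper). */
void dense_gemm_omp_interop(int n, double *A, double *B, double *C)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;

    #pragma omp target data map(to: A[0:(size_t)n*n], B[0:(size_t)n*n]) \
                            map(tofrom: C[0:(size_t)n*n])
    {
        /* Inside this region, A, B, and C hold device addresses. */
        #pragma omp target data use_device_ptr(A, B, C)
        {
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                        &alpha, A, n, B, n, &beta, C, n);
            cudaDeviceSynchronize();  /* ensure the GEMM finishes before C is copied back */
        }
    }
    cublasDestroy(handle);
}

By contrast, allocating the arrays with cudaMallocManaged, as the footnote mentions, would typically remove the need for explicit map clauses, at the cost of relying on demand paging between host and device.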
 
Metadata
Title
Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices
Authors
Fazlay Rabbi
Christopher S. Daley
Hasan Metin Aktulga
Nicholas J. Wright
Copyright Year
2020
DOI
https://doi.org/10.1007/978-3-030-49943-3_4
