Published in: The Journal of Supercomputing 12/2019

05.09.2019

Implementation and performance evaluation of a communication-avoiding GMRES method for stencil-based code on GPU cluster

Authors: Kazuya Matsumoto, Yasuhiro Idomura, Takuya Ina, Akie Mayumi, Susumu Yamada



Abstract

In this study, a communication-avoiding generalized minimum residual method (CA-GMRES) is implemented on a hybrid CPU–GPU cluster to accelerate the iterative linear system solver in the gyrokinetic toroidal five-dimensional Eulerian code (GT5D). In the GT5D, the sparse matrix–vector multiplication (SpMV) is performed as a 17-point stencil-based computation. The SpMV is the only GT5D-specific part; the other parts are also usable in other application codes. In addition to the CA-GMRES, we implement and evaluate a modified variant of the CA-GMRES (M-CA-GMRES), proposed in the previous study by Idomura et al. (in: Proceedings of the 8th workshop on latest advances in scalable algorithms for large-scale systems (ScalA '17), 2017. https://doi.org/10.1145/3148226.3148234), which reduces the amount of floating-point calculations. This study demonstrates that the beneficial features of the CA-GMRES are its minimum number of collective communications and its highly efficient calculations based on dense matrix–matrix operations. The performance evaluation is conducted on the Reedbush-L GPU cluster, which contains four NVIDIA Tesla P100 (Pascal GP100) GPUs per compute node. The evaluation results show that the M-CA-GMRES and CA-GMRES for the GT5D are advantageous over the GMRES and the generalized conjugate residual method (GCR) on GPU clusters, especially when the problem size (vector length) is large so that the cost of the SpMV is less dominant. The M-CA-GMRES is 1.09×, 1.22× and 1.50× faster than the CA-GMRES, GCR and GMRES, respectively, when 64 GPUs are used.
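The abstract notes that the GT5D's SpMV is a matrix-free 17-point stencil computation rather than an explicit sparse matrix. The actual GT5D stencil shape and coefficients are not reproduced here; as a minimal sketch of the idea, the following applies an assumed 1D 3-point stencil matrix-free, which is the same pattern the 17-point stencil follows on the 5D grid.

```python
def stencil_spmv(v, a=-2.0, b=1.0):
    # Matrix-free SpMV for a 1D 3-point stencil with zero boundaries:
    # (A v)[i] = b*v[i-1] + a*v[i] + b*v[i+1]; out-of-range neighbors are 0.
    # The GT5D applies the same principle with a 17-point stencil on its
    # five-dimensional grid; shape and coefficients here are illustrative
    # assumptions, not the GT5D's actual operator.
    n = len(v)
    out = [0.0] * n
    for i in range(n):
        s = a * v[i]
        if i > 0:
            s += b * v[i - 1]
        if i < n - 1:
            s += b * v[i + 1]
        out[i] = s
    return out
```

Because no matrix is stored, the operator's memory traffic is just the vector reads and writes, which is why the stencil formulation matters for the SpMV cost discussed in the evaluation.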


Footnotes
3
Actually, this algorithm is a truncated version of the GCR known as the \(\hbox {ORTHOMIN}(k=1)\) method [9], not the standard GCR [6, 22].
 
5
The average time is calculated as the measured time divided by the step size \(s\).
 
6
The amount of internode communication per node is \(8n_{y}n_{z}n_{v}\cdot 2\cdot 8\) bytes and \((8n_{y}\,+ 1n_x)n_{z}n_{v}\cdot 2\cdot 8\) bytes in the cases with \(p=16\) and \(p=64\), respectively.
 
7
\(f_{i3,j,k+2,l}\), \(f_{i4,j,k+2,l}\), \(f_{i1,j,k,l}\), \(f_{i2,j,k,l}\), \(f_{i5,j,k,l}\), and \(f_{i6,j,k,l}\) in the loop of Algorithm 6.
 
8
The Cholesky factorization is redundantly computed on all MPI processes, although each computation is identical. We have additionally evaluated a CholQR implementation that first gathers all the local products to a single process, conducts the Cholesky factorization on that process, and broadcasts the Cholesky factor \(\varvec{R}\) to all processes; the implementation with gather and broadcast is slightly slower than that with allreduce, due to the latency increase associated with the two collective communication calls.
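The CholQR orthogonalization discussed in this footnote can be sketched as follows. This is a minimal single-process version in plain Python (function name and test matrix are illustrative); in the parallel variant described above, the Gram matrix would be the allreduce sum of per-process local products before the Cholesky step.

```python
def cholqr(V):
    # CholQR: factor a tall-skinny V (m rows, c columns, m >> c) as V = Q R.
    m, c = len(V), len(V[0])
    # Gram matrix B = V^T V. In the distributed version, each process forms
    # this product on its local rows and the results are summed with a
    # single MPI_Allreduce, giving one collective per orthogonalization.
    B = [[sum(V[i][a] * V[i][b] for i in range(m)) for b in range(c)]
         for a in range(c)]
    # Cholesky factorization B = R^T R, R upper triangular
    # (computed redundantly on every process in the parallel version).
    R = [[0.0] * c for _ in range(c)]
    for j in range(c):
        R[j][j] = (B[j][j] - sum(R[k][j] ** 2 for k in range(j))) ** 0.5
        for b in range(j + 1, c):
            R[j][b] = (B[j][b]
                       - sum(R[k][j] * R[k][b] for k in range(j))) / R[j][j]
    # Q = V R^{-1}: forward-substitute each row of V against R.
    Q = []
    for i in range(m):
        q = [0.0] * c
        for j in range(c):
            q[j] = (V[i][j]
                    - sum(q[k] * R[k][j] for k in range(j))) / R[j][j]
        Q.append(q)
    return Q, R
```

The triangular solve for \(Q\) is embarrassingly parallel over rows, which is why the heavy work maps onto dense matrix–matrix operations on the GPU.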
 
9
The batch count used is 1024, i.e., each sub-calculation multiplies a transposed \((n/1024)\text{-by-}c\) sub-matrix by a \(c\text{-by-}c\) matrix. We have also tested other batch counts (512 and 2048); the performance differences among them are small.
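The row-block batching described in this footnote can be illustrated with a minimal sketch (the transposed memory layout is omitted; function name, shapes, and batch count are illustrative assumptions, not the paper's implementation). Each batch entry is an independent small GEMM on a contiguous row block, which is how a batched GEMM routine exposes parallelism on the GPU; the concatenated result equals the unbatched product.

```python
def batched_matmul(V, W, batch_count):
    # Compute the n-by-c times c-by-c product V @ W by splitting V into
    # batch_count contiguous (n/batch_count)-by-c row blocks and multiplying
    # each block independently (illustrative stand-in for a GPU batched
    # GEMM, where each block would be one batch entry).
    n, c = len(V), len(W)
    step = n // batch_count  # assumes batch_count divides n evenly
    out = []
    for bi in range(batch_count):
        for row in V[bi * step:(bi + 1) * step]:
            out.append([sum(row[k] * W[k][j] for k in range(c))
                        for j in range(c)])
    return out
```

Since the blocks are disjoint, the batched result is bit-identical to the single large product, so the batch count is purely a performance knob, consistent with the small differences reported above.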
 
10
The implicit solver of the GT5D is well-conditioned in terms of the convergence property. For ill-conditioned solvers, the use of a more stable basis conversion (such as the Newton basis [12]) or a more stable TSQR algorithm (such as SVQR [25] or CAQR [8]) is probably required.
 
11
If a larger number of GPUs (i.e., \(p>64\)) is utilized, the allreduce cost becomes more dominant in the total solution time, and the speedup ratio of the M-CA-GMRES over the GMRES or GCR is possibly higher.
 
References
1.
Abdelfattah A, Haidar A, Tomov S, Dongarra J (2016) Performance, design, and autotuning of batched GEMM for GPUs. In: Proceedings of the ISC High Performance Computing 2016, LNCS, vol 9697. Springer, pp 21–38
4.
Carson E (2015) Communication-avoiding Krylov subspace methods in theory and practice. PhD dissertation, University of California, Berkeley
6.
Concus P, Golub GH (1976) A generalized conjugate gradient method for nonsymmetric systems of linear equations. In: Computing Methods in Applied Sciences and Engineering, Lecture Notes in Economics and Mathematical Systems, vol 134. Springer, pp 56–65. https://doi.org/10.1007/978-3-642-85972-4_4
10.
Fujita N, Nuga H, Boku T, Idomura Y (2013) Nuclear fusion simulation code optimization on GPU clusters. In: Proceedings of the 19th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2013). IEEE, pp 1266–1274. https://doi.org/10.1109/ICPADS.2013.65
11.
Golub GH, Van Loan CF (2013) Matrix computations, 4th edn. The Johns Hopkins University Press, Baltimore
12.
Hoemmen M (2010) Communication-avoiding Krylov subspace methods. PhD dissertation, University of California, Berkeley
14.
Idomura Y, Ina T, Mayumi A, Yamada S, Matsumoto K, Asahi Y, Imamura T (2017) Application of a communication-avoiding generalized minimal residual method to a gyrokinetic five dimensional Eulerian code on many core platforms. In: Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '17), p 7. https://doi.org/10.1145/3148226.3148234
15.
Idomura Y, Nakata M, Yamada S, Machida M, Imamura T, Watanabe T, Nunami M, Inoue H, Tsutsumi S, Miyoshi I, Shida N (2014) Communication-overlap techniques for improved strong scaling of gyrokinetic Eulerian code beyond 100k cores on the K-computer. Int J High Perform Comput Appl 28(1):73–86. https://doi.org/10.1177/1094342013490973
21.
Rosendale JV (1983) Minimizing inner product data dependencies in conjugate gradient iteration. Technical Report NASA-CR-17, NASA
22.
Saad Y (2003) Iterative methods for sparse linear systems, 2nd edn. SIAM, Philadelphia
24.
Shimokawabe T, Aoki T, Muroi C, Ishida J, Kawano K, Endo T, Nukada A, Maruyama N, Matsuoka S (2010) An 80-fold speedup, 15.0 TFlops GPU acceleration of non-hydrostatic weather model ASUCA production code. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2010). IEEE. https://doi.org/10.1109/SC.2010.9
28.
Williams SW (2011) The roofline model. In: Bailey DH, Lucas RF, Williams SW (eds) Performance tuning of scientific applications, chapter 9. CRC Press, Boca Raton, pp 195–215
30.
Yamazaki I, Hoemmen M, Luszczek P, Dongarra J (2017) Improving performance of GMRES by reducing communication and pipelining global collectives. In: Proceedings of the 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops (IPDPSW 2017). IEEE, pp 1118–1127. https://doi.org/10.1109/IPDPSW.2017.65
Metadata
Title
Implementation and performance evaluation of a communication-avoiding GMRES method for stencil-based code on GPU cluster
Authors
Kazuya Matsumoto
Yasuhiro Idomura
Takuya Ina
Akie Mayumi
Susumu Yamada
Publication date
05.09.2019
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 12/2019
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-019-02983-7
