Published in: The Journal of Supercomputing 12/2019

05.09.2019

Implementation and performance evaluation of a communication-avoiding GMRES method for stencil-based code on GPU cluster

Authors: Kazuya Matsumoto, Yasuhiro Idomura, Takuya Ina, Akie Mayumi, Susumu Yamada



Abstract

In this study, a communication-avoiding generalized minimum residual method (CA-GMRES) is implemented on a hybrid CPU–GPU cluster to accelerate the iterative linear system solver in the gyrokinetic toroidal five-dimensional Eulerian code (GT5D). In the GT5D, the sparse matrix–vector multiplication (SpMV) is performed as a 17-point stencil-based computation. The SpMV is the only GT5D-specific part; the other parts are also usable in other application codes. In addition to the CA-GMRES, we implement and evaluate a modified variant of the CA-GMRES (M-CA-GMRES), proposed in the previous study by Idomura et al. (in: Proceedings of the 8th workshop on latest advances in scalable algorithms for large-scale systems (ScalA '17), 2017. https://doi.org/10.1145/3148226.3148234), which reduces the amount of floating-point calculations. This study demonstrates that the beneficial features of the CA-GMRES are its minimum number of collective communications and its highly efficient calculations based on dense matrix–matrix operations. The performance evaluation is conducted on the Reedbush-L GPU cluster, which contains four NVIDIA Tesla P100 (Pascal GP100) GPUs per compute node. The evaluation results show that the M-CA-GMRES and CA-GMRES for the GT5D are advantageous over the GMRES and the generalized conjugate residual method (GCR) on GPU clusters, especially when the problem size (vector length) is large so that the cost of the SpMV is less dominant. The M-CA-GMRES is 1.09×, 1.22× and 1.50× faster than the CA-GMRES, GCR and GMRES, respectively, when 64 GPUs are used.
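The abstract notes that the GT5D's SpMV is a matrix-free 17-point stencil computation rather than an explicit sparse matrix. The actual GT5D stencil shape and coefficients are not reproduced here; as a minimal sketch of the idea, the following applies an assumed 1D 3-point stencil matrix-free, which is the same pattern the 17-point stencil follows on the 5D grid.

```python
def stencil_spmv(v, a=-2.0, b=1.0):
    # Matrix-free SpMV for a 1D 3-point stencil with zero boundaries:
    # (A v)[i] = b*v[i-1] + a*v[i] + b*v[i+1]; out-of-range neighbors are 0.
    # The GT5D applies the same principle with a 17-point stencil on its
    # five-dimensional grid; shape and coefficients here are illustrative
    # assumptions, not the GT5D's actual operator.
    n = len(v)
    out = [0.0] * n
    for i in range(n):
        s = a * v[i]
        if i > 0:
            s += b * v[i - 1]
        if i < n - 1:
            s += b * v[i + 1]
        out[i] = s
    return out
```

Because no matrix is stored, the operator's memory traffic is just the vector reads and writes, which is why the stencil formulation matters for the SpMV cost discussed in the evaluation.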


Footnotes
3
Actually, this algorithm is a truncated version of the GCR known as the \(\hbox {ORTHOMIN}(k=1)\) method [9], not the standard GCR [6, 22].
 
5
The average time is calculated as the measured time divided by the step size \(s\).
 
6
The amount of internode communication per node is \(8n_{y}n_{z}n_{v}\cdot 2\cdot 8\) bytes and \((8n_{y}\,+ 1n_x)n_{z}n_{v}\cdot 2\cdot 8\) bytes in the cases with \(p=16\) and \(p=64\), respectively.
 
7
\(f_{i3,j,k+2,l}\), \(f_{i4,j,k+2,l}\), \(f_{i1,j,k,l}\), \(f_{i2,j,k,l}\), \(f_{i5,j,k,l}\), and \(f_{i6,j,k,l}\) in the loop of Algorithm 6.
 
8
The Cholesky factorization is redundantly computed on all MPI processes, although each computation is identical. We have additionally evaluated a CholQR implementation that first gathers all the local products to a single process, conducts the Cholesky factorization on that process, and broadcasts the Cholesky factor \(\varvec{R}\) to all processes; the implementation with gather and broadcast is slightly slower than that with allreduce, due to the latency increase associated with the two collective communication calls.
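The CholQR orthogonalization discussed in this footnote can be sketched as follows. This is a minimal single-process version in plain Python (function name and test matrix are illustrative); in the parallel variant described above, the Gram matrix would be the allreduce sum of per-process local products before the Cholesky step.

```python
def cholqr(V):
    # CholQR: factor a tall-skinny V (m rows, c columns, m >> c) as V = Q R.
    m, c = len(V), len(V[0])
    # Gram matrix B = V^T V. In the distributed version, each process forms
    # this product on its local rows and the results are summed with a
    # single MPI_Allreduce, giving one collective per orthogonalization.
    B = [[sum(V[i][a] * V[i][b] for i in range(m)) for b in range(c)]
         for a in range(c)]
    # Cholesky factorization B = R^T R, R upper triangular
    # (computed redundantly on every process in the parallel version).
    R = [[0.0] * c for _ in range(c)]
    for j in range(c):
        R[j][j] = (B[j][j] - sum(R[k][j] ** 2 for k in range(j))) ** 0.5
        for b in range(j + 1, c):
            R[j][b] = (B[j][b]
                       - sum(R[k][j] * R[k][b] for k in range(j))) / R[j][j]
    # Q = V R^{-1}: forward-substitute each row of V against R.
    Q = []
    for i in range(m):
        q = [0.0] * c
        for j in range(c):
            q[j] = (V[i][j]
                    - sum(q[k] * R[k][j] for k in range(j))) / R[j][j]
        Q.append(q)
    return Q, R
```

The triangular solve for \(Q\) is embarrassingly parallel over rows, which is why the heavy work maps onto dense matrix–matrix operations on the GPU.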
 
9
The batch count used is 1024, i.e., each sub-calculation multiplies a transposed \((n/1024)\text{-by-}c\) sub-matrix by a \(c\text{-by-}c\) matrix. We have also tested other batch counts (512 and 2048); the performance differences among them are small.
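The row-block batching described in this footnote can be illustrated with a minimal sketch (the transposed memory layout is omitted; function name, shapes, and batch count are illustrative assumptions, not the paper's implementation). Each batch entry is an independent small GEMM on a contiguous row block, which is how a batched GEMM routine exposes parallelism on the GPU; the concatenated result equals the unbatched product.

```python
def batched_matmul(V, W, batch_count):
    # Compute the n-by-c times c-by-c product V @ W by splitting V into
    # batch_count contiguous (n/batch_count)-by-c row blocks and multiplying
    # each block independently (illustrative stand-in for a GPU batched
    # GEMM, where each block would be one batch entry).
    n, c = len(V), len(W)
    step = n // batch_count  # assumes batch_count divides n evenly
    out = []
    for bi in range(batch_count):
        for row in V[bi * step:(bi + 1) * step]:
            out.append([sum(row[k] * W[k][j] for k in range(c))
                        for j in range(c)])
    return out
```

Since the blocks are disjoint, the batched result is bit-identical to the single large product, so the batch count is purely a performance knob, consistent with the small differences reported above.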
 
10
The implicit solver of the GT5D is well-conditioned in terms of the convergence property. For ill-conditioned solvers, the use of a more stable basis conversion (such as the Newton basis [12]) or a more stable TSQR algorithm (such as SVQR [25] or CAQR [8]) is probably required.
 
11
If a larger number of GPUs (i.e., \(p>64\)) is utilized, the allreduce cost becomes more dominant in the total solution time, and the speedup ratio of the M-CA-GMRES over the GMRES or GCR is possibly higher.
 
References
1.
Abdelfattah A, Haidar A, Tomov S, Dongarra J (2016) Performance, design, and autotuning of batched GEMM for GPUs. In: Proceedings of the ISC High Performance Computing 2016, LNCS, vol 9697. Springer, pp 21–38
4.
Carson E (2015) Communication-avoiding Krylov subspace methods in theory and practice. PhD dissertation, University of California, Berkeley
6.
Concus P, Golub GH (1976) A generalized conjugate gradient method for nonsymmetric systems of linear equations. In: Computing Methods in Applied Sciences and Engineering, Lecture Notes in Economics and Mathematical Systems, vol 134. Springer, pp 56–65. https://doi.org/10.1007/978-3-642-85972-4_4
10.
Fujita N, Nuga H, Boku T, Idomura Y (2013) Nuclear fusion simulation code optimization on GPU clusters. In: Proceedings of the 19th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2013). IEEE, pp 1266–1274. https://doi.org/10.1109/ICPADS.2013.65
11.
Golub GH, Van Loan CF (2013) Matrix computations, 4th edn. The Johns Hopkins University Press, Baltimore
12.
Hoemmen M (2010) Communication-avoiding Krylov subspace methods. PhD dissertation, University of California, Berkeley
14.
Idomura Y, Ina T, Mayumi A, Yamada S, Matsumoto K, Asahi Y, Imamura T (2017) Application of a communication-avoiding generalized minimal residual method to a gyrokinetic five dimensional Eulerian code on many core platforms. In: Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '17), p 7. https://doi.org/10.1145/3148226.3148234
15.
Idomura Y, Nakata M, Yamada S, Machida M, Imamura T, Watanabe T, Nunami M, Inoue H, Tsutsumi S, Miyoshi I, Shida N (2014) Communication-overlap techniques for improved strong scaling of gyrokinetic Eulerian code beyond 100k cores on the K-computer. Int J High Perform Comput Appl 28(1):73–86. https://doi.org/10.1177/1094342013490973
21.
Rosendale JV (1983) Minimizing inner product data dependencies in conjugate gradient iteration. Technical Report NASA-CR-17, NASA
22.
Saad Y (2003) Iterative methods for sparse linear systems, 2nd edn. SIAM, Philadelphia
24.
Shimokawabe T, Aoki T, Muroi C, Ishida J, Kawano K, Endo T, Nukada A, Maruyama N, Matsuoka S (2010) An 80-fold speedup, 15.0 TFlops GPU acceleration of non-hydrostatic weather model ASUCA production code. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2010). IEEE. https://doi.org/10.1109/SC.2010.9
28.
Williams SW (2011) The roofline model. In: Bailey DH, Lucas RF, Williams SW (eds) Performance tuning of scientific applications, chapter 9. CRC Press, Boca Raton, pp 195–215
30.
Yamazaki I, Hoemmen M, Luszczek P, Dongarra J (2017) Improving performance of GMRES by reducing communication and pipelining global collectives. In: Proceedings of the 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops (IPDPSW 2017). IEEE, pp 1118–1127. https://doi.org/10.1109/IPDPSW.2017.65
Metadata
Title
Implementation and performance evaluation of a communication-avoiding GMRES method for stencil-based code on GPU cluster
Authors
Kazuya Matsumoto
Yasuhiro Idomura
Takuya Ina
Akie Mayumi
Susumu Yamada
Publication date
05.09.2019
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 12/2019
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-019-02983-7
