Skip to main content
Top

2015 | OriginalPaper | Chapter

Mixed-Precision Orthogonalization Scheme and Adaptive Step Size for Improving the Stability and Performance of CA-GMRES on GPUs

Authors : Ichitaro Yamazaki, Stanimire Tomov, Tingxing Dong, Jack Dongarra

Published in: High Performance Computing for Computational Science -- VECPAR 2014

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

The Generalized Minimum Residual (GMRES) method is a popular Krylov subspace projection method for solving a nonsymmetric linear system of equations. On modern computers, communication is becoming increasingly expensive compared to arithmetic operations, and a communication-avoiding variant (CA-GMRES) may improve the performance of GMRES. To further enhance the performance of CA-GMRES, in this paper, we propose two techniques, focusing on the two main computational kernels of CA-GMRES, tall-skinny QR (TSQR) and matrix powers kernel (MPK). First, to improve the numerical stability of TSQR based on the Cholesky QR (CholQR) factorization, we use higher-precision arithmetic at carefully-selected steps of the factorization. In particular, our mixed-precision CholQR takes the input matrix in the standard \(64\)-bit double precision, but accumulates some of its intermediate results in a software-emulated double-double precision. Compared with the standard CholQR, this mixed-precision CholQR requires about \(8.5\times \) more computation but a much smaller increase in communication. Since the computation is becoming less expensive compared to the communication on a newer computer, the relative overhead of the mixed-precision CholQR is decreasing. Our case studies on a GPU demonstrate that using higher-precision arithmetic for this small but critical segment of the algorithm can improve not only the overall numerical stability of CA-GMRES but also, in some cases, its performance. We then study an adaptive scheme to dynamically adjust the step size of MPK based on the static inputs and the performance measurements gathered during the first restart loop of CA-GMRES. Since the optimal step size of MPK is often much smaller than that of the orthogonalization kernel, the overall performance of CA-GMRES can be improved using different step sizes for these two kernels. Our performance results on multiple GPUs show that our adaptive scheme can choose a near optimal step size for MPK, reducing the total solution time of CA-GMRES.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Footnotes
1
In the current implementation, the numbers of rows and columns in \(X\) and \(Y\) are a multiple of \(h\), and multiples of \(m_b\) and \(n_b\), respectively, where \(n_b\) is a multiple of \(n_r\).
 
Literature
2.
go back to reference Demmel, J., Hoemmen, M., Mohiyuddin, M., Yelick, K.: Avoiding communication in computing Krylov subspaces. Technical report UCB/EECS-2007-123, University of California Berkeley EECS Department, October 2007 Demmel, J., Hoemmen, M., Mohiyuddin, M., Yelick, K.: Avoiding communication in computing Krylov subspaces. Technical report UCB/EECS-2007-123, University of California Berkeley EECS Department, October 2007
3.
go back to reference Golub, G., van Loan, C.: Matrix Computations, 4th edn. The Johns Hopkins University Press, Baltimore (2012) Golub, G., van Loan, C.: Matrix Computations, 4th edn. The Johns Hopkins University Press, Baltimore (2012)
4.
go back to reference Hida, Y., Li, X., Bailey, D.: Quad-double arithmetic: algorithms, implementation, and application. Technical report LBNL-46996 (2000) Hida, Y., Li, X., Bailey, D.: Quad-double arithmetic: algorithms, implementation, and application. Technical report LBNL-46996 (2000)
5.
go back to reference Hoemmen, M.: Communication-avoiding Krylov subspace methods. Ph.D. thesis, University of California, Berkeley (2010) Hoemmen, M.: Communication-avoiding Krylov subspace methods. Ph.D. thesis, University of California, Berkeley (2010)
6.
go back to reference Saad, Y., Schultz, M.: GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput. 7, 856–869 (1986)CrossRefMATHMathSciNet Saad, Y., Schultz, M.: GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput. 7, 856–869 (1986)CrossRefMATHMathSciNet
7.
go back to reference Stathopoulos, A., Wu, K.: A block orthogonalization procedure with constant synchronization requirements. SIAM J. Sci. Comput. 23, 2165–2182 (2002)CrossRefMATHMathSciNet Stathopoulos, A., Wu, K.: A block orthogonalization procedure with constant synchronization requirements. SIAM J. Sci. Comput. 23, 2165–2182 (2002)CrossRefMATHMathSciNet
8.
go back to reference Yamazaki, I., Anzt, H., Tomov, S., Hoemmen, M., Dongarra, J.: Improving the performance of CA-GMRES on multicores with multiple GPUs. In: The Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 382–391 (2014) Yamazaki, I., Anzt, H., Tomov, S., Hoemmen, M., Dongarra, J.: Improving the performance of CA-GMRES on multicores with multiple GPUs. In: The Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 382–391 (2014)
9.
go back to reference Yamazaki, I., Tomov, S., Dongarra, J.: Mixed-precision Cholesky QR factorization and its case studies on multicore CPUs with multiple GPUs. Submitted to SIAM J. Sci. Comput. (2014) Yamazaki, I., Tomov, S., Dongarra, J.: Mixed-precision Cholesky QR factorization and its case studies on multicore CPUs with multiple GPUs. Submitted to SIAM J. Sci. Comput. (2014)
Metadata
Title
Mixed-Precision Orthogonalization Scheme and Adaptive Step Size for Improving the Stability and Performance of CA-GMRES on GPUs
Authors
Ichitaro Yamazaki
Stanimire Tomov
Tingxing Dong
Jack Dongarra
Copyright Year
2015
DOI
https://doi.org/10.1007/978-3-319-17353-5_2

Premium Partner