2015 | OriginalPaper | Book chapter

Fault Tolerance in an Inner-Outer Solver: A GVR-Enabled Case Study

Authors: Ziming Zheng, Andrew A. Chien, Keita Teranishi

Published in: High Performance Computing for Computational Science -- VECPAR 2014

Publisher: Springer International Publishing

Abstract

Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, which account for much of the runtime of many scientific applications. We show that single bit-flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several fault-tolerance strategies for both the inner and outer solvers, appropriate across a range of error rates. We implement them by extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots and multi-version data structures with portable and rich error checking and recovery. Experimental results validate correct execution with low performance overhead under varied error conditions.
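
To make the notion of a single bit-flip error concrete, the following is a minimal sketch of flipping one bit in the IEEE-754 binary64 encoding of a floating-point value. It is an illustration only, not the fault-injection method used in the chapter; the names `flip_bit` and `relative_error` are hypothetical and chosen for this example.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52..62 hold the
    exponent, and bit 63 is the sign bit. (Hypothetical helper.)
    """
    if not 0 <= bit <= 63:
        raise ValueError("bit must be in 0..63")
    # Reinterpret the double as a 64-bit unsigned integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # Toggle the chosen bit and reinterpret the result as a double.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def relative_error(value: float, bit: int) -> float:
    """Relative change caused by flipping `bit` in a nonzero `value`."""
    corrupted = flip_bit(value, bit)
    return abs(corrupted - value) / abs(value)


if __name__ == "__main__":
    x = 1.0
    for bit in (0, 30, 51, 52, 62, 63):
        print(f"bit {bit:2d}: {x} -> {flip_bit(x, bit)!r} "
              f"(relative error {relative_error(x, bit):.3e})")
```

The effect depends strongly on which bit is hit: a flip in a low-order mantissa bit perturbs the value only slightly, while a flip in a high exponent bit can change it by many orders of magnitude or turn it into infinity. This is general IEEE-754 behavior, offered here as background rather than as a result from the chapter.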

Metadata
Title
Fault Tolerance in an Inner-Outer Solver: A GVR-Enabled Case Study
Authors
Ziming Zheng
Andrew A. Chien
Keita Teranishi
Copyright year
2015
DOI
https://doi.org/10.1007/978-3-319-17353-5_11