Published in: Computing 12/2013

1 December 2013

An evaluation of User-Level Failure Mitigation support in MPI

Authors: Wesley Bland, Aurelien Bouteiller, Thomas Herault, Joshua Hursey, George Bosilca, Jack J. Dongarra


Abstract

As the scale of computing platforms becomes increasingly extreme, the requirements for application fault tolerance are increasing as well. Techniques to address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support, they are bound to fail. This paper discusses the failure-free overhead and recovery impact of the user-level failure mitigation proposal presented in the MPI Forum. Experiments demonstrate that fault-aware MPI has little or no impact on performance for a range of applications, and produces satisfactory recovery times when there are failures.

Metadata
Title
An evaluation of User-Level Failure Mitigation support in MPI
Authors
Wesley Bland
Aurelien Bouteiller
Thomas Herault
Joshua Hursey
George Bosilca
Jack J. Dongarra
Publication date
1 December 2013
Publisher
Springer Vienna
Published in
Computing / Issue 12/2013
Print ISSN: 0010-485X
Electronic ISSN: 1436-5057
DOI
https://doi.org/10.1007/s00607-013-0331-3