Published in: The Journal of Supercomputing 1/2017

15 September 2016

Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications

Authors: Nuria Losada, María J. Martín, Patricia González

Abstract

The Message Passing Interface (MPI) standard is the most popular parallel programming model for distributed systems. However, it lacks fault-tolerance support and, traditionally, failures are addressed with stop-and-restart checkpointing solutions. The proposal of User Level Failure Mitigation (ULFM) for the inclusion of resilience capabilities in the MPI standard provides new opportunities in this field, allowing the implementation of resilient MPI applications, i.e., applications that are able to detect and react to failures without stopping their execution. This work compares the performance of a traditional stop-and-restart checkpointing solution with its equivalent resilience proposal. Both approaches are built on top of ComPiler for Portable Checkpointing (CPPC), an application-level checkpointing tool for MPI applications, and they make it possible to transparently obtain fault-tolerant MPI applications from generic MPI Single Program Multiple Data (SPMD) applications. The evaluation focuses on the scalability of the two solutions, comparing both proposals using up to 3072 cores.

Metadata
Title
Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications
Authors
Nuria Losada
María J. Martín
Patricia González
Publication date
15 September 2016
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 1/2017
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-016-1863-z
