Automated application-level checkpointing of MPI programs

Authors:
Greg Bronevetsky, Cornell University, Ithaca, NY
Daniel Marques, Cornell University, Ithaca, NY
Keshav Pingali, Cornell University, Ithaca, NY
Paul Stodghill, Cornell University, Ithaca, NY

ACM SIGPLAN Notices, Volume 38, Issue 10, October 2003, pp. 84–94. https://doi.org/10.1145/966049.781513

Published: 11 June 2003

Abstract

The running times of many computational science applications, such as protein-folding using ab initio methods, are much longer than the mean-time-to-failure of high-performance computing platforms. To run to completion, therefore, these applications must tolerate hardware failures.

In this paper, we focus on the stopping failure model in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in the literature are not suitable for implementing this approach.

We then present a suitable protocol, which is implemented by a co-ordination layer that sits between the application program and the MPI library. We show how this protocol can be used with a precompiler that instruments C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation. We present experimental results that argue that the overhead of using our system can be small.
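
The abstract does not spell out the protocol, so the following is only an illustrative sketch of how a layer between the application and a message library could support checkpoints that do not block communication. It assumes each process counts its own checkpoints (an "epoch") and attaches that count to every outgoing message, so the receiver can tell whether a message crossed a checkpoint boundary. All names (`ckpt_layer`, `layer_wrap_send`, `layer_on_receive`, and so on) are hypothetical, no real MPI calls are used, and the scheme should not be read as the paper's actual design.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: none of these names come from the paper or from MPI. */

/* How a received message relates to the receiver's latest checkpoint. */
enum msg_class {
    MSG_LATE,        /* sent before the sender's checkpoint, received after ours */
    MSG_INTRA_EPOCH, /* sender and receiver are in the same epoch */
    MSG_EARLY        /* sent after a checkpoint that we have not taken yet */
};

/* Small header the layer would attach to every application payload. */
struct ckpt_header {
    int sender_epoch; /* checkpoints the sender had taken when it sent */
};

/* Per-process state kept by the layer. */
struct ckpt_layer {
    int epoch;       /* checkpoints taken so far by this process */
    int late_logged; /* late messages recorded for replay on recovery */
    int early_seen;  /* early messages noted so they are not re-sent */
};

void layer_init(struct ckpt_layer *l)
{
    assert(l != NULL);
    memset(l, 0, sizeof *l);
}

/* Take a local checkpoint without waiting for other processes.
 * A real system would write application state to stable storage here. */
void layer_take_checkpoint(struct ckpt_layer *l)
{
    assert(l != NULL);
    l->epoch++;
}

/* Send path: build the header carrying the sender's current epoch. */
struct ckpt_header layer_wrap_send(const struct ckpt_layer *l)
{
    struct ckpt_header h = { l->epoch };
    return h;
}

/* Receive path: classify the message and update the bookkeeping. */
enum msg_class layer_on_receive(struct ckpt_layer *l, struct ckpt_header h)
{
    if (h.sender_epoch < l->epoch) {
        l->late_logged++;
        return MSG_LATE;
    }
    if (h.sender_epoch > l->epoch) {
        l->early_seen++;
        return MSG_EARLY;
    }
    return MSG_INTRA_EPOCH;
}
```

Because the bookkeeping lives entirely in the wrapper and travels with the message, a sketch like this does not depend on the internals of any particular message library, which is the kind of independence the abstract claims for the authors' layer.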

Index Terms

Automated application-level checkpointing of MPI programs
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Preprocessors
    2. General programming languages
      1. Language types
        Parallel programming languages
  2. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Communications management
        Message passing
    2. Extra-functional properties
      1. Software fault tolerance
        Checkpoint / restart

Published in
ACM SIGPLAN Notices Volume 38, Issue 10
Proceedings of the ACM SIGPLAN symposium on principles and practice of parallel programming (PPoPP 2003) and workshop on partial evaluation and semantics-based program manipulation (PEPM 2003)
October 2003
331 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/966049
PPoPP '03: Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
June 2003
250 pages
ISBN:1581135882
DOI:10.1145/781498
General Chair: Rudolf Eigenmann, Purdue University
Program Chair: Martin Rinard, MIT Laboratory for Computer Science
Copyright © 2003 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 11 June 2003
Author Tags
MPI
application-level checkpointing
fault-tolerance
non-FIFO communication
scientific computing
Qualifiers
- article