
Automated application-level checkpointing of MPI programs

Published: 11 June 2003

Abstract

The running times of many computational science applications, such as protein-folding using ab initio methods, are much longer than the mean-time-to-failure of high-performance computing platforms. To run to completion, therefore, these applications must tolerate hardware failures.

In this paper, we focus on the stopping failure model in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in the literature are not suitable for implementing this approach.

We then present a suitable protocol, which is implemented by a coordination layer that sits between the application program and the MPI library. We show how this protocol can be used with a precompiler that instruments C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation. We present experimental results that argue that the overhead of using our system can be small.



Published in

ACM SIGPLAN Notices, Volume 38, Issue 10: Proceedings of the ACM SIGPLAN symposium on principles and practice of parallel programming (PPoPP 2003) and workshop on partial evaluation and semantics-based program manipulation (PEPM 2003). October 2003, 331 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/966049

Also in: PPoPP '03: Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming. June 2003, 250 pages. ISBN: 1581135882. DOI: 10.1145/781498

            Copyright © 2003 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

