Research Note
Application Level Fault Tolerance in Heterogeneous Networks of Workstations

https://doi.org/10.1006/jpdc.1997.1338

Abstract

We have explored methods for checkpointing and restarting processes within the Distributed Object Migration Environment (Dome), a C++ library of data parallel objects that are automatically distributed over heterogeneous networks of workstations (NOWs). System level checkpointing methods, although transparent to the user, were rejected because they lack support for heterogeneity. We have implemented application level checkpointing, which places the checkpoint and restart mechanisms within Dome's C++ objects. It is provided in two forms: a library-based technique for the programmer and a more transparent preprocessor-based technique. Dome's implementation successfully checkpoints and restarts processes on different numbers of machines and different architectures. Results from executing Dome programs across a NOW with realistic failure rates were determined experimentally and are compared with theoretical results. The overhead of checkpointing is found to be low, and checkpointing substantially decreases expected runtime on realistic systems.
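To make the idea of placing checkpoint and restart logic inside the objects themselves more concrete, the following is a minimal C++ sketch. It is not Dome's actual interface: the class name `CheckpointableVector` and its `checkpoint`, `restore`, and `partition` methods are hypothetical, chosen only for illustration. The sketch assumes that writing state as decimal text is an acceptable way to stay independent of byte order and word size, and that redistributing data in contiguous blocks is enough to show a restart on a different number of machines.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a data parallel object; NOT Dome's real API.
class CheckpointableVector {
public:
    explicit CheckpointableVector(std::vector<std::int64_t> values = {})
        : values_(std::move(values)) {}

    // Save the object's logical state as decimal text, so the checkpoint
    // does not depend on the byte order or word size of the writer.
    void checkpoint(std::ostream& out) const {
        out << "vec " << values_.size() << '\n';
        for (std::int64_t v : values_) out << v << '\n';
    }

    // Rebuild the state from a checkpoint, possibly on another architecture.
    void restore(std::istream& in) {
        std::string tag;
        std::size_t n = 0;
        if (!(in >> tag >> n) || tag != "vec")
            throw std::runtime_error("bad checkpoint header");
        std::vector<std::int64_t> loaded(n);
        for (auto& v : loaded)
            if (!(in >> v)) throw std::runtime_error("truncated checkpoint");
        values_ = std::move(loaded);
    }

    // Split the data into contiguous blocks, one per worker. Because the
    // checkpoint stores the whole logical vector, the worker count at
    // restart may differ from the count at checkpoint time.
    std::vector<std::vector<std::int64_t>> partition(std::size_t workers) const {
        if (workers == 0) throw std::invalid_argument("workers must be > 0");
        std::vector<std::vector<std::int64_t>> blocks(workers);
        const std::size_t n = values_.size();
        for (std::size_t w = 0; w < workers; ++w) {
            const auto begin = static_cast<std::ptrdiff_t>(w * n / workers);
            const auto end = static_cast<std::ptrdiff_t>((w + 1) * n / workers);
            blocks[w].assign(values_.begin() + begin, values_.begin() + end);
        }
        return blocks;
    }

    const std::vector<std::int64_t>& values() const { return values_; }

private:
    std::vector<std::int64_t> values_;
};
```

In use, a program would call `checkpoint` on a stream at a convenient point in its main loop and, after a failure, construct an empty object, call `restore` on the saved stream, and then call `partition` with however many workers are available. A system level checkpoint, by contrast, saves a raw memory image, which is tied to the architecture that produced it.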


This research was sponsored by the National Science Foundation and the Defense Advanced Research Projects Agency under Cooperative Agreement NCR-8919038 with the Corporation for National Research Initiatives and by the Advanced Research Projects Agency under contract number DABT63-93-C-0054.

2 Currently at Inktomi Corporation.

3 Currently at Intel Corporation.

4 E-mail: {adamb,eriks,pstephan}@cs.cmu.edu.
