Application Level Fault Tolerance in Heterogeneous Networks of Workstations

doi:10.1006/jpdc.1997.1338

Journal of Parallel and Distributed Computing

Volume 43, Issue 2, 15 June 1997, Pages 147-155

https://doi.org/10.1006/jpdc.1997.1338

Abstract

We have explored methods for checkpointing and restarting processes within the Distributed Object Migration Environment (Dome), a C++ library of data parallel objects that are automatically distributed over heterogeneous networks of workstations (NOWs). System level checkpointing methods, although transparent to the user, were rejected because they lack support for heterogeneity. We have instead implemented application level checkpointing, which places the checkpoint and restart mechanisms within Dome's C++ objects. It is available in two forms: a library-based technique for the programmer and a more transparent preprocessor-based technique. Dome's implementation successfully checkpoints and restarts processes on different numbers of machines and on different architectures. Results from executing Dome programs across a NOW with realistic failure rates were determined experimentally and are compared with theoretical results. The overhead of checkpointing is found to be low, and checkpointing substantially decreases expected runtime on realistic systems.
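
The idea of application level checkpointing across machine counts and architectures can be illustrated with a minimal sketch. This is not Dome's API or its checkpoint format: the function names `checkpoint` and `restart`, the file layout, and the choice of Python are all assumptions made for illustration. The sketch stores a loop counter and a distributed vector in a fixed big-endian layout, so the file does not depend on the byte order of the machine that wrote it, and it re-partitions the vector on restart so the number of workers may change.

```python
import io
import struct


def checkpoint(chunks, iteration, out):
    """Write the loop counter and every vector element in a fixed big-endian layout.

    `chunks` is a list of per-worker lists of floats (a hypothetical stand-in
    for a distributed data parallel object).
    """
    values = [v for chunk in chunks for v in chunk]
    # Header: iteration number and element count, both 64-bit big-endian integers.
    out.write(struct.pack(">qq", iteration, len(values)))
    # Payload: the elements as 64-bit big-endian IEEE 754 doubles.
    out.write(struct.pack(f">{len(values)}d", *values))


def restart(inp, n_workers):
    """Read a checkpoint and split the vector over n_workers.

    n_workers may differ from the number of workers that wrote the checkpoint.
    """
    iteration, n = struct.unpack(">qq", inp.read(16))
    values = list(struct.unpack(f">{n}d", inp.read(8 * n)))
    # Block distribution: the first `extra` workers receive one more element.
    base, extra = divmod(n, n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        chunks.append(values[start:start + size])
        start += size
    return iteration, chunks
```

Because the checkpoint holds only application data in an explicit layout, and no raw process memory, a program written this way could resume at the saved iteration on a different architecture or with a different number of workers. That is the property the abstract attributes to application level checkpointing.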

^☆: This research was sponsored by the National Science Foundation and the Defense Advanced Research Projects Agency under Cooperative Agreement NCR-8919038 with the Corporation for National Research Initiatives and by the Advanced Research Projects Agency under contract number DABT63-93-C-0054.

²: Currently at Inktomi Corporation.

³: Currently at Intel Corporation.

⁴: E-mail: {adamb,eriks,pstephan}@cs.cmu.edu.

Article type: Research Note