Research Note
Application Level Fault Tolerance in Heterogeneous Networks of Workstations☆
- ☆ This research was sponsored by the National Science Foundation and the Defense Advanced Research Projects Agency under Cooperative Agreement NCR-8919038 with the Corporation for National Research Initiatives and by the Advanced Research Projects Agency under contract number DABT63-93-C-0054.
- 2 Currently at Inktomi Corporation.
- 3 Currently at Intel Corporation.
- 4 E-mail: {adamb,eriks,pstephan}@cs.cmu.edu.