Published in: The Journal of Supercomputing 3/2013

Open Access, 1 September 2013

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

Authors: Ifeanyi P. Egwutuoha, David Levy, Bran Selic, Shiping Chen

Abstract

In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems and survey the fault tolerance approaches for HPC systems, along with the issues these approaches raise. We discuss rollback-recovery techniques in particular because they are the techniques most often used for long-running applications on HPC clusters. Specifically, we discuss the feature requirements of rollback-recovery and develop a taxonomy for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain and to facilitate the development of new checkpointing solutions.

Metadata
Title
A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
Authors
Ifeanyi P. Egwutuoha
David Levy
Bran Selic
Shiping Chen
Publication date
1 September 2013
Publisher
Springer US
Published in
The Journal of Supercomputing, Issue 3/2013
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-013-0884-0
