The Journal of Supercomputing


Open Access 01.09.2013

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

Authors: Ifeanyi P. Egwutuoha, David Levy, Bran Selic, Shiping Chen

Published in: The Journal of Supercomputing | Issue 3/2013


Abstract

In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems and survey the fault tolerance approaches for these systems, along with the issues those approaches raise. Rollback-recovery techniques are discussed because they are the techniques most often used for long-running applications on HPC clusters. Specifically, the feature requirements of rollback-recovery are discussed and a taxonomy is developed for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain as well as to facilitate development of new checkpointing solutions.



Title: A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
Authors: Ifeanyi P. Egwutuoha, David Levy, Bran Selic, Shiping Chen
Publication date: 01.09.2013
Publisher: Springer US
Published in: The Journal of Supercomputing / Issue 3/2013
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-013-0884-0
