Skip to main content
Top

2013 | OriginalPaper | Chapter

8. Asynchronous Distributed Checkpointing

Author : Michel Raynal

Published in: Distributed Algorithms for Message-Passing Systems

Publisher: Springer Berlin Heidelberg

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

This chapter is devoted to checkpointing in asynchronous message-passing systems. It first presents the notions of local and global checkpoints and a theorem stating a necessary and sufficient condition for a set of local checkpoints to belong to the same consistent global checkpoint.
Then, the chapter considers two consistency conditions, which can be associated with a distributed computation enriched with local checkpoints (the corresponding execution is called a communication and checkpoint pattern). The first consistency condition (called z-cycle-freedom) ensures that any local checkpoint, which has been taken by a process, belongs to a consistent global checkpoint. The second consistency condition (called rollback-dependency trackability) is stronger. It states that a consistent global checkpoint can be associated on the fly with each local checkpoint (i.e., without additional communication).
The chapter discusses these consistency conditions and presents algorithms that, once superimposed on a distributed execution, ensure that the corresponding consistency condition is satisfied. It also presents a message logging algorithm suited to uncoordinated checkpointing.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
2.
go back to reference A. Acharya, B.R. Badrinath, Checkpointing distributed application on mobile computers, in 3rd Int’l Conference on Parallel and Distributed Information Systems (IEEE Press, New York, 1994), pp. 73–80 CrossRef A. Acharya, B.R. Badrinath, Checkpointing distributed application on mobile computers, in 3rd Int’l Conference on Parallel and Distributed Information Systems (IEEE Press, New York, 1994), pp. 73–80 CrossRef
16.
go back to reference L. Alvisi, K. Marzullo, Message logging: pessimistic, optimistic, and causal. IEEE Trans. Softw. Eng. 24(2), 149–159 (1998) CrossRef L. Alvisi, K. Marzullo, Message logging: pessimistic, optimistic, and causal. IEEE Trans. Softw. Eng. 24(2), 149–159 (1998) CrossRef
33.
go back to reference R. Baldoni, J.M. Hélary, A. Mostéfaoui, M. Raynal, Impossibility of scalar clock-based communication-induced checkpointing protocols ensuring the RDT property. Inf. Process. Lett. 80(2), 105–111 (2001) MATHCrossRef R. Baldoni, J.M. Hélary, A. Mostéfaoui, M. Raynal, Impossibility of scalar clock-based communication-induced checkpointing protocols ensuring the RDT property. Inf. Process. Lett. 80(2), 105–111 (2001) MATHCrossRef
34.
go back to reference R. Baldoni, J.M. Hélary, A. Mostéfaoui, M. Raynal, A communication-induced checkpointing protocol that ensures rollback-dependency trackability, in Proc. 27th IEEE Symposium on Fault-Tolerant Computing (FTCS-27) (IEEE Press, New York, 1997), pp. 68–77 CrossRef R. Baldoni, J.M. Hélary, A. Mostéfaoui, M. Raynal, A communication-induced checkpointing protocol that ensures rollback-dependency trackability, in Proc. 27th IEEE Symposium on Fault-Tolerant Computing (FTCS-27) (IEEE Press, New York, 1997), pp. 68–77 CrossRef
35.
36.
go back to reference R. Baldoni, J.M. Hélary, M. Raynal, Rollback-dependency trackability: a minimal characterization and its protocol. Inf. Comput. 165(2), 144–173 (2001) MATHCrossRef R. Baldoni, J.M. Hélary, M. Raynal, Rollback-dependency trackability: a minimal characterization and its protocol. Inf. Comput. 165(2), 144–173 (2001) MATHCrossRef
52.
go back to reference B.K. Bhargava, S.-R. Lian, Independent checkpointing and concurrent rollback for recovery in distributed systems: an optimistic approach, in Proc. 7th IEEE Symposium on Reliable Distributed Systems (SRDS’88) (IEEE Press, New York, 1988), pp. 3–12 B.K. Bhargava, S.-R. Lian, Independent checkpointing and concurrent rollback for recovery in distributed systems: an optimistic approach, in Proc. 7th IEEE Symposium on Reliable Distributed Systems (SRDS’88) (IEEE Press, New York, 1988), pp. 3–12
62.
go back to reference D. Briatico, A. Ciuffoletti, L.A. Simoncini, Distributed domino-effect free recovery algorithm, in 4th IEEE Symposium on Reliability in Distributed Software and Database Systems (IEEE Press, New York, 1984), pp. 207–215 D. Briatico, A. Ciuffoletti, L.A. Simoncini, Distributed domino-effect free recovery algorithm, in 4th IEEE Symposium on Reliability in Distributed Software and Database Systems (IEEE Press, New York, 1984), pp. 207–215
75.
go back to reference K.M. Chandy, L. Lamport, Distributed snapshots: determining global states of distributed systems. ACM Trans. Comput. Syst. 3(1), 63–75 (1985) CrossRef K.M. Chandy, L. Lamport, Distributed snapshots: determining global states of distributed systems. ACM Trans. Comput. Syst. 3(1), 63–75 (1985) CrossRef
103.
go back to reference F. Cristian, F. Jahanian, A timestamping-based checkpointing protocol for long-lived distributed computations, in Proc. 10th IEEE Symposium on Reliable Distributed Systems (SRDS’91) (IEEE Press, New York, 1991), pp. 12–20 F. Cristian, F. Jahanian, A timestamping-based checkpointing protocol for long-lived distributed computations, in Proc. 10th IEEE Symposium on Reliable Distributed Systems (SRDS’91) (IEEE Press, New York, 1991), pp. 12–20
104.
go back to reference O.P. Damani, Y.-M. Wang, V.K. Garg, Distributed recovery with k-optimistic logging. J. Parallel Distrib. Comput. 63(12), 1193–1218 (2003) MATHCrossRef O.P. Damani, Y.-M. Wang, V.K. Garg, Distributed recovery with k-optimistic logging. J. Parallel Distrib. Comput. 63(12), 1193–1218 (2003) MATHCrossRef
120.
go back to reference E.N. Elnozahy, L. Alvisi, Y.-M. Wang, D.B. Johnson, A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34(3), 375–408 (2002) CrossRef E.N. Elnozahy, L. Alvisi, Y.-M. Wang, D.B. Johnson, A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34(3), 375–408 (2002) CrossRef
129.
go back to reference J. Fowler, W. Zwaenepoel, Causal distributed breakpoints, in Proc. 10th Int’l IEEE Conference on Distributed Computing Systems (ICDCS’90) (IEEE Press, New York, 1990), pp. 134–141 J. Fowler, W. Zwaenepoel, Causal distributed breakpoints, in Proc. 10th Int’l IEEE Conference on Distributed Computing Systems (ICDCS’90) (IEEE Press, New York, 1990), pp. 134–141
144.
go back to reference I.C. Garcia, E. Buzato, Progressive construction of consistent global checkpoints, in Proc. 19th Int’l Conference on Distributed Computing Systems (ICDCS’99) (IEEE Press, New York, 1999), pp. 55–62 I.C. Garcia, E. Buzato, Progressive construction of consistent global checkpoints, in Proc. 19th Int’l Conference on Distributed Computing Systems (ICDCS’99) (IEEE Press, New York, 1999), pp. 55–62
145.
go back to reference I.C. Garcia, L.E. Buzato, On the minimal characterization of the rollback-dependency trackability property, in Proc. 21st Int’l Conference on Distributed Computing Systems (ICDCS’01) (IEEE Press, New York, 2001), pp. 342–349 CrossRef I.C. Garcia, L.E. Buzato, On the minimal characterization of the rollback-dependency trackability property, in Proc. 21st Int’l Conference on Distributed Computing Systems (ICDCS’01) (IEEE Press, New York, 2001), pp. 342–349 CrossRef
146.
go back to reference I.C. Garcia, L.E. Buzato, An efficient checkpointing protocol for the minimal characterization of operational rollback-dependency trackability, in Proc. 23rd Int’l Symposium on Reliable Distributed Systems (SRDS’04) (IEEE Press, New York, 2004), pp. 126–135 CrossRef I.C. Garcia, L.E. Buzato, An efficient checkpointing protocol for the minimal characterization of operational rollback-dependency trackability, in Proc. 23rd Int’l Symposium on Reliable Distributed Systems (SRDS’04) (IEEE Press, New York, 2004), pp. 126–135 CrossRef
149.
go back to reference V.K. Garg, Principles of Distributed Systems (Kluwer Academic, Dordrecht, 1996), 274 pages CrossRef V.K. Garg, Principles of Distributed Systems (Kluwer Academic, Dordrecht, 1996), 274 pages CrossRef
161.
go back to reference A.P. Goldberg, A. Gopal, A. Lowry, R. Strom, Restoring consistent global states of distributed computations, in Proc. ACM/ONR Workshop on Parallel and Distributed Debugging (ACM Press, New York, 1991), pp. 144–156 A.P. Goldberg, A. Gopal, A. Lowry, R. Strom, Restoring consistent global states of distributed computations, in Proc. ACM/ONR Workshop on Parallel and Distributed Debugging (ACM Press, New York, 1991), pp. 144–156
171.
go back to reference J.-M. Hélary, A. Mostéfaoui, R.H.B. Netzer, M. Raynal, Communication-based prevention of useless checkpoints in distributed computations. Distrib. Comput. 13(1), 29–43 (2000) CrossRef J.-M. Hélary, A. Mostéfaoui, R.H.B. Netzer, M. Raynal, Communication-based prevention of useless checkpoints in distributed computations. Distrib. Comput. 13(1), 29–43 (2000) CrossRef
173.
go back to reference J.-M. Hélary, A. Mostéfaoui, M. Raynal, Communication-induced determination of consistent snapshots. IEEE Trans. Parallel Distrib. Syst. 10(9), 865–877 (1999) CrossRef J.-M. Hélary, A. Mostéfaoui, M. Raynal, Communication-induced determination of consistent snapshots. IEEE Trans. Parallel Distrib. Syst. 10(9), 865–877 (1999) CrossRef
174.
go back to reference J.-M. Hélary, A. Mostéfaoui, M. Raynal, Interval consistency of asynchronous distributed computations. J. Comput. Syst. Sci. 64(2), 329–349 (2002) MATHCrossRef J.-M. Hélary, A. Mostéfaoui, M. Raynal, Interval consistency of asynchronous distributed computations. J. Comput. Syst. Sci. 64(2), 329–349 (2002) MATHCrossRef
175.
go back to reference J.-M. Hélary, R.H.B. Netzer, M. Raynal, Consistency criteria for distributed checkpoints. IEEE Trans. Softw. Eng. 2(2), 274–281 (1999) CrossRef J.-M. Hélary, R.H.B. Netzer, M. Raynal, Consistency criteria for distributed checkpoints. IEEE Trans. Softw. Eng. 2(2), 274–281 (1999) CrossRef
202.
go back to reference D.B. Johnson, W. Zwaenepoel, Recovery in distributed systems using optimistic message logging and checkpointing. J. Algorithms 11(3), 462–491 (1990) MathSciNetMATHCrossRef D.B. Johnson, W. Zwaenepoel, Recovery in distributed systems using optimistic message logging and checkpointing. J. Algorithms 11(3), 462–491 (1990) MathSciNetMATHCrossRef
207.
go back to reference R. Koo, S. Toueg, Checkpointing and rollback-recovery for distributed systems. IEEE Trans. Softw. Eng. 13(1), 23–31 (1987) MATHCrossRef R. Koo, S. Toueg, Checkpointing and rollback-recovery for distributed systems. IEEE Trans. Softw. Eng. 13(1), 23–31 (1987) MATHCrossRef
219.
go back to reference A.D. Kshemkalyani, M. Singhal, Distributed Computing: Principles, Algorithms and Systems (Cambridge University Press, Cambridge, 2008), 736 pages MATHCrossRef A.D. Kshemkalyani, M. Singhal, Distributed Computing: Principles, Algorithms and Systems (Cambridge University Press, Cambridge, 2008), 736 pages MATHCrossRef
246.
go back to reference D. Manivannan, R.H.B. Netzer, M. Singhal, Finding consistent global checkpoints in a distributed computation. IEEE Trans. Parallel Distrib. Syst. 8(6), 623–627 (1997) CrossRef D. Manivannan, R.H.B. Netzer, M. Singhal, Finding consistent global checkpoints in a distributed computation. IEEE Trans. Parallel Distrib. Syst. 8(6), 623–627 (1997) CrossRef
247.
go back to reference D. Manivannan, M. Singhal, A low overhead recovery technique using quasi-synchronous checkpointing, in Proc. 16th IEEE Int’l Conference on Distributed Computing Systems (ICDCS’96) (IEEE Press, New York, 1996), pp. 100–107 CrossRef D. Manivannan, M. Singhal, A low overhead recovery technique using quasi-synchronous checkpointing, in Proc. 16th IEEE Int’l Conference on Distributed Computing Systems (ICDCS’96) (IEEE Press, New York, 1996), pp. 100–107 CrossRef
271.
go back to reference A. Mostéfaoui, M. Raynal, Efficient message logging for uncoordinated checkpointing protocols, in Proc. 2nd European Dependable Computing Conference (EDCC’96). LNCS, vol. 1150 (Springer, Berlin, 1996), pp. 353–364 A. Mostéfaoui, M. Raynal, Efficient message logging for uncoordinated checkpointing protocols, in Proc. 2nd European Dependable Computing Conference (EDCC’96). LNCS, vol. 1150 (Springer, Berlin, 1996), pp. 353–364
283.
go back to reference R.H.B. Netzer, J. Xu, Necessary and sufficient conditions for consistent global snapshots. IEEE Trans. Parallel Distrib. Syst. 6(2), 165–169 (1995) CrossRef R.H.B. Netzer, J. Xu, Necessary and sufficient conditions for consistent global snapshots. IEEE Trans. Parallel Distrib. Syst. 6(2), 165–169 (1995) CrossRef
284.
go back to reference N. Neves, W.K. Fuchs, Adaptive recovery for mobile environments. Commun. ACM 40(1), 68–74 (1997) CrossRef N. Neves, W.K. Fuchs, Adaptive recovery for mobile environments. Commun. ACM 40(1), 68–74 (1997) CrossRef
299.
go back to reference R. Prakash, M. Singhal, Low-cost checkpointing and failure recovery in mobile computing systems. IEEE Trans. Parallel Distrib. Syst. 7(10), 1035–1048 (1996) CrossRef R. Prakash, M. Singhal, Low-cost checkpointing and failure recovery in mobile computing systems. IEEE Trans. Parallel Distrib. Syst. 7(10), 1035–1048 (1996) CrossRef
303.
go back to reference B. Randell, System structure for software fault-tolerance. IEEE Trans. Softw. Eng. SE1(2), 220–232 (1975) CrossRef B. Randell, System structure for software fault-tolerance. IEEE Trans. Softw. Eng. SE1(2), 220–232 (1975) CrossRef
332.
go back to reference D.L. Russell, State restoration in systems of communicating processes. IEEE Trans. Softw. Eng. SE6(2), 183–194 (1980) CrossRef D.L. Russell, State restoration in systems of communicating processes. IEEE Trans. Softw. Eng. SE6(2), 183–194 (1980) CrossRef
337.
go back to reference R. Schmid, I.C. Garcia, F. Pedone, L.E. Buzato, Optimal asynchronous garbage collection for RDT checkpointing protocols, in Proc. 25th Int’l Conference on Distributed Computing Systems (ICDCS’01) (IEEE Press, New York, 2005), pp. 167–176 R. Schmid, I.C. Garcia, F. Pedone, L.E. Buzato, Optimal asynchronous garbage collection for RDT checkpointing protocols, in Proc. 25th Int’l Conference on Distributed Computing Systems (ICDCS’01) (IEEE Press, New York, 2005), pp. 167–176
344.
go back to reference L.M. Silva, J.G. Silva, Global checkpoints for distributed programs, in Proc. 11th Symposium on Reliable Distributed Systems (SRDS’92) (IEEE Press, New York, 1992), pp. 155–162 L.M. Silva, J.G. Silva, Global checkpoints for distributed programs, in Proc. 11th Symposium on Reliable Distributed Systems (SRDS’92) (IEEE Press, New York, 1992), pp. 155–162
352.
go back to reference A.P. Sistla, J.L. Welch, Efficient distributed recovery using message logging, in Proc. 8th ACM Symposium on Principles of Distributed Computing (PODC’89) (ACM Press, New York, 1989), pp. 223–238 A.P. Sistla, J.L. Welch, Efficient distributed recovery using message logging, in Proc. 8th ACM Symposium on Principles of Distributed Computing (PODC’89) (ACM Press, New York, 1989), pp. 223–238
375.
go back to reference J. Tsai, S.-Y. Kuo, Y.-M. Wang, Theoretical analysis for communication-induced checkpointing protocols with rollback-dependency trackability. IEEE Trans. Parallel Distrib. Syst. 9(10), 963–971 (1998) CrossRef J. Tsai, S.-Y. Kuo, Y.-M. Wang, Theoretical analysis for communication-induced checkpointing protocols with rollback-dependency trackability. IEEE Trans. Parallel Distrib. Syst. 9(10), 963–971 (1998) CrossRef
376.
go back to reference J. Tsai, Y.-M. Wang, S.-Y. Kuo, Evaluations of domino-free communication-induced checkpointing protocols. Inf. Process. Lett. 69(1), 31–37 (1999) MathSciNetCrossRef J. Tsai, Y.-M. Wang, S.-Y. Kuo, Evaluations of domino-free communication-induced checkpointing protocols. Inf. Process. Lett. 69(1), 31–37 (1999) MathSciNetCrossRef
381.
go back to reference Y.-M. Wang, Consistent global checkpoints that contain a given set of local checkpoints. IEEE Trans. Comput. 46(4), 456–468 (1997) MathSciNetCrossRef Y.-M. Wang, Consistent global checkpoints that contain a given set of local checkpoints. IEEE Trans. Comput. 46(4), 456–468 (1997) MathSciNetCrossRef
382.
go back to reference Y.-M. Wang, P.Y. Chung, I.J. Lin, W.K. Fuchs, Checkpointing space reclamation for uncoordinated checkpointing in message-passing systems. IEEE Trans. Parallel Distrib. Syst. 6(5), 546–554 (1995) CrossRef Y.-M. Wang, P.Y. Chung, I.J. Lin, W.K. Fuchs, Checkpointing space reclamation for uncoordinated checkpointing in message-passing systems. IEEE Trans. Parallel Distrib. Syst. 6(5), 546–554 (1995) CrossRef
383.
go back to reference Y.-M. Wang, W.K. Fuchs, Optimistic message logging for independent checkpointing in message-passing systems, in Proc. 11th Symposium on Reliable Distributed Systems (SRDS’92) (IEEE Press, New York, 1992), pp. 147–154 Y.-M. Wang, W.K. Fuchs, Optimistic message logging for independent checkpointing in message-passing systems, in Proc. 11th Symposium on Reliable Distributed Systems (SRDS’92) (IEEE Press, New York, 1992), pp. 147–154
Metadata
Title
Asynchronous Distributed Checkpointing
Author
Michel Raynal
Copyright Year
2013
Publisher
Springer Berlin Heidelberg
DOI
https://doi.org/10.1007/978-3-642-38123-2_8

Premium Partner