skip to main content
article
Free Access

Unreliable failure detectors for reliable distributed systems

Published:01 March 1996Publication History
Skip Abstract Section

Abstract

We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties—completeness and accuracy. We show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus, the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus [Chandra et al. 1992].

References

  1. AMIR, ~., DOLEV, D., KRAMER, S., AND MALKI, D. 1991. Transis: A communication sub-system for high availability. Tech. Rep. CS91-13 (Nov.), Computer Science Department, The Hebrew University of Jerusalem, Jerusalem, Israel.Google ScholarGoogle Scholar
  2. ATrlYA~ H., B^R-NoY, A., DOLEV, D., KOLLER, D., PELEG, D., AND REISCHUK, R. 1987. Achievable cases in an asynchronous environment. In Proceedings of the 28th Symposium on Foundations of Computer Science (Oct.). IEEE Computer Society Press, Washington, D.C., pp. 337-346.Google ScholarGoogle Scholar
  3. ATTIYA, H., DWORK, C., LYNCH, N., AND STOCKMEYER, L. 1991. Bounds on the time to reach agreement Jn the presence of timing uncertainty. In Proceedings of the 23rd ,4CM Symposium on Theory of Computing (New Orleans, La., May 6-8). ACM, New York, pp. 359-369. Google ScholarGoogle Scholar
  4. BEN-OR, M. 1983. Another advantage of free choice: Completely asynchronous agreement protocols. In Proceedings of the 2nd ACM Symposium on Principles of Distributed Computing (Montreal, Que. Canada, Aug. 17-19). ACM, New York, pp. 27-30. Google ScholarGoogle Scholar
  5. BERMAN, P., GARAY, J. A., AND PERRY, K.J. 1989. Towards optimal distributed consensus. In Proceedings of the 30th Symposium on Foundations of Computer Science (Oct.). IEEE Computer Society Press, Washington, D.C., pp. 410-415.Google ScholarGoogle Scholar
  6. BIRAN, O., MORA~, S., AND ZAKS, S. 1988. A combinatorial characterization of the distributed tasks which are solvable in the presence of one faulty processor. In Proceedings of the 7th ACM Symposium on Principles of Distributed Computing (Toronto, Ont., Canada, Aug. 15-17). ACM, New York, pp. 263-275. Google ScholarGoogle Scholar
  7. BIRMAN, K. P., COOPER, R., JOSEPH, T. A., KANE, K. P., AND SCHMUCK, F. B. 1990. lsis--A Distributed Programming Environment.Google ScholarGoogle Scholar
  8. BIRMAN, K. P., AND JOSEPH, T.A. 1987. Reliable communication in the presence of failures. ACM Trans. Comput. Syst. 5, 1 (Feb.), 47-76. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. BRACHA, G., AND TOUEG, S. 1985. Asynchronous consensus and broadcast protocols. J. ACM 32, 4 (Oct.), 824-840. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. BRIDGLAND, M. F., AND WATRO, R.J. 1987. Fault-tolerant decision making in totally asynchronous distributed systems. In Proceedings of the 6th ACM Symposium on Principles of Distributed Computing (Vancouver, B.C., Canada, Aug. 10-12). ACM, New York, pp. 52-63. Google ScholarGoogle Scholar
  11. BUDmRAJA, N., GOPAL, A., AND TOUEG, S. 1990. Early-stopping distributed bidding and applications. In Proceedings of the 4th International Workshop on Distributed Algorithms (Sept.). Springer- Verlag, New York, pp. 301-320. Google ScholarGoogle Scholar
  12. CHANDRA, T. D., HADZILACOS, V., AND TOUEG, S. 1992. The weakest failure detector for solving consensus. Technical Report 92-1293 (July), Department of Computer Science, Cornell University. Available from ftp://ftp.cs.cornell.edu/pub/chandra/failure.detectors.weakest.dvi. Z. A preliminary version appeared in the Proceedings of the llth ACM Symposium on Principles of Distributed Computing (Vancouver, B.C., Canada, Aug. 10-12). ACM, New York, pp. 147-158. Google ScholarGoogle Scholar
  13. CHANDRA, T. D., HADZlLACOS, V., AND TOUEG, S. 1995. Impossibility of group membership in asynchronous systems. Tech. Rep. 95-1533. Computer Science Department, Cornell University, Ithaca, New York. Google ScholarGoogle Scholar
  14. CHANDRA, T. D., AND LARREA, M. 1994. E-mail correspondence. Showed that cW cannot be used to solve non-blocking atomic commit.Google ScholarGoogle Scholar
  15. CHANDRA, T. D., AND TOUEG, S. 1990. Time and message efficient reliable broadcasts. In Proceedings of the Fourth International Workshop on Distributed Algorithms (Sept.). Springer-Verlag, New York, pp. 289-300. Google ScholarGoogle Scholar
  16. CHANG, J., AND MAXEMCHUK, N. 1984. Reliable broadcast protocols. ACM Trans. Comput. Syst. 2, 3 (Aug.), 251-273. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. CHOR, B., AND DWORK, C. 1989. Randomization in byzantine agreement. Adv. Comput. Res. 5, 443-497.Google ScholarGoogle Scholar
  18. CRISTIAN, F. 1987. Issues in the design of highly available computing services. In Annual Symposium of the Canadian Information Processing Society (July), pp. 9-16. Also IBM Res. Rep. RJ5856. Thomas J. Watson Research Center, Hawthorne, N.Y.Google ScholarGoogle Scholar
  19. CRISTIAN, F., AGHILI, H., STRONG, R., AND DOLEV, D. 1985/1989. Atomic broadcast: From simple message diffusion to Byzantine agreement. In Proceedings of the 15th International Symposium on Fauh-Tolerant Computing (June 1985), pp. 200-206. A revised version appears as IBM Research Laboratory Technical Report RJ5244 (April 1989). Thomas J. Watson Research Center, Hawthorne, N.Y.Google ScholarGoogle Scholar
  20. CRISTIAN, F., DANCEY, R. D., AND DEHN, J. 1990. Fault-tolerance in the advanced automation system. Tech. Rep. RJ 7424 (April), IBM Research Laboratory, Thomas J. Watson Research Center, Hawthorne, N.Y.Google ScholarGoogle Scholar
  21. DOLEV, D., DWORK, C., AND STOCKMEYER, L. 1987. On the minimal synchronism needed for distributed consensus. J. ACM 34, 1 (Jan.), 77-97. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. DOLEV, D., LYNCH, N. A., PINTER, S. S., STARK, E. W., AND WEIHL, W. E. 1986. Reaching approximate agreement in the presence of faults. J. ACM 33, 3 (July), 499-516. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. DWORK, C., LYNCH, N. A., AND STOCKMEYER, L. 1988. Consensus in the presence of partial synchrony. J. ACM 35, 2 (Apr.), 288-323. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. FISCHER, M.J. 1983. The consensus problem in unreliable distributed systems (a brief survey). Tech. Rep. 273 (June), Department of Computer Science, Yale University, New Haven, Conn.Google ScholarGoogle Scholar
  25. FISCHER, M. J., LYNCH, N. A., AND PATERSON, M.S. 1985. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (Apr.), 374-382. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. GOPAL, A., STRONG, R., TOUEG, S., AND CRIST{AN, F. 1990. Early-delivery atomic broadcast. In Proceedings of the 9th ACM Symposium on Principles of Distributed Computing (Quebec City, Que, Canada, Aug. 22-24). ACM, New York, pp. 297-310. Google ScholarGoogle Scholar
  27. GVERRAOt;k R. 1995. Revisiting the relationship between non blocking atomic commitment and consensus. In Proceedings of the 9th International Workshop on Distributed Algorithms (Sept.). Springer-Verlag, New York, pp. 87-100. Google ScholarGoogle Scholar
  28. HADZILACOS, V., AND TOUEG, S. 1993. Fault-tolerant broadcasts and related problems. In Distributed Systems, Chap. 5, S. J. MUELENDER, Ed, Addison-Wesley, Reading, Mass., pp. 97-145. Google ScholarGoogle Scholar
  29. HADZILACOS, V., AND TOUEG, S. 1994. A modular approach to fault-tolerant broadcasts and related problems. Tech. Rep. 94-1425 (May), Computer Science Department, Cornell University, Ithaca. N.Y. Available by anonymous ftp from ftp://ftp.db.toronto.edu/pub/vassos/fault.tolerant. broadcasts.dvi.Z. (An earlier version is also available in Hadzilacos and Toueg {1993}). Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. HALPERN, J. Y., AND MOSES, Y. 1990. Knowledge and common knowledge in a distributed environment. J. ACM 37, 3 (July), 549-587. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. LAMPORT, L. 1978. The implementation of reliable distributed multiprocess systems. Comput. Netw. 2, 95-114.Google ScholarGoogle Scholar
  32. LAMPORT, L., SHOSTAK, R., AND PEASE, M. 1982. The Byzantine generals problem. ACM Trans. Prog. Lang. Syst. 4, 3 (July), 382-401. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Lo, W. K., AND HADZILACOS, V. 1994. Using failure detectors to solve consensus in asynchronous shared-memory systems. In Proceedings of the 8th International Workshop on Distributed Algorithms (Sept.), Springer-Verlag, New York, pp. 280-295. Available from ftp://ftp.db.toronto.edu/pub/ vassos/failure.detectors.shared.memory.ps.Z. Google ScholarGoogle Scholar
  34. Loul, M., AND ABU-AMARA. 1987. Memory requirements for agreement among unreliable asynchronous processes. Adv. Comput. Res. 4, 163-183.Google ScholarGoogle Scholar
  35. MOSES, Y., DOLEV, D., AND HALPERN, J. Y. 1986. Cheating husbands and other stories: a case study of knowledge, action, and communication. Distrib. Comput. 1, 3, 167-176.Google ScholarGoogle ScholarCross RefCross Ref
  36. MULLENDER, S. J., ED. 1987. The Amoeba Distributed Operating System. Selected papers 1984-1987. Centre for Mathematics and Computer Science.Google ScholarGoogle Scholar
  37. NEIGER, G. 1995. Failure detectors and the wait-free hierarchy. In Proceedings of the 14th ACM Symposium on Principles of Distributed Computing (Ottawa, Ont. Canada, Aug.). ACM, New York, pp. 100-109. Google ScholarGoogle Scholar
  38. NEtGER, G., AND TOUEG, S. 1990. Automatically increasing the fault-tolerance of distributed algorithms. J. Algorithms 11, 3 (Sept.), 374-419. Google ScholarGoogle Scholar
  39. PEASE, M., SHOSTAK, R., AND LAMPORT, L. 1980. Reaching agreement in the presence of faults. J. ACM 27, 2 (Apr.), 228-234. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. PETERSON, L. L., BUCHOLZ, N. C., AND SCHLICHTING, R.D. 1989. Preserving and using context information in interprocess communication. ACM Trans. Comput. Syst. 7, 3 (Aug.), 217-246. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. PITTELLI, F., AND GARCIA-MOLINA, H. 1989. Reliable scheduling in a tmr database system. ACM Trans. Comput. Syst. 7, I (Feb.), 25-60. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. POWELL, D., ED. 1991. Delta-4: A Generic Architecture for Dependable Distributed Computing. Springer-Verlag, New York. Google ScholarGoogle Scholar
  43. REISCHUK, R. 1982. A new solution for the Byzantine general's problem. Tech. Rep. ILl 3673 (Nov.), IBM Research Laboratory, Thomas J. Watson Research Center, Hawthorne, N.Y.Google ScholarGoogle Scholar
  44. RICCIARDI, A., AND BIRMAN, K.P. 1991. Using process groups to implement failure detection in asynchronous environments. In Proceedings of the lOth ACM Symposium on Principles of Distributed Computing (Montreal, Que., Canada, Aug. 19-21). ACM, New York, pp. 341-354. Google ScholarGoogle Scholar
  45. SABEL, L., AND MARZULLO, K. 1995. Election vs. consensus in asynchronous systems. Tech. Rep. TR95-411 (Feb.). Univ. California at San Diego. San Diego, Calif. Available at ftp://ftp.cs. cornell.edu/pub/sabel/tr94-1413.ps.Google ScholarGoogle Scholar
  46. SCHNEIDER, F.B. 1990. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Sun'. 22, 4 (Dec.), 299-319. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. WENSLEY, J. H., LAMPORT, L., GOLDBERG, J., GREEN, M. W., LEVITT, K. N., MELLIAR-SMITH, P., SHOSTAK, R. E., AND WEINSTOCK, C. B. 1978. SIFT: Design and analysis of a fault-tolerant computer for aircraft control. Proc. IEEE 66, 10 (Oct.), 1240-1255.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Unreliable failure detectors for reliable distributed systems

                  Recommendations

                  Reviews

                  Christopher D. Carothers

                  In distributed systems, the tractability of computations has been a question of much interest and the subject of much research. The consensus and atomic broadcast problems are of particular interest. Previous work by Dolev et al. [1] and Fischer et al. [2] proved that neither of these problems can be solved deterministically in an asynchronous system that is subject to even a single failed processor. This paper augments the asynchronous model of computation with unreliable failure detectors and shows that in fact these problems can solved in the presence of processor failures, thus broadening the applicability of asynchronous systems. The authors characterize failure detectors based on completeness and accuracy properties. In defining two levels of completeness and four levels of accuracy, the authors create eight classes of failure detectors. Using the concept of reduction, they show that only four of the eight classes need to be considered, and they prove how each solves the consensus problem. By establishing equivalence between atomic broadcast and consensus, the authors show that these four classes can be used to solve atomic broadcast in asynchronous systems. Overall, this is an excellent, well-written paper whose contributions “bridge the gap between impossibility results and the need for practical solutions for fault-tolerant asynchronous systems” (p. 231). It should be required reading in any distributed systems course or seminar.

                  Access critical reviews of Computing literature here

                  Become a reviewer for Computing Reviews.

                  Comments

                  Login options

                  Check if you have access through your login credentials or your institution to get full access on this article.

                  Sign in

                  Full Access

                  PDF Format

                  View or Download as a PDF file.

                  PDF

                  eReader

                  View online with eReader.

                  eReader