Abstract
We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties—completeness and accuracy. We show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus, the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus [Chandra et al. 1992].
- AMIR, ~., DOLEV, D., KRAMER, S., AND MALKI, D. 1991. Transis: A communication sub-system for high availability. Tech. Rep. CS91-13 (Nov.), Computer Science Department, The Hebrew University of Jerusalem, Jerusalem, Israel.Google Scholar
- ATrlYA~ H., B^R-NoY, A., DOLEV, D., KOLLER, D., PELEG, D., AND REISCHUK, R. 1987. Achievable cases in an asynchronous environment. In Proceedings of the 28th Symposium on Foundations of Computer Science (Oct.). IEEE Computer Society Press, Washington, D.C., pp. 337-346.Google Scholar
- ATTIYA, H., DWORK, C., LYNCH, N., AND STOCKMEYER, L. 1991. Bounds on the time to reach agreement Jn the presence of timing uncertainty. In Proceedings of the 23rd ,4CM Symposium on Theory of Computing (New Orleans, La., May 6-8). ACM, New York, pp. 359-369. Google Scholar
- BEN-OR, M. 1983. Another advantage of free choice: Completely asynchronous agreement protocols. In Proceedings of the 2nd ACM Symposium on Principles of Distributed Computing (Montreal, Que. Canada, Aug. 17-19). ACM, New York, pp. 27-30. Google Scholar
- BERMAN, P., GARAY, J. A., AND PERRY, K.J. 1989. Towards optimal distributed consensus. In Proceedings of the 30th Symposium on Foundations of Computer Science (Oct.). IEEE Computer Society Press, Washington, D.C., pp. 410-415.Google Scholar
- BIRAN, O., MORA~, S., AND ZAKS, S. 1988. A combinatorial characterization of the distributed tasks which are solvable in the presence of one faulty processor. In Proceedings of the 7th ACM Symposium on Principles of Distributed Computing (Toronto, Ont., Canada, Aug. 15-17). ACM, New York, pp. 263-275. Google Scholar
- BIRMAN, K. P., COOPER, R., JOSEPH, T. A., KANE, K. P., AND SCHMUCK, F. B. 1990. lsis--A Distributed Programming Environment.Google Scholar
- BIRMAN, K. P., AND JOSEPH, T.A. 1987. Reliable communication in the presence of failures. ACM Trans. Comput. Syst. 5, 1 (Feb.), 47-76. Google ScholarDigital Library
- BRACHA, G., AND TOUEG, S. 1985. Asynchronous consensus and broadcast protocols. J. ACM 32, 4 (Oct.), 824-840. Google ScholarDigital Library
- BRIDGLAND, M. F., AND WATRO, R.J. 1987. Fault-tolerant decision making in totally asynchronous distributed systems. In Proceedings of the 6th ACM Symposium on Principles of Distributed Computing (Vancouver, B.C., Canada, Aug. 10-12). ACM, New York, pp. 52-63. Google Scholar
- BUDmRAJA, N., GOPAL, A., AND TOUEG, S. 1990. Early-stopping distributed bidding and applications. In Proceedings of the 4th International Workshop on Distributed Algorithms (Sept.). Springer- Verlag, New York, pp. 301-320. Google Scholar
- CHANDRA, T. D., HADZILACOS, V., AND TOUEG, S. 1992. The weakest failure detector for solving consensus. Technical Report 92-1293 (July), Department of Computer Science, Cornell University. Available from ftp://ftp.cs.cornell.edu/pub/chandra/failure.detectors.weakest.dvi. Z. A preliminary version appeared in the Proceedings of the llth ACM Symposium on Principles of Distributed Computing (Vancouver, B.C., Canada, Aug. 10-12). ACM, New York, pp. 147-158. Google Scholar
- CHANDRA, T. D., HADZlLACOS, V., AND TOUEG, S. 1995. Impossibility of group membership in asynchronous systems. Tech. Rep. 95-1533. Computer Science Department, Cornell University, Ithaca, New York. Google Scholar
- CHANDRA, T. D., AND LARREA, M. 1994. E-mail correspondence. Showed that cW cannot be used to solve non-blocking atomic commit.Google Scholar
- CHANDRA, T. D., AND TOUEG, S. 1990. Time and message efficient reliable broadcasts. In Proceedings of the Fourth International Workshop on Distributed Algorithms (Sept.). Springer-Verlag, New York, pp. 289-300. Google Scholar
- CHANG, J., AND MAXEMCHUK, N. 1984. Reliable broadcast protocols. ACM Trans. Comput. Syst. 2, 3 (Aug.), 251-273. Google ScholarDigital Library
- CHOR, B., AND DWORK, C. 1989. Randomization in byzantine agreement. Adv. Comput. Res. 5, 443-497.Google Scholar
- CRISTIAN, F. 1987. Issues in the design of highly available computing services. In Annual Symposium of the Canadian Information Processing Society (July), pp. 9-16. Also IBM Res. Rep. RJ5856. Thomas J. Watson Research Center, Hawthorne, N.Y.Google Scholar
- CRISTIAN, F., AGHILI, H., STRONG, R., AND DOLEV, D. 1985/1989. Atomic broadcast: From simple message diffusion to Byzantine agreement. In Proceedings of the 15th International Symposium on Fauh-Tolerant Computing (June 1985), pp. 200-206. A revised version appears as IBM Research Laboratory Technical Report RJ5244 (April 1989). Thomas J. Watson Research Center, Hawthorne, N.Y.Google Scholar
- CRISTIAN, F., DANCEY, R. D., AND DEHN, J. 1990. Fault-tolerance in the advanced automation system. Tech. Rep. RJ 7424 (April), IBM Research Laboratory, Thomas J. Watson Research Center, Hawthorne, N.Y.Google Scholar
- DOLEV, D., DWORK, C., AND STOCKMEYER, L. 1987. On the minimal synchronism needed for distributed consensus. J. ACM 34, 1 (Jan.), 77-97. Google ScholarDigital Library
- DOLEV, D., LYNCH, N. A., PINTER, S. S., STARK, E. W., AND WEIHL, W. E. 1986. Reaching approximate agreement in the presence of faults. J. ACM 33, 3 (July), 499-516. Google ScholarDigital Library
- DWORK, C., LYNCH, N. A., AND STOCKMEYER, L. 1988. Consensus in the presence of partial synchrony. J. ACM 35, 2 (Apr.), 288-323. Google ScholarDigital Library
- FISCHER, M.J. 1983. The consensus problem in unreliable distributed systems (a brief survey). Tech. Rep. 273 (June), Department of Computer Science, Yale University, New Haven, Conn.Google Scholar
- FISCHER, M. J., LYNCH, N. A., AND PATERSON, M.S. 1985. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (Apr.), 374-382. Google ScholarDigital Library
- GOPAL, A., STRONG, R., TOUEG, S., AND CRIST{AN, F. 1990. Early-delivery atomic broadcast. In Proceedings of the 9th ACM Symposium on Principles of Distributed Computing (Quebec City, Que, Canada, Aug. 22-24). ACM, New York, pp. 297-310. Google Scholar
- GVERRAOt;k R. 1995. Revisiting the relationship between non blocking atomic commitment and consensus. In Proceedings of the 9th International Workshop on Distributed Algorithms (Sept.). Springer-Verlag, New York, pp. 87-100. Google Scholar
- HADZILACOS, V., AND TOUEG, S. 1993. Fault-tolerant broadcasts and related problems. In Distributed Systems, Chap. 5, S. J. MUELENDER, Ed, Addison-Wesley, Reading, Mass., pp. 97-145. Google Scholar
- HADZILACOS, V., AND TOUEG, S. 1994. A modular approach to fault-tolerant broadcasts and related problems. Tech. Rep. 94-1425 (May), Computer Science Department, Cornell University, Ithaca. N.Y. Available by anonymous ftp from ftp://ftp.db.toronto.edu/pub/vassos/fault.tolerant. broadcasts.dvi.Z. (An earlier version is also available in Hadzilacos and Toueg {1993}). Google ScholarDigital Library
- HALPERN, J. Y., AND MOSES, Y. 1990. Knowledge and common knowledge in a distributed environment. J. ACM 37, 3 (July), 549-587. Google ScholarDigital Library
- LAMPORT, L. 1978. The implementation of reliable distributed multiprocess systems. Comput. Netw. 2, 95-114.Google Scholar
- LAMPORT, L., SHOSTAK, R., AND PEASE, M. 1982. The Byzantine generals problem. ACM Trans. Prog. Lang. Syst. 4, 3 (July), 382-401. Google ScholarDigital Library
- Lo, W. K., AND HADZILACOS, V. 1994. Using failure detectors to solve consensus in asynchronous shared-memory systems. In Proceedings of the 8th International Workshop on Distributed Algorithms (Sept.), Springer-Verlag, New York, pp. 280-295. Available from ftp://ftp.db.toronto.edu/pub/ vassos/failure.detectors.shared.memory.ps.Z. Google Scholar
- Loul, M., AND ABU-AMARA. 1987. Memory requirements for agreement among unreliable asynchronous processes. Adv. Comput. Res. 4, 163-183.Google Scholar
- MOSES, Y., DOLEV, D., AND HALPERN, J. Y. 1986. Cheating husbands and other stories: a case study of knowledge, action, and communication. Distrib. Comput. 1, 3, 167-176.Google ScholarCross Ref
- MULLENDER, S. J., ED. 1987. The Amoeba Distributed Operating System. Selected papers 1984-1987. Centre for Mathematics and Computer Science.Google Scholar
- NEIGER, G. 1995. Failure detectors and the wait-free hierarchy. In Proceedings of the 14th ACM Symposium on Principles of Distributed Computing (Ottawa, Ont. Canada, Aug.). ACM, New York, pp. 100-109. Google Scholar
- NEtGER, G., AND TOUEG, S. 1990. Automatically increasing the fault-tolerance of distributed algorithms. J. Algorithms 11, 3 (Sept.), 374-419. Google Scholar
- PEASE, M., SHOSTAK, R., AND LAMPORT, L. 1980. Reaching agreement in the presence of faults. J. ACM 27, 2 (Apr.), 228-234. Google ScholarDigital Library
- PETERSON, L. L., BUCHOLZ, N. C., AND SCHLICHTING, R.D. 1989. Preserving and using context information in interprocess communication. ACM Trans. Comput. Syst. 7, 3 (Aug.), 217-246. Google ScholarDigital Library
- PITTELLI, F., AND GARCIA-MOLINA, H. 1989. Reliable scheduling in a tmr database system. ACM Trans. Comput. Syst. 7, I (Feb.), 25-60. Google ScholarDigital Library
- POWELL, D., ED. 1991. Delta-4: A Generic Architecture for Dependable Distributed Computing. Springer-Verlag, New York. Google Scholar
- REISCHUK, R. 1982. A new solution for the Byzantine general's problem. Tech. Rep. ILl 3673 (Nov.), IBM Research Laboratory, Thomas J. Watson Research Center, Hawthorne, N.Y.Google Scholar
- RICCIARDI, A., AND BIRMAN, K.P. 1991. Using process groups to implement failure detection in asynchronous environments. In Proceedings of the lOth ACM Symposium on Principles of Distributed Computing (Montreal, Que., Canada, Aug. 19-21). ACM, New York, pp. 341-354. Google Scholar
- SABEL, L., AND MARZULLO, K. 1995. Election vs. consensus in asynchronous systems. Tech. Rep. TR95-411 (Feb.). Univ. California at San Diego. San Diego, Calif. Available at ftp://ftp.cs. cornell.edu/pub/sabel/tr94-1413.ps.Google Scholar
- SCHNEIDER, F.B. 1990. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Sun'. 22, 4 (Dec.), 299-319. Google ScholarDigital Library
- WENSLEY, J. H., LAMPORT, L., GOLDBERG, J., GREEN, M. W., LEVITT, K. N., MELLIAR-SMITH, P., SHOSTAK, R. E., AND WEINSTOCK, C. B. 1978. SIFT: Design and analysis of a fault-tolerant computer for aircraft control. Proc. IEEE 66, 10 (Oct.), 1240-1255.Google ScholarCross Ref
Index Terms
- Unreliable failure detectors for reliable distributed systems
Recommendations
The weakest failure detector for solving consensus
We determine what information about failures is necessary and sufficient to solve Consensus in asynchronous distributed systems subject to crash failures. In Chandra and Toueg [1996], it is shown that W, a failure detector that provides surprisingly ...
Failure Detection and Randomization: A Hybrid Approach to Solve Consensus
We present a consensus algorithm that combines unreliable failure detection and randomization, two well-known techniques for solving consensus in asynchronous systems with crash failures. This hybrid algorithm combines advantages from both approaches:...
On Quiescent Reliable Communication
We study the problem of achieving reliable communication with quiescent algorithms (i.e., algorithms that eventually stop sending messages) in asynchronous systems with process crashes and lossy links. We first show that it is impossible to solve this ...
Comments