article

Free Access

Unreliable failure detectors for reliable distributed systems

Authors:
Tushar Deepak Chandra

IBM, Hawthorne, NY

IBM, Hawthorne, NY
View Profile

,
Sam Toueg

Cornell Univ., Ithaca, NY

Cornell Univ., Ithaca, NY
View Profile

Authors Info & Claims

Journal of the ACM Volume 43 Issue 2pp 225–267https://doi.org/10.1145/226643.226647

Published:01 March 1996Publication History

Journal of the ACM

Abstract

We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties—completeness and accuracy. We show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus, the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus [Chandra et al. 1992].

References

AMIR, ~., DOLEV, D., KRAMER, S., AND MALKI, D. 1991. Transis: A communication sub-system for high availability. Tech. Rep. CS91-13 (Nov.), Computer Science Department, The Hebrew University of Jerusalem, Jerusalem, Israel.Google Scholar
ATrlYA~ H., B^R-NoY, A., DOLEV, D., KOLLER, D., PELEG, D., AND REISCHUK, R. 1987. Achievable cases in an asynchronous environment. In Proceedings of the 28th Symposium on Foundations of Computer Science (Oct.). IEEE Computer Society Press, Washington, D.C., pp. 337-346.Google Scholar
ATTIYA, H., DWORK, C., LYNCH, N., AND STOCKMEYER, L. 1991. Bounds on the time to reach agreement Jn the presence of timing uncertainty. In Proceedings of the 23rd ,4CM Symposium on Theory of Computing (New Orleans, La., May 6-8). ACM, New York, pp. 359-369. Google Scholar
BEN-OR, M. 1983. Another advantage of free choice: Completely asynchronous agreement protocols. In Proceedings of the 2nd ACM Symposium on Principles of Distributed Computing (Montreal, Que. Canada, Aug. 17-19). ACM, New York, pp. 27-30. Google Scholar
BERMAN, P., GARAY, J. A., AND PERRY, K.J. 1989. Towards optimal distributed consensus. In Proceedings of the 30th Symposium on Foundations of Computer Science (Oct.). IEEE Computer Society Press, Washington, D.C., pp. 410-415.Google Scholar
BIRAN, O., MORA~, S., AND ZAKS, S. 1988. A combinatorial characterization of the distributed tasks which are solvable in the presence of one faulty processor. In Proceedings of the 7th ACM Symposium on Principles of Distributed Computing (Toronto, Ont., Canada, Aug. 15-17). ACM, New York, pp. 263-275. Google Scholar
BIRMAN, K. P., COOPER, R., JOSEPH, T. A., KANE, K. P., AND SCHMUCK, F. B. 1990. lsis--A Distributed Programming Environment.Google Scholar
BIRMAN, K. P., AND JOSEPH, T.A. 1987. Reliable communication in the presence of failures. ACM Trans. Comput. Syst. 5, 1 (Feb.), 47-76. Google ScholarDigital Library
BRACHA, G., AND TOUEG, S. 1985. Asynchronous consensus and broadcast protocols. J. ACM 32, 4 (Oct.), 824-840. Google ScholarDigital Library
BRIDGLAND, M. F., AND WATRO, R.J. 1987. Fault-tolerant decision making in totally asynchronous distributed systems. In Proceedings of the 6th ACM Symposium on Principles of Distributed Computing (Vancouver, B.C., Canada, Aug. 10-12). ACM, New York, pp. 52-63. Google Scholar
BUDmRAJA, N., GOPAL, A., AND TOUEG, S. 1990. Early-stopping distributed bidding and applications. In Proceedings of the 4th International Workshop on Distributed Algorithms (Sept.). Springer- Verlag, New York, pp. 301-320. Google Scholar
CHANDRA, T. D., HADZILACOS, V., AND TOUEG, S. 1992. The weakest failure detector for solving consensus. Technical Report 92-1293 (July), Department of Computer Science, Cornell University. Available from ftp://ftp.cs.cornell.edu/pub/chandra/failure.detectors.weakest.dvi. Z. A preliminary version appeared in the Proceedings of the llth ACM Symposium on Principles of Distributed Computing (Vancouver, B.C., Canada, Aug. 10-12). ACM, New York, pp. 147-158. Google Scholar
CHANDRA, T. D., HADZlLACOS, V., AND TOUEG, S. 1995. Impossibility of group membership in asynchronous systems. Tech. Rep. 95-1533. Computer Science Department, Cornell University, Ithaca, New York. Google Scholar
CHANDRA, T. D., AND LARREA, M. 1994. E-mail correspondence. Showed that cW cannot be used to solve non-blocking atomic commit.Google Scholar
CHANDRA, T. D., AND TOUEG, S. 1990. Time and message efficient reliable broadcasts. In Proceedings of the Fourth International Workshop on Distributed Algorithms (Sept.). Springer-Verlag, New York, pp. 289-300. Google Scholar
CHANG, J., AND MAXEMCHUK, N. 1984. Reliable broadcast protocols. ACM Trans. Comput. Syst. 2, 3 (Aug.), 251-273. Google ScholarDigital Library
CHOR, B., AND DWORK, C. 1989. Randomization in byzantine agreement. Adv. Comput. Res. 5, 443-497.Google Scholar
CRISTIAN, F. 1987. Issues in the design of highly available computing services. In Annual Symposium of the Canadian Information Processing Society (July), pp. 9-16. Also IBM Res. Rep. RJ5856. Thomas J. Watson Research Center, Hawthorne, N.Y.Google Scholar
CRISTIAN, F., AGHILI, H., STRONG, R., AND DOLEV, D. 1985/1989. Atomic broadcast: From simple message diffusion to Byzantine agreement. In Proceedings of the 15th International Symposium on Fauh-Tolerant Computing (June 1985), pp. 200-206. A revised version appears as IBM Research Laboratory Technical Report RJ5244 (April 1989). Thomas J. Watson Research Center, Hawthorne, N.Y.Google Scholar
CRISTIAN, F., DANCEY, R. D., AND DEHN, J. 1990. Fault-tolerance in the advanced automation system. Tech. Rep. RJ 7424 (April), IBM Research Laboratory, Thomas J. Watson Research Center, Hawthorne, N.Y.Google Scholar
DOLEV, D., DWORK, C., AND STOCKMEYER, L. 1987. On the minimal synchronism needed for distributed consensus. J. ACM 34, 1 (Jan.), 77-97. Google ScholarDigital Library
DOLEV, D., LYNCH, N. A., PINTER, S. S., STARK, E. W., AND WEIHL, W. E. 1986. Reaching approximate agreement in the presence of faults. J. ACM 33, 3 (July), 499-516. Google ScholarDigital Library
DWORK, C., LYNCH, N. A., AND STOCKMEYER, L. 1988. Consensus in the presence of partial synchrony. J. ACM 35, 2 (Apr.), 288-323. Google ScholarDigital Library
FISCHER, M.J. 1983. The consensus problem in unreliable distributed systems (a brief survey). Tech. Rep. 273 (June), Department of Computer Science, Yale University, New Haven, Conn.Google Scholar
FISCHER, M. J., LYNCH, N. A., AND PATERSON, M.S. 1985. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (Apr.), 374-382. Google ScholarDigital Library
GOPAL, A., STRONG, R., TOUEG, S., AND CRIST{AN, F. 1990. Early-delivery atomic broadcast. In Proceedings of the 9th ACM Symposium on Principles of Distributed Computing (Quebec City, Que, Canada, Aug. 22-24). ACM, New York, pp. 297-310. Google Scholar
GVERRAOt;k R. 1995. Revisiting the relationship between non blocking atomic commitment and consensus. In Proceedings of the 9th International Workshop on Distributed Algorithms (Sept.). Springer-Verlag, New York, pp. 87-100. Google Scholar
HADZILACOS, V., AND TOUEG, S. 1993. Fault-tolerant broadcasts and related problems. In Distributed Systems, Chap. 5, S. J. MUELENDER, Ed, Addison-Wesley, Reading, Mass., pp. 97-145. Google Scholar
HADZILACOS, V., AND TOUEG, S. 1994. A modular approach to fault-tolerant broadcasts and related problems. Tech. Rep. 94-1425 (May), Computer Science Department, Cornell University, Ithaca. N.Y. Available by anonymous ftp from ftp://ftp.db.toronto.edu/pub/vassos/fault.tolerant. broadcasts.dvi.Z. (An earlier version is also available in Hadzilacos and Toueg {1993}). Google ScholarDigital Library
HALPERN, J. Y., AND MOSES, Y. 1990. Knowledge and common knowledge in a distributed environment. J. ACM 37, 3 (July), 549-587. Google ScholarDigital Library
LAMPORT, L. 1978. The implementation of reliable distributed multiprocess systems. Comput. Netw. 2, 95-114.Google Scholar
LAMPORT, L., SHOSTAK, R., AND PEASE, M. 1982. The Byzantine generals problem. ACM Trans. Prog. Lang. Syst. 4, 3 (July), 382-401. Google ScholarDigital Library
Lo, W. K., AND HADZILACOS, V. 1994. Using failure detectors to solve consensus in asynchronous shared-memory systems. In Proceedings of the 8th International Workshop on Distributed Algorithms (Sept.), Springer-Verlag, New York, pp. 280-295. Available from ftp://ftp.db.toronto.edu/pub/ vassos/failure.detectors.shared.memory.ps.Z. Google Scholar
Loul, M., AND ABU-AMARA. 1987. Memory requirements for agreement among unreliable asynchronous processes. Adv. Comput. Res. 4, 163-183.Google Scholar
MOSES, Y., DOLEV, D., AND HALPERN, J. Y. 1986. Cheating husbands and other stories: a case study of knowledge, action, and communication. Distrib. Comput. 1, 3, 167-176.Google ScholarCross Ref
MULLENDER, S. J., ED. 1987. The Amoeba Distributed Operating System. Selected papers 1984-1987. Centre for Mathematics and Computer Science.Google Scholar
NEIGER, G. 1995. Failure detectors and the wait-free hierarchy. In Proceedings of the 14th ACM Symposium on Principles of Distributed Computing (Ottawa, Ont. Canada, Aug.). ACM, New York, pp. 100-109. Google Scholar
NEtGER, G., AND TOUEG, S. 1990. Automatically increasing the fault-tolerance of distributed algorithms. J. Algorithms 11, 3 (Sept.), 374-419. Google Scholar
PEASE, M., SHOSTAK, R., AND LAMPORT, L. 1980. Reaching agreement in the presence of faults. J. ACM 27, 2 (Apr.), 228-234. Google ScholarDigital Library
PETERSON, L. L., BUCHOLZ, N. C., AND SCHLICHTING, R.D. 1989. Preserving and using context information in interprocess communication. ACM Trans. Comput. Syst. 7, 3 (Aug.), 217-246. Google ScholarDigital Library
PITTELLI, F., AND GARCIA-MOLINA, H. 1989. Reliable scheduling in a tmr database system. ACM Trans. Comput. Syst. 7, I (Feb.), 25-60. Google ScholarDigital Library
POWELL, D., ED. 1991. Delta-4: A Generic Architecture for Dependable Distributed Computing. Springer-Verlag, New York. Google Scholar
REISCHUK, R. 1982. A new solution for the Byzantine general's problem. Tech. Rep. ILl 3673 (Nov.), IBM Research Laboratory, Thomas J. Watson Research Center, Hawthorne, N.Y.Google Scholar
RICCIARDI, A., AND BIRMAN, K.P. 1991. Using process groups to implement failure detection in asynchronous environments. In Proceedings of the lOth ACM Symposium on Principles of Distributed Computing (Montreal, Que., Canada, Aug. 19-21). ACM, New York, pp. 341-354. Google Scholar
SABEL, L., AND MARZULLO, K. 1995. Election vs. consensus in asynchronous systems. Tech. Rep. TR95-411 (Feb.). Univ. California at San Diego. San Diego, Calif. Available at ftp://ftp.cs. cornell.edu/pub/sabel/tr94-1413.ps.Google Scholar
SCHNEIDER, F.B. 1990. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Sun'. 22, 4 (Dec.), 299-319. Google ScholarDigital Library
WENSLEY, J. H., LAMPORT, L., GOLDBERG, J., GREEN, M. W., LEVITT, K. N., MELLIAR-SMITH, P., SHOSTAK, R. E., AND WEINSTOCK, C. B. 1978. SIFT: Design and analysis of a fault-tolerant computer for aircraft control. Proc. IEEE 66, 10 (Oct.), 1240-1255.Google ScholarCross Ref

Index Terms

Unreliable failure detectors for reliable distributed systems

Recommendations

The weakest failure detector for solving consensus

We determine what information about failures is necessary and sufficient to solve Consensus in asynchronous distributed systems subject to crash failures. In Chandra and Toueg [1996], it is shown that W, a failure detector that provides surprisingly ...
Read More
Failure Detection and Randomization: A Hybrid Approach to Solve Consensus

We present a consensus algorithm that combines unreliable failure detection and randomization, two well-known techniques for solving consensus in asynchronous systems with crash failures. This hybrid algorithm combines advantages from both approaches:...
Read More
On Quiescent Reliable Communication

We study the problem of achieving reliable communication with quiescent algorithms (i.e., algorithms that eventually stop sending messages) in asynchronous systems with process crashes and lossy links. We first show that it is impossible to solve this ...
Read More

Reviews

Reviewer: Christopher D. Carothers

In distributed systems, the tractability of computations has been a question of much interest and the subject of much research. The consensus and atomic broadcast problems are of particular interest. Previous work by Dolev et al. [1] and Fischer et al. [2] proved that neither of these problems can be solved deterministically in an asynchronous system that is subject to even a single failed processor. This paper augments the asynchronous model of computation with unreliable failure detectors and shows that in fact these problems can solved in the presence of processor failures, thus broadening the applicability of asynchronous systems. The authors characterize failure detectors based on completeness and accuracy properties. In defining two levels of completeness and four levels of accuracy, the authors create eight classes of failure detectors. Using the concept of reduction, they show that only four of the eight classes need to be considered, and they prove how each solves the consensus problem. By establishing equivalence between atomic broadcast and consensus, the authors show that these four classes can be used to solve atomic broadcast in asynchronous systems. Overall, this is an excellent, well-written paper whose contributions “bridge the gap between impossibility results and the need for practical solutions for fault-tolerant asynchronous systems” (p. 231). It should be required reading in any distributed systems course or seminar.

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Journal of the ACM Volume 43, Issue 2
March 1996
205 pages
ISSN:0004-5411
EISSN:1557-735X
DOI:10.1145/226643
Editor:
F. T. Leighton
Massachusetts Institute of Technology, Cambridge, MA
Issue’s Table of Contents
Copyright © 1996 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 March 1996
Published in jacm Volume 43, Issue 2

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Byzantine Generals' problem
agreement problem
asynchronous systems
atomic broadcast
commit problem
consensus problem
crash failures
failure detection
fault-tolerance
message passing
partial synchrony
processor failures
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 1,804
  Total Citations
  View Citations
- 9,082
  Total Downloads
- Downloads (Last 12 months)813
- Downloads (Last 6 weeks)96
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Unreliable failure detectors for reliable distributed systems

Journal of the ACM

Abstract

References

Cited By

Index Terms

Recommendations

The weakest failure detector for solving consensus

Failure Detection and Randomization: A Hybrid Approach to Solve Consensus

On Quiescent Reliable Communication

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Unreliable failure detectors for reliable distributed systems

Journal of the ACM

Abstract

References

Cited By

Index Terms

Recommendations

The weakest failure detector for solving consensus

Failure Detection and Randomization: A Hybrid Approach to Solve Consensus

On Quiescent Reliable Communication

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media