skip to main content
article

Self-stabilizing clock synchronization in the presence of Byzantine faults

Published:01 September 2004Publication History
Skip Abstract Section

Abstract

We initiate a study of bounded clock synchronization under a more severe fault model than that proposed by Lamport and Melliar-Smith [1985]. Realistic aspects of the problem of synchronizing clocks in the presence of faults are considered. One aspect is that clock synchronization is an on-going task, thus the assumption that some of the processors never fail is too optimistic. To cope with this reality, we suggest self-stabilizing protocols that stabilize in any (long enough) period in which less than a third of the processors are faulty. Another aspect is that the clock value of each processor is bounded. A single transient fault may cause the clock to reach the upper bound. Therefore, we suggest a bounded clock that wraps around when appropriate.We present two randomized self-stabilizing protocols for synchronizing bounded clocks in the presence of Byzantine processor failures. The first protocol assumes that processors have a common pulse, while the second protocol does not. A new type of distributed counter based on the Chinese remainder theorem is used as part of the first protocol.

References

  1. Arora, A., Dolev, S., and Gouda, M. 1991. Maintaining digital clocks in step. Parall. Proc. Lett. 1, 1, 11--18.Google ScholarGoogle Scholar
  2. Daliot, A., Dolev, D., and Parnas, H. 2002. Self-stabilizing pulse synchronization inspired by biological pacemaker. In Proceedings of the 6th International Symposium on Self-Stabilizing Systems. Lecture Notes in Computer Science, Vol. 2704. Springer-Verlag, New York, 32--48. Google ScholarGoogle Scholar
  3. Dijkstra, E. W. 1974. Self stabilizing systems in spite of distributed control. Commun. ACM 17, 643--644. Google ScholarGoogle Scholar
  4. Dolev, D., Halpern, J. Y., and Strong, H. R. 1986. On the possibility and impossibility of achieving clock synchronization. J. Comput. Syst. Sci. 32, 2, 230--250. Google ScholarGoogle Scholar
  5. Dolev, S., Israeli, A., and Moran, S. 1991. Uniform dynamic self stabilizing leader election. In Proceedings of the 5th International Workshop on Distributed Algorithms. 167--180. Google ScholarGoogle Scholar
  6. Dolev, S., Israeli, A., and Moran, S. 1995. Analyzing expected time by scheduler-luck games. IEEE Trans. Softw. Eng. 21, 5 (May). Google ScholarGoogle Scholar
  7. Dolev, D., Lynch, N. A., Pinter, S. S., Stark, E. W., and Weihl, W. E. 1986. Reaching approximate agreement in the presence of faults. In J. ACM 33, 499--516. Google ScholarGoogle Scholar
  8. Dolev, S. and Welch, J. L. 1993. Wait-free clock synchronization. In Proceedings of the 12th ACM Symposium on Principles of Distributed Computing. ACM, New York, 97--108. Also appeared in Algorithmica 18, 1997, 486--511. Google ScholarGoogle Scholar
  9. Cristian, F. 1989. Probabilistic clock synchronization. Distrib. Comput. 3, 146--158.Google ScholarGoogle Scholar
  10. Gopal, A. S. and Perry, K. J. 1993. Unifying self-stabilization and fault-tolerance. In Proceedings of the 12th ACM Symposium on Principles of Distributed Computing. ACM, New York, 195--206. Google ScholarGoogle Scholar
  11. Halpern, J. Simons, B., Strong, R., and Dolev, D. 1984. Fault-tolerant clock synchronization. In Proceedings of the 3rd ACM Symposium on Principles of Distributed Computing. ACM, New York, 89--102. Google ScholarGoogle Scholar
  12. Knuth, D. E. 1981. The Art of Computer Programming, Vol. 2, 2nd ed., Addison-Wesley, Reading, Mass. Google ScholarGoogle Scholar
  13. Lamport, L. and Melliar-Smith, P. M. 1985. Synchronizing clocks in the presence of faults. J. ACM 32, 1, 1--36. Google ScholarGoogle Scholar
  14. Lamport, L., Shostak, R., and Pease, M. 1982. The Byzantine generals problem. ACM Trans. Prog. Lang. Syst. 4, 3 (July), 382--401. Google ScholarGoogle Scholar
  15. Lundelius, J., and Lynch, N. 1984. An upper and lower bound for clock synchronization. Info. Cont. 62, 190--204.Google ScholarGoogle Scholar
  16. Mahaney, S., and Schneider, F.1985. Inexact agreement: Accuracy, precision and graceful degradation. In Proceedings of the 4th ACM Symposium on Principles of Distributed Computing. ACM, New York, 237--249. Google ScholarGoogle Scholar
  17. Ramanathan, P., Shin, K. G., and Butler, R. W. 1990. Fault-tolerant clock synchronization in distributed systems. IEEE Comput. (Oct), 33--42. Google ScholarGoogle Scholar
  18. Srikanth, T. K. and Toueg, S. 1987. Optimal clock synchronization. J. ACM 34, 3, 626--645. Google ScholarGoogle Scholar
  19. Szabo, S., and Tanaka, R. I. 1967. Residue Arithmetic and its Applications to Computer Technology, McGraw-Hill, New York.Google ScholarGoogle Scholar
  20. Welch, J. L., and Lynch, N. 1988. A new fault-tolerant algorithm for clock synchronization. Info. Comput. 77, 1, 1--36. Google ScholarGoogle Scholar
  21. Wensley, J. H., Lamport, L., Goldberg, J., Green, M. W., Levitt, K. N., Melliar-Smith, P. M., Shostak, R. E., and Weinstock, C. B.1978. SIFT: Design and analysis of fault-tolerant computer for aircraft control. Proc. IEEE 66, 10, 1240--1255.Google ScholarGoogle Scholar

Index Terms

  1. Self-stabilizing clock synchronization in the presence of Byzantine faults

                  Recommendations

                  Reviews

                  Bayard Kohlhepp

                  The devil is in the details. Twenty-five million Europeans died from the bubonic plague, the "Black Death," in the Middle Ages. The European outbreaks have been traced to a single Italian merchant ship that picked up goods from China during an outbreak there; it was unknown in Europe beforehand. Mighty Athens fell to the same plague centuries earlier, when its citizens crowded together in the city during a Spartan siege. What triggered these outbreaks__?__ Fleas, which transmitted the plague from rat hosts to humans. We now have computer systems reaching around the globe, spreading themselves farther and thinner (as in cell phones, and other handhelds). Is there some small detail that could bring them crashing down__?__ The integrity of every transactional system, whether it's a small database, or the largest business-to-business (B2B) network, depends on clocks. If the order of transactions cannot be determined, a transactional system can't function. Imagine transactions being sent to a bank account. They can produce very different real-world results if you vary the order in which you apply them. Synchronized clocks are the heartbeat of distributed systems. This paper was published in the Proceedings of the 14th Annual ACM Symposium on Principles of Distributed Computing , in August 1995, as "Self-stabilizing clock synchronization with Byzantine faults." I could only find a one-page summary when I looked it up in the ACM Digital Library today, but I used the entire paper a few years ago to design some middleware. Dolev and Welch build on the work of Lamport and Melliar-Smith [1]. They make several distinct contributions. First, they expand the robustness of the recovery protocol, to function with up to one-third of the processors known or suspected to be faulty. Two, they also remove the specification of unbounded clocks that was present in previous work, which makes their protocols usable on real-world systems. Third, they make the protocol self-stabilizing, so that no initial state has to be globally asserted, and to allow resynchronization (recovery) after transient faults, like more than a third of the processors becoming temporarily faulty. Finally, they present two protocols, one of which is dependent on a shared pulse. The article is 19 pages long, very readable, and, as mentioned above, was used to create real software. Online Computing Reviews Service

                  Veronica Lagrange

                  Dolev and Welch describe two self-stabilizing randomized protocols to synchronize distributed clocks in a fault tolerant environment. The protocols will synchronize, in bounded time, a distributed system in any initial state, as long as less than one-third of the processors are faulty. The emphasis of the paper is in bringing the system back to a synchronous, reliable state from a faulty, unsynchronized one. The first protocol, called synchronous, assumes the existence of a common pulse that propagates throughout all processors at a constant interval. This pulse is the trigger that each processor uses to compare its clock with all the others, and execute the synchronization steps, if necessary. The second protocol, called semi-synchronous, drops the assumption of a common pulse. This second protocol has a rather complicated proof, again mostly aimed at showing that this protocol will synchronize all nonfaulty processors after any initially faulty scenario. One important feature of fault tolerant environments, apparently overlooked in the second protocol, is that the clock of a nonfaulty processor should not be moved backward; this will cause the timestamp of a later event to be smaller than the timestamp of an earlier event. Online Computing Reviews Service

                  Access critical reviews of Computing literature here

                  Become a reviewer for Computing Reviews.

                  Comments

                  Login options

                  Check if you have access through your login credentials or your institution to get full access on this article.

                  Sign in

                  Full Access

                  PDF Format

                  View or Download as a PDF file.

                  PDF

                  eReader

                  View online with eReader.

                  eReader