Abstract
Today’s hardware technology presents a new challenge in designing robust systems. Deep submicron VLSI technology introduces transient and permanent faults that were never considered in low-level system designs in the past. Still, robustness of that part of the system is crucial and needs to be guaranteed for any successful product. Distributed systems, on the other hand, have been dealing with similar issues for decades. However, neither the basic abstractions nor the complexity of contemporary fault-tolerant distributed algorithms match the peculiarities of hardware implementations.
This article is intended to be part of an attempt striving to bridge over this gap between theory and practice for the clock synchronization problem. Solving this task sufficiently well will allow to build an ultra-robust high-precision clocking system for hardware designs like systems-on-chips in critical applications. As our first building block, we describe and prove correct a novel distributed, Byzantine fault-tolerant, probabilistically self-stabilizing pulse synchronization protocol, called FATAL, that can be implemented using standard asynchronous digital logic: Correct FATAL nodes are guaranteed to generate pulses (i.e., unnumbered clock ticks) in a synchronized way, despite a certain fraction of nodes being faulty. FATAL uses randomization only during stabilization and, despite the strict limitations introduced by hardware designs, offers optimal resilience and smaller complexity than all existing protocols. Finally, we show how to leverage FATAL to efficiently generate synchronized, self-stabilizing, high-frequency clocks.
- Ajtai, M., Komlós, J., and Szemerédi, E. 1983. An O(n log n) sorting network. In Proceedings of the 15th Symposium on Theory of Computing (STOC). 1--9. Google ScholarDigital Library
- Attiya, H. and Censor, K. 2008. Lower bounds for randomized consensus under a weak adversary. In Proceedings of the 27th Symposium on Principles of Distributed Computing (PODC). 315--324. Google ScholarDigital Library
- Ben-Or, M., Dolev, D., and Hoch, E. N. 2008. Fast self-stabilizing byzantine tolerant digital clock synchronization. In Proceedings of the 27th Symposium on Principles of Distributed Computing (PODC). 385--394. Google ScholarDigital Library
- Bhamidipati, R., Zaidi, A., Makineni, S., Low, K., Chen, R., Liu, K.-Y., and Dalgrehn, J. 2002. Challenges and methodologies for implementing high-performance network processors. Intel Technol. J. 6, 3, 83--92.Google Scholar
- Biaz, S. and Welch, J. 2001. Closed form bounds for clock synchronization under simple uncertainty assumptions. Inf. Process. Lett. 80, 3, 151--157. Google ScholarDigital Library
- Chapiro, D. M. 1984. Globally-asynchronous locally-synchronous systems. Ph.D. thesis, Stanford University. Google ScholarDigital Library
- Constantinescu, C. 2003. Trends and challenges in VLSI circuit reliability. IEEE Micro 23, 4, 14--19. Google ScholarDigital Library
- Daliot, A. and Dolev, D. 2006. Self-stabilizing byzantine pulse synchronization. Comput. Res. Repository abs/cs/0608092.Google Scholar
- Daliot, A., Dolev, D., and Parnas, H. 2003. Self-stabilizing pulse synchronization inspired by biological pacemaker networks. In Proceedings of the SSS. 32--48. Google ScholarDigital Library
- Dike, C. and Burton, E. 1999. Miller and noise effects in a synchronizing flip-flop. IEEE J. Solid-State Circ. SC-34, 6, 849--855.Google Scholar
- Dolev, D. 1982. The Byzantine generals strike again. J. Algor. 3, 14--30.Google ScholarCross Ref
- Dolev, D., Fuegger, M., Lenzen, C., Posch, M., Schmid, U., and Steininger, A. 2014. Rigorously modeling self-stabilizing fault-tolerant circuits: An ultra-robust clocking scheme for systems-on-chip. J. Comput. Syst. Sci. 80, 4, 860--900. Google ScholarDigital Library
- Dolev, D. and Hoch, E. 2007. Byzantine self-stabilizing pulse in a bounded-delay model. In Proceedings of the 9th Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS). Vol. 4280, 350--362. Google ScholarDigital Library
- Dolev, D., Lynch, N. A., Pinter, S. S., Stark, E. W., and Weihl, W. E. 1986. Reaching approximate agreement in the presence of faults. J. ACM 33, 499--516. Google ScholarDigital Library
- Dolev, S. and Welch, J. L. 2004. Self-stabilizing clock synchronization in the presence of Byzantine faults. J. ACM 51, 5, 780--799. Google ScholarDigital Library
- Fischer, M. J. and Lynch, N. A. 1982. A lower bound for the time to assure interactive consistency. Inf. Process. Lett. 14, 183--186.Google ScholarCross Ref
- Fischer, M. J., Lynch, N. A., and Merritt, M. 1985. Easy impossibility proofs for distributed consensus problems. In Proceedings of the 4th Conference of Principles of Distributed Computing (PODC). 59--70. Google ScholarDigital Library
- Friedman, E. G. 2001. Clock distribution networks in synchronous digital integrated circuits. Proc. IEEE 89, 5, 665--692.Google ScholarCross Ref
- Fuchs, G., Függer, M., and Steininger, A. 2009. On the threat of metastability in an asynchronous fault-tolerant clock generation scheme. In Proceedings of the 15th Symposium on Asynchronous Circuits and Systems (ASYNC) (Chapel Hill, N. Carolina). 127--136. Google ScholarDigital Library
- Függer, M. 2010. Analysis of on-chip fault-tolerant distributed algorithms. Ph.D. thesis, Technische Universität Wien, Institut für Technische Informatik.Google Scholar
- Függer, M., Dielacher, A., and Schmid, U. 2010. How to speed-up fault-tolerant clock generation in VLSI systems-on-chip via pipelining. In Proceedings of the 8th European Dependable Computing Conference (EDCC). 230--239. Google ScholarDigital Library
- Függer, M. and Schmid, U. 2012. Reconciling fault-tolerant distributed computing and systems-on-chip. Distrib. Comput. 24, 6, 323--355.Google ScholarDigital Library
- Függer, M., Schmid, U., Fuchs, G., and Kempf, G. 2006. Fault-tolerant distributed clock generation in VLSI systems-on-chip. In Proceedings of the 6th European Dependable Computing Conference (EDCC). 87--96. Google ScholarDigital Library
- Gadlage, M. J., Eaton, P. H., Benedetto, J. M., Carts, M., Zhu, V., and Turflinger, T. L. 2006. Digital device error rate trends in advanced CMOS technologies. IEEE Trans. Nuclear Sci. 53, 6, 3466--3471.Google ScholarCross Ref
- Hoch, E., Dolev, D., and Daliot, A. 2006. Self-stabilizing byzantine digital clock synchronization. In Proceedings of the 8th Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS’06). Vol. 4280, 350--362. Google ScholarDigital Library
- International Technology Roadmap for Semiconductors 2012. International Technology Roadmap for Semiconductors. http://www.itrs.net.Google Scholar
- Kinniment, D. J., Bystrov, A., and Yakovlev, A. V. 2002. Synchronization circuit performance. IEEE J. Solid-State Circ. SC-37, 2, 202--209.Google Scholar
- Kuhn, F., Lenzen, C., Locher, T., and Oshman, R. 2010. Optimal gradient clock synchronization in dynamic networks. In Proceedings of the 29th Symposium on Principles of Distributed Computing (PODC). Google ScholarDigital Library
- Lenzen, C., Locher, T., and Wattenhofer, R. 2010. Tight bounds for clock synchronization. In J. ACM 57, 2. Google ScholarDigital Library
- Lundelius, J. and Lynch, N. 1984. An upper and lower bound for clock synchronization. Inf. Control 62, 2--3, 190--204.Google ScholarCross Ref
- Malekpour, M. 2006. A Byzantine-fault tolerant self-stabilizing protocol for distributed clock synchronization systems. In Proceedings of the 9th Conference on Stabilization, Safety, and Security of Distributed Systems (SSS). 411--427. Google ScholarDigital Library
- Malekpour, M. 2009. A self-stabilizing Byzantine-fault-tolerant clock synchronization protocol. Tech. rep., TM-2009-215758, NASA.Google Scholar
- Marino, L. 1981. General theory of metastable operation. IEEE Trans. Comput. C-30, 2, 107--115. Google ScholarDigital Library
- Metra, C., Francescantonio, S., and Mak, T. 2004. Implications of clock distribution faults and issues with screening them during manufacturing testing. IEEE Trans. Comput. 53, 5, 531--546. Google ScholarDigital Library
- Pease, M., Shostak, R., and Lamport, L. 1980. Reaching agreement in the presence of faults. J. ACM 27, 228--234. Google ScholarDigital Library
- Polzer, T., Handl, T., and Steininger, A. 2009. A metastability-free multi-synchronous communication scheme for SoCs. In Proceedings of the 11th Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS). 578--592. Google ScholarDigital Library
- Portmann, C. L. and Meng, T. H. Y. 1995. Supply noise and CMOS synchronization errors. IEEE J. Solid-State Circ. SC-30, 9, 1015--1017.Google Scholar
- Restle, P., McNamara, T., Webber, D., Camporese, P., Eng, K., Jenkins, K., Allen, D., Rohn, M., Quaranta, M., Boerstler, D., Alpert, C., Carter, C., Bailey, R., Petrovick, J., Krauter, B., and McCredie, B. 2001. A clock distribution network for microprocessors. IEEE J. Solid-State Circ. 36, 5, 792--799.Google ScholarCross Ref
- Semiat, Y. and Ginosar, R. 2003. Timing measurements of synchronization circuits. In Proceedings of the 9th Symposium on Asynchronous Circuits and Systems (ASYNC). 68--77. Google ScholarDigital Library
- Srikanth, T. K. and Toueg, S. 1987. Optimal clock synchronization. J. ACM 34, 3, 626--645. Google ScholarDigital Library
- Sundaresan, K., Allen, P., and Ayazi, F. 2006. Process and temperature compensation in a 7-MHz CMOS clock oscillator. IEEE J. Solid-State Circ. 41, 2, 433--442.Google ScholarCross Ref
- Teehan, P., Greenstreet, M., and Lemieux, G. 2007. A survey and taxonomy of GALS design styles. IEEE Des. Test Comput. 24, 5, 418--428. Google ScholarDigital Library
- Widder, J. and Schmid, U. 2009. The theta-model: Achieving synchrony without clocks. Distrib. Comput. 22, 1, 29--47.Google ScholarDigital Library
Index Terms
- Fault-tolerant algorithms for tick-generation in asynchronous logic: Robust pulse generation
Recommendations
Fault-tolerant algorithms for tick-generation in asynchronous logic: robust pulse generation
SSS'11: Proceedings of the 13th international conference on Stabilization, safety, and security of distributed systemsThe advances of deep submicron VLSI technology pose new challenges in designing robust systems, which can in principle be addressed by approaches established in fault-tolerant distributed systems research. This paper is the first step in an attempt to ...
A Byzantine-fault tolerant self-stabilizing protocol for distributed clock synchronization systems
SSS'06: Proceedings of the 8th international conference on Stabilization, safety, and security of distributed systemsEmbedded distributed systems have become an integral part of safety-critical computing applications, necessitating system designs that incorporate fault tolerant clock synchronization in order to achieve ultra-reliable assurance levels. Many efficient ...
Rigorously modeling self-stabilizing fault-tolerant circuits
We present the first implementation of a distributed clock generation scheme for Systems-on-Chip that recovers from an unbounded number of arbitrary transient faults despite a large number of arbitrary permanent faults. We devise self-stabilizing ...
Comments