Software-controlled fault tolerance

Published: 01 December 2005

Abstract

Traditional fault-tolerance techniques typically utilize resources ineffectively because they cannot adapt to the changing reliability and performance demands of a system. This paper proposes software-controlled fault tolerance, a concept allowing designers and users to tailor their performance and reliability for each situation. Several software-controllable fault-detection techniques are then presented: SWIFT, a software-only technique, and CRAFT, a suite of hybrid hardware/software techniques. Finally, the paper introduces PROFiT, a technique which adjusts the level of protection and performance at fine granularities through software control. When coupled with software-controllable techniques like SWIFT and CRAFT, PROFiT offers attractive and novel reliability options.
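As a rough illustration of the general idea behind software-only fault detection with a software-controlled protection level, the sketch below keeps a redundant copy of a computation and compares the two copies before the result is released. It is not the SWIFT, CRAFT, or PROFiT implementation: the function name `checked_sum`, the `protect` switch, and the `inject_fault_at` test hook are all hypothetical names introduced here, and a compiler-level technique would operate on instructions and registers rather than on Python values.

```python
# Illustrative sketch only: not the SWIFT, CRAFT, or PROFiT implementation.
# All names below are hypothetical and chosen for this example.


class FaultDetected(Exception):
    """Raised when the two redundant copies of a value disagree."""


def checked_sum(values, protect=True, inject_fault_at=None):
    """Sum `values`, optionally with a duplicated computation and a check.

    protect: hypothetical per-region switch standing in for software control
        over whether this region pays for redundancy.
    inject_fault_at: hypothetical test hook; flips one bit of the primary
        accumulator after processing that index to mimic a transient fault.
    """
    total = 0   # primary copy
    shadow = 0  # redundant copy, maintained only when protect is True
    for i, v in enumerate(values):
        total += v
        if protect:
            shadow += v
        if inject_fault_at == i:
            total ^= 1 << 3  # simulated single-bit flip in the primary copy
    # Compare the two copies before the result leaves the protected region.
    if protect and total != shadow:
        raise FaultDetected(f"mismatch: {total} != {shadow}")
    return total
```

With `protect=True` the injected bit flip is caught by the comparison, whereas with `protect=False` the corrupted value is returned silently. That trade-off between redundancy cost and detection is what a software-controlled scheme would let a designer choose per region.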

Published in

ACM Transactions on Architecture and Code Optimization, Volume 2, Issue 4
December 2005
116 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/1113841

Copyright © 2005 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States
