Abstract
Traditional fault-tolerance techniques typically use resources inefficiently because they cannot adapt to a system's changing reliability and performance demands. This paper proposes software-controlled fault tolerance, a concept that lets designers and users tailor performance and reliability to each situation. It then presents several software-controllable fault-detection techniques: SWIFT, a software-only technique, and CRAFT, a suite of hybrid hardware/software techniques. Finally, the paper introduces PROFiT, a technique that adjusts the level of protection and performance at fine granularities through software control. When coupled with software-controllable techniques such as SWIFT and CRAFT, PROFiT offers attractive and novel reliability options.
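To make the idea of software-controlled protection concrete, the following is a minimal sketch and not the paper's implementation. It assumes that fault detection is done by running a computation twice and comparing the results, and that a software flag decides per call whether to pay for that redundancy. The names `run_checked`, `FaultDetected`, `square`, and `make_flaky` are hypothetical and exist only for this illustration; the actual techniques operate at the instruction level inside a compiler and, for CRAFT, with hardware support.

```python
from typing import Callable


class FaultDetected(Exception):
    """Raised when the two redundant results disagree."""


def run_checked(compute: Callable[[int], int], x: int, protect: bool) -> int:
    """Run compute(x); when protect is True, run it twice and compare."""
    result = compute(x)
    if protect:
        shadow = compute(x)  # redundant copy of the computation
        if shadow != result:
            raise FaultDetected(f"mismatch: {result} != {shadow}")
    return result


def square(x: int) -> int:
    return x * x


def make_flaky(fn: Callable[[int], int]) -> Callable[[int], int]:
    """Hypothetical fault injector: flips the low bit of the first result only."""
    calls = {"n": 0}

    def wrapped(x: int) -> int:
        calls["n"] += 1
        value = fn(x)
        return value ^ 1 if calls["n"] == 1 else value

    return wrapped
```

With `protect=False` the injected bit flip passes through silently at the cost of a single execution; with `protect=True` the second execution exposes the mismatch at roughly twice the cost. Choosing the flag per region of code is the kind of performance and reliability trade-off the abstract describes.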