ABSTRACT
A transient hardware fault occurs when an energetic particle strikes a transistor, causing it to change state. These faults do not cause permanent damage, but may result in incorrect program execution by altering signal transfers or stored values. While the likelihood that such transient faults will cause any significant damage may seem remote, over the last several years transient faults have caused costly failures in high-end machines at America Online, eBay, and the Los Alamos Neutron Science Center, among others [6, 44, 15]. Because susceptibility to transient faults is proportional to the size and density of transistors, the problem of transient faults will become increasingly important in the coming decades.This paper defines the first formal, type-theoretic framework for studying reliable computation in the presence of transient faults. More specifically, it defines λzap, a lambda calculus that exhibits intermittent data faults. In order to detect and recover from these faults, λzap programs replicate intermediate computations and use majority voting, thereby modeling software-based fault tolerance techniques studied extensively, but informally [10, 20, 30, 31, 32, 33, 41].To ensure that programs maintain the proper invariants and use λzap primitives correctly, the paper defines a type system for the language. This type system guarantees that well-typed programs can tolerate any single data fault. To demonstrate that λzap can serve as an idealized typed intermediate language, we define a type-preserving translation from a standard simply-typed lambda calculus into λzap.
- M. Abadi, M. Budiu, Úlfar Erlingsson, and J. Ligatti. Control-flow integrity: Principles, implementations, and applications. In ACM Conference on Computer and Communications Security, Nov. 2005. Google ScholarDigital Library
- M. Abadi, M. Budiu, Úlfar Erlingsson, and J. Ligatti. A theory of secure control flow. In International Conference on Formal Engineering Methods, Nov. 2005. Google ScholarDigital Library
- A. W. Appel. Foundational proof-carrying code. In Sixteenth Annual IEEE Symposium on Logic in Computer Science, pages 247--258. IEEE, 2001. Google ScholarDigital Library
- T. M. Austin. DIVA: a reliable substrate for deep submicron microarchitecture design. In Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, pages 196--207. IEEE Computer Society, 1999. Google ScholarDigital Library
- R. C. Baumann. Soft errors in advanced semiconductor devices-part I: the three radiation sources. IEEE Transactions on Device and Materials Reliability, 1(1):17-22, March 2001.Google ScholarCross Ref
- R. C. Baumann. Soft errors in commercial semiconductor technology: Overview and scaling trends. In IEEE 2002 Reliability Physics Tutorial Notes, Reliability Fundamentals, pages 121_01.1 - 121_01.14, April 2002.Google Scholar
- C. Bolchini and F. Salice. A software methodology for detecting hardware faults in vliw data paths. In IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, 2001. Google ScholarDigital Library
- J. Chang, G. A. Reis, and D. I. August. Automatic instructionlevel software-only recovery methods. In Proceedings of the 2006 International Conference on Dependendable Systems and Networks, June 2006. Google ScholarDigital Library
- A. Dimock, R. Muller, F. Turbak, and J. B. Wells. Strongly typed flow-directed representation transformations. In ACM International Conference on Functional Programming, pages 85--98, Amsterdam, June 1997. Google ScholarDigital Library
- M. Gomaa, C. Scarbrough, T. N. Vijaykumar, and I. Pomeranz. Transient-fault recovery for chip multiprocessors. In Proceedings of the 30th annual international symposium on Computer architecture, pages 98--109. ACM Press, 2003. Google ScholarDigital Library
- S. Govindavajhala and A. W. Appel. Using memory errors to attack a virtual machine. In IEEE Symposium on Security and Privacy, pages 154--165, May 2003. Google ScholarDigital Library
- J. G. Holm and P. Banerjee. Low cost concurrent error detection in a VLIW architecture using replicated instructions. In Proceedings of the 1992 International Conference on Parallel Processing, volume 1, pages 192--195, August 1992.Google Scholar
- R.W. Horst, R.L. Harris, and R.L. Jardine. Multiple instruction issue in the NonStop Cyclone processor. In Proceedings of the 17th International Symposium on Computer Architecture, pages 216--226, May 1990. Google ScholarDigital Library
- A. Mahmood and E.J. McCluskey. Concurrent error detection using watchdog processors-a survey. IEEE Transactions on Computers, 37(2):160--174, 1988. Google ScholarDigital Library
- S.E. Michalak, K.W. Harris, N.W. Hengartner, B.E. Takala, and S.A. Wender. Predicting the number of fatal soft errors in Los Alamos National LabratoryÆs ASC Q computer. IEEE Transactions on Device and Materials Reliability, 5(3):329--335, September 2005.Google ScholarCross Ref
- G. Morrisett, K. Crary, N. Glew, and D. Walker. Stack-based Typed Assembly Language. In Second International Workshop on Types in Compilation, pages 95--117, Kyoto, Mar. 1998. Published in Xavier Leroy and Atsushi Ohori, editors, Lecture Notes in Computer Science, volume 1473, pages 28--52. Springererlag, 1998. Google ScholarDigital Library
- G. Morrisett, K. Crary, N. Glew, and D. Walker. Stack-based Typed Assembly Language. Journal of Functional Programming, 12(1):43--88, Jan. 2002. Google ScholarDigital Library
- G. Morrisett, M. Felleisen, and R. Harper. Abstract models of memory management. In ACM Conference on Functional Programming and Computer Architecture, pages 66--77, La Jolla, June 1995. Google ScholarDigital Library
- G. Morrisett, D. Walker, K. Crary, and N. Glew. From System F to Typed Assembly Language. ACM Transactions on Programming Languages and Systems, 3(21):528--569, May 1999. Google ScholarDigital Library
- S.S. Mukherjee, M. Kontz, and S. K. Reinhardt. Detailed design and evaluation of redundant multithreading alternatives. In Proceedings of the 29th Annual International Symposium on Computer Architecture, pages 99--110. IEEE Computer Society, 2002. Google ScholarDigital Library
- G. Necula. Proof-carrying code. In Twenty-Fourth ACM Symposium on Principles of Programming Languages, pages 106--119, Paris, 1997. Google ScholarDigital Library
- G. Necula and P. Lee. Safe kernel extensions without runtime checking. In Proceedings of Operating System Design and Implementation, pages 229--243, Seattle, Oct. 1996. Google ScholarDigital Library
- T.J. O'Gorman, J.M. Ross, A.H. Taber, J.F. Ziegler, H.P. Muhlfeld, I.C.J. Montrose, H.W. Curtis, and J.L. Walsh. Field testing for cosmic ray soft errors in semiconductor memories. In IBM Journal of Research and Development, pages 41--49, January 1996. Google ScholarDigital Library
- N. Oh and E.J. McCluskey. Low energy error detection technique using procedure call duplication. In Proceedings of the 2001 International Symposium on Dependable Systems and Networks, 2001.Google Scholar
- N. Oh, P.P. Shirvani, and E.J. McCluskey. Control-flow checking by software signatures. In IEEE Transactions on Reliability, volume 51, pages 111--122, March 2002.Google ScholarCross Ref
- N. Oh, P.P. Shirvani, and E.J. McCluskey. ED4I: Error detection by diverse data and duplicated instructions. In IEEE Transactions on Computers, volume 51, pages 180--199, February 2002. Google ScholarDigital Library
- N. Oh, P.P. Shirvani, and E.J. McCluskey. Error detection by duplicated instructions in super-scalar processors. In IEEE Transactions on Reliability, volume 51, pages 63--75, March 2002.Google ScholarCross Ref
- J. Ohlsson and M. Rimen. Implicit signature checking. In International Conference on Fault-Tolerant Computing, June 1995. Google ScholarDigital Library
- M. Rebaudengo, M.S. Reorda, M. Violante, and M. Torchiano. A source-to-source compiler for generating dependable software. In IEEE International Workshop on Source Code Analysis and Manipulation, pages 33--42, 2001.Google ScholarCross Ref
- S.K. Reinhardt and S.S. Mukherjee. Transient fault detection via simultaneous multithreading. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 25--36. ACM Press, 2000. Google ScholarDigital Library
- G.A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D.I. August. SWIFT: Software implemented fault tolerance. In Proceedings of the 3rd International Symposium on Code Generation and Optimization, March 2005. Google ScholarDigital Library
- E. Rotenberg. AR-SMT: A microarchitectural approach to fault tolerance in microprocessors. In Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, page 84. IEEE Computer Society, 1999. Google ScholarDigital Library
- N. Saxena and E. J. McCluskey. Dependable adaptive computing systems - the ROAR project. In International Conference on Systems, Man, and Cybernetics, pages 2172--2177, October 1998.Google ScholarCross Ref
- Z. Shao. An overview of the FLINT/ML compiler. In Workshop on Types in Compilation, Amsterdam, June 1997. ACM. Published as Boston College Computer Science Dept. Technical Report BCCS-97-03.Google Scholar
- Z. Shao and A. Appel. A type-based compiler for Standard ML. In ACM Conference on Programming Language Design and Implementation, pages 116--129, La Jolla, June 1995. Google ScholarDigital Library
- P.P. Shirvani, N. Saxena, and E.J. McCluskey. Softwareimplemented EDAC protection against SEUs. In IEEE Transactions on Reliability, volume 49, pages 273--284, 2000.Google ScholarCross Ref
- P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, and L. Alvisi. Modeling the effect of technology trends on the soft error rate of combinational logic. In Proceedings of the 2002 International Conference on Dependable Systems and Networks, pages 389--399, June 2002. Google ScholarDigital Library
- T.J. Slegel, R.M. Averill III, M.A. Check, B.C. Giamei, B.W. Krumm, C.A. Krygowski, W.H. Li, J.S. Liptay, J.D. MacDougall, T.J. McPherson, J.A. Navarro, E.M. Schwarz, K. Shum, and C.F. Webb. IBMÆs S/390 G5 Microprocessor design. In IEEE Micro, volume 19, pages 12--23, March 1999. Google ScholarDigital Library
- D. Tarditi, G. Morrisett, P. Cheng, C. Stone, R. Harper, and P. Lee. TIL: A type-directed optimizing compiler for ML. In ACM Conference on Programming Language Design and Implementation, pages 181--192, Philadelphia, May 1996. Google ScholarDigital Library
- R. Venkatasubramanian, J. P. Hayes, and B. T. Murray. Low-cost on-line fault detection using control flow assertions. In Proceedings of the 9th IEEE International On-Line Testing Symposium, pages 137--143, July 2003.Google ScholarCross Ref
- T. N. Vijaykumar, I. Pomeranz, and K. Cheng. Transient-fault recovery using simultaneous multithreading. In Proceedings of the 29th Annual International Symposium on Computer Architecture, pages 87--98. IEEE Computer Society, 2002. Google ScholarDigital Library
- Y. Yeh. Triple-triple redundant 777 primary flight computer. In Proceedings of the 1996 IEEE Aerospace Applications Conference, volume 1, pages 293--307, February 1996.Google ScholarCross Ref
- Y. Yeh. Design considerations in Boeing 777 fly-by-wire computers. In Proceedings of the Third IEEE International High-Assurance Systems Engineering Symposium, pages 64--72, November 1998. Google ScholarDigital Library
- J.F. Ziegler and H. Puchner. SER-History, Trends, and Challenges: A Guide for Designing with Memory ICs. 2004.Google Scholar
Index Terms
- Static typing for a faulty lambda calculus
Recommendations
Fault-tolerant typed assembly language
PLDI '07: Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and ImplementationA transient hardware fault occurs when an energetic particle strikes a transistor, causing it to change state. Although transient faults do not permanently damage the hardware, they may corrupt computations by altering stored values and signal ...
Static typing for a faulty lambda calculus
Proceedings of the 2006 ICFP conferenceA transient hardware fault occurs when an energetic particle strikes a transistor, causing it to change state. These faults do not cause permanent damage, but may result in incorrect program execution by altering signal transfers or stored values. While ...
Fault-tolerant typed assembly language
Proceedings of the 2007 PLDI conferenceA transient hardware fault occurs when an energetic particle strikes a transistor, causing it to change state. Although transient faults do not permanently damage the hardware, they may corrupt computations by altering stored values and signal ...
Comments