ABSTRACT
Devices are increasingly vulnerable to soft errors as their feature sizes shrink. Previously, soft error rates were significant primarily in space and high-atmospheric computing. Modern architectures now use features so small, at such low voltages, that soft errors are becoming important even at terrestrial altitudes. Because of their large number of components, supercomputers are particularly susceptible to soft errors. Since many large-scale parallel scientific applications use iterative linear algebra methods, the soft error vulnerability of these methods constitutes a large fraction of the applications' overall vulnerability. Many users consider these methods invulnerable to most soft errors, since they converge from an imprecise solution to a precise one. However, we show in this paper that iterative methods are vulnerable to soft errors, exhibiting both silent data corruptions and poor ability to detect errors. Further, we evaluate a variety of soft error detection and tolerance techniques, including checkpointing, linear matrix encodings, and residual tracking techniques.
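To make the idea of residual tracking concrete, the following is a minimal sketch and not the paper's implementation. It assumes a plain Jacobi iteration on a small, diagonally dominant system, models a soft error as a single bit flip in one entry of the solution vector, and raises an alarm when the residual norm grows sharply between iterations. The names `flip_bit`, `residual_norm`, `jacobi`, `growth_limit`, and `floor` are all hypothetical and chosen only for illustration.

```python
import struct


def flip_bit(value, bit):
    """Return `value` with one bit of its IEEE-754 double encoding flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def residual_norm(A, x, b):
    """Infinity norm of the residual b - A x."""
    return max(abs(bi - sum(aij * xj for aij, xj in zip(row, x)))
               for row, bi in zip(A, b))


def jacobi(A, b, iters, fault=None, growth_limit=10.0, floor=1e-12):
    """Jacobi iteration with simple residual tracking.

    `fault` is an optional (iteration, index, bit) triple: one bit of
    x[index] is flipped right after the update in that iteration.
    Returns (x, alarms), where `alarms` lists the iterations at which the
    residual grew by more than `growth_limit` over the previous one.
    """
    n = len(b)
    x = [0.0] * n
    prev = residual_norm(A, x, b)
    alarms = []
    for k in range(iters):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
        if fault is not None and fault[0] == k:
            x[fault[1]] = flip_bit(x[fault[1]], fault[2])  # injected soft error
        r = residual_norm(A, x, b)
        # `floor` avoids false alarms once the residual reaches rounding noise.
        if r > growth_limit * max(prev, floor):
            alarms.append(k)
        prev = r
    return x, alarms


# Assumed toy system whose exact solution is [1, 2, 3].
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [6.0, 12.0, 14.0]

clean_x, clean_alarms = jacobi(A, b, 25)
high_x, high_alarms = jacobi(A, b, 25, fault=(10, 0, 61))  # exponent bit
low_x, low_alarms = jacobi(A, b, 25, fault=(10, 0, 0))     # lowest mantissa bit
```

In this toy setting the exponent-bit flip triggers an alarm and the iteration later re-converges, which matches the intuition the abstract attributes to many users. The mantissa-bit flip passes unnoticed, a small illustration of why detection by residual alone can be weak. The sketch does not reproduce the silent data corruptions the paper reports, and the threshold values are arbitrary assumptions.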
Index Terms
- Soft error vulnerability of iterative linear algebra methods