DOI: 10.1145/1375527.1375552
research-article

Soft error vulnerability of iterative linear algebra methods

Published: 07 June 2008

ABSTRACT

Devices are increasingly vulnerable to soft errors as their feature sizes shrink. Previously, soft error rates were significant primarily in space and high-atmospheric computing. Modern architectures now use features so small at sufficiently low voltages that soft errors are becoming important even at terrestrial altitudes. Due to their large number of components, supercomputers are particularly susceptible to soft errors. Since many large scale parallel scientific applications use iterative linear algebra methods, the soft error vulnerability of these methods constitutes a large fraction of the applications' overall vulnerability. Many users consider these methods invulnerable to most soft errors since they converge from an imprecise solution to a precise one. However, we show in this paper that iterative methods are vulnerable to soft errors, exhibiting both silent data corruptions and poor ability to detect errors. Further, we evaluate a variety of soft error detection and tolerance techniques, including checkpointing, linear matrix encodings, and residual tracking techniques.
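
The residual tracking idea mentioned above can be illustrated with a toy sketch. It is not the paper's experimental setup: the Jacobi solver, the 3x3 system, and the names `flip_bit`, `residual_norm`, `jacobi`, `growth_limit`, and `floor` are all assumptions made for this example. A single bit of one solution entry is flipped during the iteration, and the solver flags any step where the residual norm jumps by more than a chosen factor.

```python
import struct


def flip_bit(value, bit):
    """Return `value` with one bit of its IEEE-754 double encoding flipped.

    Bit 0 is the least significant mantissa bit; bits 52-62 are the exponent.
    """
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def residual_norm(A, x, b):
    """Infinity norm of the residual b - A x."""
    return max(
        abs(bi - sum(aij * xj for aij, xj in zip(row, x)))
        for row, bi in zip(A, b)
    )


def jacobi(A, b, iters, fault=None, growth_limit=10.0, floor=1e-12):
    """Run `iters` Jacobi sweeps starting from the zero vector.

    fault: optional (iteration, index, bit) tuple; after that sweep, one bit
           of x[index] is flipped (hypothetical fault model for this sketch).
    Returns (x, detected_at), where detected_at is the first iteration whose
    residual exceeds growth_limit * max(previous residual, floor), or None.
    The floor keeps round-off noise near convergence from raising alarms.
    """
    n = len(b)
    x = [0.0] * n
    prev = residual_norm(A, x, b)
    detected_at = None
    for k in range(iters):
        x = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        if fault is not None and fault[0] == k:
            x[fault[1]] = flip_bit(x[fault[1]], fault[2])
        r = residual_norm(A, x, b)
        if detected_at is None and r > growth_limit * max(prev, floor):
            detected_at = k
        prev = r
    return x, detected_at


# Diagonally dominant test system whose exact solution is (1, 2, 3).
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [6.0, 12.0, 14.0]

clean = jacobi(A, b, 60)                          # no fault injected
exponent_flip = jacobi(A, b, 60, fault=(5, 1, 60))  # exponent bit of x[1]
mantissa_flip = jacobi(A, b, 60, fault=(5, 1, 30))  # mid-mantissa bit of x[1]
```

In this sketch the exponent-bit flip produces a residual jump that the check catches, while the mantissa-bit flip stays below the threshold and goes unreported. Both runs still converge here because the fault hits only the iterate, which Jacobi overwrites on the next sweep; a fault in persistent data such as the matrix itself would not be repaired this way, and this toy check makes no attempt to cover that case.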

      • Published in

        ICS '08: Proceedings of the 22nd annual international conference on Supercomputing
        June 2008
        390 pages
        ISBN:9781605581583
        DOI:10.1145/1375527

        Copyright © 2008 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 629 of 2,180 submissions, 29%
