DOI: 10.1145/2493123.2462920
research-article

Correcting soft errors online in LU factorization

Published: 17 June 2013

ABSTRACT

In high-performance systems, the probability of failure grows with the number of processors. Some errors in calculations (soft errors) cannot be detected by outside means. To address this problem, we create a checksum-based approach that detects and recovers from calculation errors, and we apply it to the LU factorization algorithm used by High Performance Linpack. Our approach has low overhead. An existing approach has to repeat the calculation when a soft error occurs, because the error has already propagated. Our approach detects and corrects errors during the calculation, before they propagate, so it repeats only a fraction of the calculation during recovery. The frequency of checking can be adjusted to the error rate, resulting in a flexible method of fault tolerance.
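
The general idea of checksum-based detection and correction during LU factorization can be illustrated with a small sketch. This is not the paper's implementation: it is a generic algorithm-based fault tolerance example that assumes no pivoting, at most one corrupted data entry per row between checks, and uncorrupted checksum columns. The names `lu_with_checksums`, `check_and_fix`, `check_every`, and `inject` are hypothetical. Each row carries a plain sum and a weighted sum; because elimination applies the same row operations to the checksum columns, a mismatch in the plain sum gives the size of the error, and the ratio of the two mismatches gives its column.

```python
def check_and_fix(m, n, tol):
    """Verify both checksums of every row; repair a single bad entry per row."""
    fixed = []
    for i, row in enumerate(m):
        # r1 is the error magnitude if exactly one data entry is wrong.
        r1 = sum(row[:n]) - row[n]
        if abs(r1) <= tol:
            continue
        # r2 = (col + 1) * r1, so the ratio locates the corrupted column.
        r2 = sum((j + 1) * v for j, v in enumerate(row[:n])) - row[n + 1]
        col = round(r2 / r1) - 1
        if 0 <= col < n:
            row[col] -= r1
            fixed.append((i, col))
    return fixed


def lu_with_checksums(a, check_every=1, inject=None, tol=1e-9):
    """Gaussian elimination (no pivoting) on A extended by two checksum columns.

    Returns (U, corrections). `inject` is a hypothetical (step, row, col, delta)
    tuple that simulates one soft error right after elimination step `step`.
    """
    n = len(a)
    # Append the plain row sum and the weighted row sum (weights 1..n).
    m = [row[:] + [sum(row), sum((j + 1) * v for j, v in enumerate(row))]
         for row in a]
    corrections = []
    for k in range(n):
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            # The checksum columns receive the same row operation as the data.
            for j in range(k, n + 2):
                m[i][j] -= f * m[k][j]
        if inject is not None and inject[0] == k:
            _, r, c, delta = inject
            m[r][c] += delta  # simulated soft error
        # Check periodically, and always after the final step.
        if (k + 1) % check_every == 0 or k == n - 1:
            corrections += check_and_fix(m, n, tol)
    return [row[:n] for row in m], corrections


a = [[4.0, 3.0, 2.0], [2.0, 1.0, 3.0], [3.0, 4.0, 1.0]]
u_clean, none_fixed = lu_with_checksums(a)
u_fixed, fixed = lu_with_checksums(a, inject=(0, 2, 2, 5.0))
```

With `check_every=1` the injected error is repaired in the same step it appears, so `u_fixed` matches `u_clean` and `fixed` records the repaired position. A larger `check_every` lowers the checking cost but leaves more time for an error to spread into other rows, which this simple sketch does not attempt to handle.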


    Published in

      HPDC '13: Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
      June 2013
      276 pages
      ISBN: 9781450319102
      DOI: 10.1145/2493123
      General Chairs: Manish Parashar, Jon Weissman
      Program Chairs: Dick Epema, Renato Figueiredo

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      HPDC '13 Paper Acceptance Rate: 20 of 131 submissions, 15%
      Overall Acceptance Rate: 166 of 966 submissions, 17%
