ABSTRACT
In high-performance systems, the probability of failure grows with the number of processors. Errors in calculations may occur that cannot be detected by outside means. To address this problem, we create a checksum-based approach that detects and recovers from calculation errors, and we apply it to the LU factorization algorithm used by High Performance Linpack. Our approach has low overhead. An existing approach must repeat the calculation when a soft error occurs, because the error has already propagated. Our approach instead detects and corrects errors during the calculation, before they propagate, so it repeats only a fraction of the calculation during recovery. The frequency of checking can be adjusted to the error rate, resulting in a flexible method of fault tolerance.
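To make the checksum idea concrete, here is a minimal sketch in Python. It is not the authors' implementation and not High Performance Linpack code. It assumes Gaussian elimination without pivoting, a check after every elimination step, and at most one corrupted data entry per row between checks. All names (`encode`, `eliminate`, `check_and_correct`, `factor_upper`) and the `fault` parameter are hypothetical. Each row carries a plain sum and a weighted sum. Row operations are linear, so both sums stay valid during elimination. The plain-sum mismatch gives the size of an error, and the ratio of the two mismatches gives its column.

```python
def encode(a):
    """Append two checksums to each row: plain sum and weighted sum (weights 1..n)."""
    return [
        [float(x) for x in row]
        + [float(sum(row)), float(sum((j + 1) * x for j, x in enumerate(row)))]
        for row in a
    ]


def eliminate(m, n, k):
    """One elimination step on column k, applied to data and checksum columns alike."""
    for i in range(k + 1, len(m)):
        f = m[i][k] / m[k][k]  # assumes a nonzero pivot (no pivoting in this sketch)
        for j in range(k, n + 2):
            m[i][j] -= f * m[k][j]


def check_and_correct(m, n, tol=1e-8):
    """Compare each row with its checksums and repair a single corrupted data entry."""
    fixed = []
    for i, row in enumerate(m):
        d1 = sum(row[:n]) - row[n]                                   # error magnitude
        d2 = sum((j + 1) * row[j] for j in range(n)) - row[n + 1]    # weight * magnitude
        if abs(d1) <= tol:
            continue
        j = round(d2 / d1) - 1                                       # column of the error
        if not 0 <= j < n:
            raise ValueError("error pattern outside this sketch's single-entry assumption")
        row[j] -= d1
        fixed.append((i, j))
    return fixed


def factor_upper(a, fault=None, tol=1e-8):
    """Reduce `a` to upper-triangular form, checking the checksums after every step.

    `fault` is a hypothetical test hook (step, row, col, delta) that simulates a soft error.
    """
    n = len(a)
    m = encode(a)
    corrections = []
    for k in range(n - 1):
        eliminate(m, n, k)
        if fault is not None and fault[0] == k:
            _, i, j, delta = fault
            m[i][j] += delta  # simulated soft error
        corrections += check_and_correct(m, n, tol)
    return [row[:n] for row in m], corrections


A = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
u, fixes = factor_upper(A, fault=(0, 2, 2, 0.5))
print(u)      # [[2.0, 1.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 2.0]]
print(fixes)  # [(2, 2)]
```

Because the check runs right after the faulty step, the error is repaired before the next elimination step can spread it to other entries. Checking less often would lower the overhead but would let an error propagate further before detection, which this single-entry correction cannot undo.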
Correcting soft errors online in LU factorization
HPDC '13: Proceedings of the 22nd international symposium on High-performance parallel and distributed computing