
Correcting soft errors online in LU factorization

Authors:
Teresa Davies, Colorado School of Mines, Golden, CO, USA
Zizhong Chen, University of California, Riverside, Riverside, CA, USA
HPDC '13: Proceedings of the 22nd international symposium on High-performance parallel and distributed computing, June 2013, Pages 167–178. https://doi.org/10.1145/2493123.2462920

Published: 17 June 2013

ABSTRACT

In high-performance systems, the probability of failure grows with the number of processors. Errors in calculations may occur that cannot be detected by outside means. To address this problem, we create a checksum-based approach that detects and recovers from calculation errors. We apply this approach to the LU factorization algorithm used by High Performance Linpack. Our approach has low overhead: in contrast to an existing approach, which must repeat the calculation when soft errors occur because the errors propagate, it repeats only a fraction of the calculation during recovery. Our approach detects and corrects errors during the calculation, before they propagate. The frequency of checking can be adjusted to the error rate, resulting in a flexible method of fault tolerance.
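The general idea of checksum-based detection and correction during an elimination can be illustrated with a small sketch. This is not the authors' implementation: it is a textbook-style weighted-checksum scheme, written under several simplifying assumptions (a dense matrix held on one process, no pivoting, at most one corrupted data entry per row between checks, and errors that never hit the checksum columns themselves). The function names `encode`, `eliminate_step`, `check_and_correct`, and `lu_with_checks` are hypothetical and chosen only for this illustration.

```python
def encode(a):
    """Append two checksum columns to every row: the plain row sum and an
    index-weighted row sum (weights 1, 2, ..., n). Illustrative scheme only."""
    return [row + [sum(row), sum((j + 1) * v for j, v in enumerate(row))]
            for row in a]


def eliminate_step(m, k):
    """One step of Gaussian elimination (no pivoting) applied to the whole
    encoded row, so both checksum columns are updated by the same row operation."""
    for i in range(k + 1, len(m)):
        factor = m[i][k] / m[k][k]
        m[i] = [x - factor * y for x, y in zip(m[i], m[k])]
        m[i][k] = 0.0  # store an exact zero below the pivot


def check_and_correct(m, n, tol=1e-9):
    """Compare each row against its checksums and repair a single bad entry.
    Assumes the error is in a data column, not in a checksum column."""
    fixed = []
    for i, row in enumerate(m):
        d1 = sum(row[:n]) - row[n]            # equals the error magnitude
        if abs(d1) <= tol:
            continue
        d2 = sum((j + 1) * row[j] for j in range(n)) - row[n + 1]
        j = round(d2 / d1) - 1                # ratio of discrepancies gives the column
        if not 0 <= j < n:
            raise ValueError("error pattern outside this sketch's assumptions")
        row[j] -= d1                          # undo the corruption
        fixed.append((i, j))
    return fixed


def lu_with_checks(a, fault=None, tol=1e-9):
    """Reduce `a` to upper-triangular form, checking after every step.
    `fault` = (step, row, col, delta) optionally injects one simulated soft error."""
    n = len(a)
    m = encode(a)
    corrections = []
    for k in range(n - 1):
        eliminate_step(m, k)
        if fault is not None and fault[0] == k:
            _, i, j, delta = fault
            m[i][j] += delta                  # simulated soft error
        corrections += check_and_correct(m, n, tol)
    return [row[:n] for row in m], corrections


a = [[4.0, 3.0, 2.0], [2.0, 1.0, 3.0], [3.0, 4.0, 1.0]]
u_clean, _ = lu_with_checks(a)
u_faulty, fixed = lu_with_checks(a, fault=(0, 2, 1, 5.0))
print(fixed, u_faulty == u_clean)
```

Because the row operations are linear, each row's checksums stay consistent with its data as long as no error occurs, so a mismatch both signals an error and, through the ratio of the two discrepancies, identifies the column to repair. The sketch checks after every elimination step; checking less often would lower the overhead but gives an error more steps in which to spread to other rows before it is caught.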


Index Terms

Correcting soft errors online in LU factorization
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks

Published in
HPDC '13: Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
June 2013
276 pages
ISBN: 9781450319102
DOI: 10.1145/2493123
General Chairs: Manish Parashar (Rutgers University, USA), Jon Weissman (University of Minnesota, USA)
Program Chairs: Dick Epema (Delft University of Technology and Eindhoven University of Technology, The Netherlands), Renato Figueiredo (University of Florida, USA and Vrije Universiteit, The Netherlands)
Copyright © 2013 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
LU factorization
algorithm-based recovery
fault tolerance
high performance linpack benchmark
soft errors
Qualifiers
- research-article
Acceptance Rates
HPDC '13 Paper Acceptance Rate: 20 of 131 submissions, 15%. Overall Acceptance Rate: 166 of 966 submissions, 17%.