Modeling and improving data cache reliability: 1

Authors:
Ismail Kadayif, Canakkale Onsekiz Mart University
Mahmut Kandemir, The Pennsylvania State University

SIGMETRICS '07: Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, June 2007. https://doi.org/10.1145/1254882.1254884

Published: 12 June 2007

ABSTRACT

Soft errors arising from energetic particle strikes pose a significant reliability concern for computing systems, especially those running in noisy environments. Technology scaling and aggressive leakage control mechanisms make the problem caused by these transient errors even more severe. It is therefore important to employ reliability-enhancing mechanisms in processor and memory designs to protect them against soft errors. Doing so requires first modeling soft errors and then, based on the model, studying the cost/reliability tradeoffs among various reliability-enhancing techniques so that system requirements can be met.

Since cache memories take up the largest fraction of on-chip real estate today, and their share is expected to keep growing in future designs, they are more vulnerable to soft errors than many other components of a computing system. In this paper, we first focus on a soft error model for L1 data caches and then explore different reliability-enhancing mechanisms. More specifically, we define a metric called AVFC (Architectural Vulnerability Factor for Caches), which represents the probability that a fault in the cache is visible in the final output of the program. Based on this model, we then propose three architectural schemes for improving reliability in the presence of soft errors. Our first scheme prevents an error from propagating to the lower levels of the memory hierarchy by not forwarding the unmodified data words of a dirty cache block to the L2 cache when the dirty block is replaced. The second scheme selectively invalidates cache blocks to shorten their vulnerable periods, decreasing their chances of catching soft errors. Based on the AVFC metric, our experimental results show that these first two schemes are very effective in alleviating soft errors in the L1 data cache. Specifically, the first scheme can improve the AVFC metric by 32% without any performance loss. The second scheme improves the AVFC metric by between 60% and 97%, at the cost of a performance degradation that varies from 0% to 21.3%, depending on how aggressively the cache blocks are invalidated. To reduce the performance overhead caused by cache block invalidation, we also propose a third scheme, which tries to bring a fresh copy of the invalidated block into the cache via prefetching. Our experimental results indicate that this scheme can reduce the performance overhead to less than 1% for all applications in our experimental suite, at the cost of giving up a tolerable portion of the reliability enhancement the second scheme achieves.
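
As an illustration only, the first two schemes can be sketched in a few lines of Python. This is a toy model, not the paper's simulator and not its AVFC definition: the block geometry (`WORDS_PER_BLOCK`), the per-word dirty bits, the `decay` threshold, and the use of idle cycles between reads as a stand-in for vulnerability are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field

WORDS_PER_BLOCK = 8  # assumed block geometry, not taken from the paper


@dataclass
class CacheBlock:
    """An L1 block with one dirty bit per word (hypothetical layout)."""
    words: list
    dirty: list = field(default_factory=lambda: [False] * WORDS_PER_BLOCK)

    def write(self, index, value):
        self.words[index] = value
        self.dirty[index] = True


def evict(block, l2_copy):
    """Scheme 1 (sketch): on replacement, forward only the modified words.

    Unmodified words keep the value already held in L2, so a bit flip
    in a clean word of a dirty block is not propagated downward.
    """
    return [w if d else old
            for w, d, old in zip(block.words, block.dirty, l2_copy)]


def vulnerable_cycles(read_times, decay=None):
    """Scheme 2 (sketch): idle cycles of a clean block that a later read consumes.

    If the gap between two reads exceeds `decay`, the block is treated as
    invalidated: the next read misses and refetches a fresh copy, so any
    error picked up during that gap is discarded.
    Returns (vulnerable cycles, extra misses caused by invalidation).
    """
    total, misses = 0, 0
    for prev, nxt in zip(read_times, read_times[1:]):
        gap = nxt - prev
        if decay is not None and gap > decay:
            misses += 1
        else:
            total += gap
    return total, misses


if __name__ == "__main__":
    l2 = list(range(WORDS_PER_BLOCK))
    blk = CacheBlock(words=list(l2))
    blk.write(2, 99)       # the program modifies word 2
    blk.words[5] ^= 1      # simulated soft error in a clean word
    print(evict(blk, l2))  # the flipped word 5 is not written back
    print(vulnerable_cycles([0, 10, 500, 520]))
    print(vulnerable_cycles([0, 10, 500, 520], decay=100))
```

In this toy setting, a smaller `decay` removes more idle time from the vulnerable total but adds misses, which mirrors the reliability/performance tradeoff the abstract reports for the second scheme. The third scheme would correspond to prefetching the block again after invalidation so that the extra miss is hidden.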

Index Terms

Hardware

Published in
SIGMETRICS '07: Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
June 2007
398 pages
ISBN:9781595936394
DOI:10.1145/1254882
General Chair: Leana Golubchik, University of Southern California, USA
Program Chairs: Mostafa Ammar, Georgia Institute of Technology, USA; Mor Harchol-Balter, Carnegie Mellon University, USA
ACM SIGMETRICS Performance Evaluation Review Volume 35, Issue 1
SIGMETRICS '07 Conference Proceedings
June 2007
382 pages
ISSN:0163-5999
DOI:10.1145/1269899
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data caches
data integrity
reliability
soft errors
vulnerability factors
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 459 of 2,691 submissions, 17%