ABSTRACT
Soft errors arising from energetic particle strikes pose a significant reliability concern for computing systems, especially for those running in noisy environments. Technology scaling and aggressive leakage control mechanisms make the problem caused by these transient errors even more severe. Therefore, it is very important to employ reliability-enhancing mechanisms in processor/memory designs to protect them against soft errors. To do so, we first need to model soft errors, and then use the model to study cost/reliability tradeoffs among various reliability-enhancing techniques so that system requirements can be met.
Since cache memories take the largest fraction of on-chip real estate today and their share is expected to continue to grow in future designs, they are more vulnerable to soft errors than many other components of a computing system. In this paper, we first focus on a soft error model for L1 data caches, and then explore different reliability-enhancing mechanisms. More specifically, we define a metric called AVFC (Architectural Vulnerability Factor for Caches), which represents the probability that a fault in the cache is visible in the final output of the program. Based on this model, we then propose three architectural schemes for improving reliability in the presence of soft errors. Our first scheme prevents an error from propagating to the lower levels of the memory hierarchy by not forwarding the unmodified data words of a dirty cache block to the L2 cache when the dirty block is replaced. The second scheme selectively invalidates cache blocks to reduce their vulnerable periods, decreasing their chances of catching soft errors. Based on the AVFC metric, our experimental results show that these two schemes are very effective in alleviating soft errors in the L1 data cache. Specifically, by using our first scheme, it is possible to improve the AVFC metric by 32% without any performance loss. The second scheme, on the other hand, improves the AVFC metric by between 60% and 97%, at the cost of a performance degradation that varies from 0% to 21.3%, depending on how aggressively the cache blocks are invalidated. To reduce the performance overhead caused by cache block invalidation, we also propose a third scheme, which tries to bring a fresh copy of the invalidated block into the cache via prefetching.
Our experimental results indicate that this scheme can reduce the performance overhead to less than 1% for all applications in our experimental suite, at the cost of giving up a tolerable portion of the reliability enhancement that the second scheme achieves.
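The first scheme can be illustrated with a minimal sketch. It assumes that each cache block keeps a per-word dirty bit; the abstract does not say how modified words are tracked, so this bookkeeping, the block size, and all names here (CacheBlock, evict, WORDS_PER_BLOCK) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List

WORDS_PER_BLOCK = 8  # assumed block size in words (illustrative only)


@dataclass
class CacheBlock:
    """One L1 data cache block with a hypothetical per-word dirty bit."""
    tag: int
    words: List[int]
    dirty: List[bool] = field(default_factory=lambda: [False] * WORDS_PER_BLOCK)

    def write(self, offset: int, value: int) -> None:
        """Store a value and mark only that word as modified."""
        self.words[offset] = value
        self.dirty[offset] = True


def evict(block: CacheBlock, l2: Dict[int, List[int]]) -> int:
    """Write back only the modified words; return how many were forwarded."""
    if not any(block.dirty):
        return 0  # clean block: nothing is forwarded to L2
    l2_copy = l2.setdefault(block.tag, [0] * WORDS_PER_BLOCK)
    forwarded = 0
    for offset, modified in enumerate(block.dirty):
        if modified:
            l2_copy[offset] = block.words[offset]
            forwarded += 1
    return forwarded


# Usage: a bit flip in an unmodified word does not reach L2.
l2 = {0x40: [10, 11, 12, 13, 14, 15, 16, 17]}
block = CacheBlock(tag=0x40, words=list(l2[0x40]))
block.write(2, 99)         # the program modifies word 2
block.words[5] ^= 1 << 3   # simulated soft error in unmodified word 5
forwarded = evict(block, l2)
print(forwarded, l2[0x40][2], l2[0x40][5])
```

In this sketch the corrupted word 5 stays in the evicted L1 block, while the L2 copy keeps its original value because only the word the program actually wrote is forwarded.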
Modeling and improving data cache reliability. SIGMETRICS '07 Conference Proceedings.