DOI: 10.1145/1254882.1254884
Article

Modeling and improving data cache reliability

Published: 12 June 2007

ABSTRACT

Soft errors arising from energetic particle strikes pose a significant reliability concern for computing systems, especially those operating in noisy environments. Technology scaling and aggressive leakage control mechanisms make the problem of these transient errors even more severe. It is therefore important to employ reliability-enhancing mechanisms in processor and memory designs to protect them against soft errors. To do so, we first need to model soft errors, and then use that model to study the cost/reliability tradeoffs among various reliability-enhancing techniques so that system requirements can be met.

Since cache memories occupy the largest fraction of on-chip real estate today, and their share is expected to keep growing in future designs, they are more vulnerable to soft errors than many other components of a computing system. In this paper, we first develop a soft error model for L1 data caches, and then explore different reliability-enhancing mechanisms. More specifically, we define a metric called AVFC (Architectural Vulnerability Factor for Caches), which represents the probability that a fault in the cache becomes visible in the final output of the program. Based on this model, we propose three architectural schemes for improving reliability in the presence of soft errors. Our first scheme prevents an error from propagating to lower levels of the memory hierarchy by not forwarding the unmodified data words of a dirty cache block to the L2 cache when that block is replaced. The second scheme selectively invalidates cache blocks to shorten their vulnerable periods, reducing their chances of catching soft errors. Based on the AVFC metric, our experimental results show that these two schemes are very effective in alleviating soft errors in the L1 data cache. Specifically, the first scheme can improve the AVFC metric by 32% without any performance loss, while the second scheme improves it by 60% to 97%, at the cost of a performance degradation ranging from 0% to 21.3%, depending on how aggressively cache blocks are invalidated. To reduce the performance overhead caused by cache block invalidation, we also propose a third scheme that tries to bring a fresh copy of an invalidated block back into the cache via prefetching.
Our experimental results indicate that this scheme reduces the performance overhead to less than 1% for all applications in our experimental suite, at the cost of giving up a tolerable portion of the reliability enhancement achieved by the second scheme.
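The vulnerability accounting behind the first two schemes can be illustrated with a small trace-driven sketch. It is not the paper's AVFC model. It assumes a simplified ACE-style lifetime rule in the spirit of Mukherjee et al.: a word is vulnerable from the moment its value arrives (by fill or write) until that value is last consumed, either by a read or by a writeback to L2 at eviction. It further assumes per-word dirty bits, full-word accesses, zero miss latency, and a decay-style invalidation that only targets clean blocks after a fixed idle window. The names `Access`, `block_vulnerability`, `vulnerability_fraction`, `dirty_words_only` and `decay_window` are hypothetical. The returned miss count stands in for the performance cost of invalidation that the third scheme's prefetching is meant to hide; prefetching itself is not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Access:
    time: int  # cycle at which the access occurs
    op: str    # "read" or "write" (full-word accesses assumed)
    word: int  # word index within the cache block


def block_vulnerability(trace: List[Access], words: int, fill: int, evict: int,
                        dirty_words_only: bool = False,
                        decay_window: Optional[int] = None) -> Tuple[int, int]:
    """Return (vulnerable word-cycles, extra misses) for one block residency.

    dirty_words_only: hypothetical flag modeling the first scheme (only
    modified words are written back to L2 on eviction).
    decay_window: hypothetical idle interval after which a clean block is
    invalidated, modeling the second scheme; None disables invalidation.
    """
    mark = [fill] * words       # cycle at which each word's current value arrived
    written = [False] * words   # assumed per-word dirty bits
    vulnerable = 0
    extra_misses = 0
    last_access = fill
    for a in sorted(trace, key=lambda x: x.time):
        idle = a.time - last_access
        if decay_window is not None and not any(written) and idle > decay_window:
            # The clean block was invalidated at last_access + decay_window, so this
            # access misses and refetches a fresh copy: the old values died unread.
            extra_misses += 1
            mark = [a.time] * words
        if a.op == "read":
            vulnerable += a.time - mark[a.word]  # value stayed live up to this read
        elif a.op == "write":
            written[a.word] = True               # old value was dead since its last read
        else:
            raise ValueError(f"unknown op {a.op!r}")
        mark[a.word] = a.time
        last_access = a.time
    if any(written):  # dirty block: words are written back to L2 at eviction
        for w in range(words):
            if written[w] or not dirty_words_only:
                vulnerable += evict - mark[w]
    return vulnerable, extra_misses


def vulnerability_fraction(vulnerable: int, words: int, fill: int, evict: int) -> float:
    """Share of the block's word-cycles that are vulnerable (a rough AVFC-like proxy)."""
    return vulnerable / (words * (evict - fill))


if __name__ == "__main__":
    dirty = [Access(10, "read", 0), Access(20, "write", 1), Access(50, "read", 0)]
    base = block_vulnerability(dirty, 4, 0, 100)
    s1 = block_vulnerability(dirty, 4, 0, 100, dirty_words_only=True)
    print(base, vulnerability_fraction(base[0], 4, 0, 100))  # (380, 0) 0.95
    print(s1, vulnerability_fraction(s1[0], 4, 0, 100))      # (130, 0) 0.325

    clean = [Access(10, "read", 0), Access(200, "read", 0)]
    print(block_vulnerability(clean, 4, 0, 300))                   # (200, 0)
    print(block_vulnerability(clean, 4, 0, 300, decay_window=50))  # (10, 1)
```

In the first example, writing back only the modified word removes the vulnerable tails of the three unmodified words, so the vulnerable word-cycles drop from 380 to 130 with no extra misses. In the second, invalidating the idle clean block trades one extra miss for a much shorter vulnerable period, which mirrors the reliability/performance tradeoff described for the second scheme.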

        Published in

          ACM Conferences
          SIGMETRICS '07: Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
          June 2007
          398 pages
          ISBN:9781595936394
          DOI:10.1145/1254882

          ACM SIGMETRICS Performance Evaluation Review
          Volume 35, Issue 1
          SIGMETRICS '07 Conference Proceedings
          June 2007
          382 pages
          ISSN:0163-5999
          DOI:10.1145/1269899

          Copyright © 2007 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Acceptance Rates

          Overall Acceptance Rate: 459 of 2,691 submissions, 17%
