research-article

Memory Errors in Modern Systems: The Good, The Bad, and The Ugly

Published: 14 March 2015

Abstract

Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that contain orders of magnitude more DRAM and SRAM than found in current memory subsystems. These memory subsystems will need to provide resilience techniques to tolerate these faults when deployed in high-performance computing systems and data centers containing tens of thousands of nodes. Therefore, it is critical to understand the efficacy of current hardware resilience techniques to determine whether they will be suitable for future systems. In this paper, we present a study of DRAM and SRAM faults and errors from the field. We use data from two leadership-class high-performance computer systems to analyze the reliability impact of hardware resilience schemes that are deployed in current systems. Our study has several key findings about the efficacy of many currently deployed reliability techniques such as DRAM ECC, DDR address/command parity, and SRAM ECC and parity. We also perform a methodological study, and find that counting errors instead of faults, a common practice among researchers and data center operators, can lead to incorrect conclusions about system reliability. Finally, we use our data to project the needs of future large-scale systems. We find that SRAM faults are unlikely to pose a significantly larger reliability threat in the future, while DRAM faults will be a major concern and stronger DRAM resilience schemes will be needed to maintain acceptable failure rates similar to those found on today's systems.
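The methodological point about counting errors instead of faults can be illustrated with a small sketch. It assumes the usual dependability terminology, in which a fault is the underlying defect and an error is one observed manifestation of it, so a single permanent fault can log many errors. The log entries, the node and DIMM names, and the grouping rule (errors at the same node, device, and address are attributed to one fault) are hypothetical simplifications chosen for illustration. They are not the paper's data or its fault-identification method.

```python
from collections import Counter

# Hypothetical corrected-error log: one (node, dram_device, address) tuple per
# logged error. All values are invented for illustration only.
error_log = [
    ("node-A", "dimm0", 0x1000),  # transient fault: a single error
    ("node-B", "dimm3", 0x2000),  # permanent fault: the same cell is
    ("node-B", "dimm3", 0x2000),  # read repeatedly, so it logs an error
    ("node-B", "dimm3", 0x2000),  # on every access
    ("node-B", "dimm3", 0x2000),
    ("node-C", "dimm1", 0x3000),  # transient fault: a single error
]


def count_errors_per_node(log):
    """Count every logged error, regardless of its underlying cause."""
    return Counter(node for node, _device, _address in log)


def count_faults_per_node(log):
    """Count distinct faults per node.

    Simplifying assumption: all errors at the same (node, device, address)
    are manifestations of one fault.
    """
    distinct_faults = set(log)
    return Counter(node for node, _device, _address in distinct_faults)


errors = count_errors_per_node(error_log)
faults = count_faults_per_node(error_log)

for node in sorted(errors):
    print(f"{node}: {errors[node]} errors, {faults[node]} faults")
```

In this toy log, node-B accounts for four of the six errors but only one of the three faults. An error-based metric would rank it as far less reliable than its peers, while a fault-based metric treats all three nodes alike.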

      • Published in

        ACM SIGARCH Computer Architecture News, Volume 43, Issue 1
        ASPLOS'15
        March 2015
        676 pages
        ISSN: 0163-5964
        DOI: 10.1145/2786763

        ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
        March 2015
        720 pages
        ISBN: 9781450328357
        DOI: 10.1145/2694344

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States
