research-article

DRAM errors in the wild: a large-scale field study

Published: 15 June 2009

Abstract

Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days.

The goal of this paper is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology and DIMM age?

We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors rather than soft errors, which previous work suspected to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field when all other factors are taken into account. Finally, contrary to common fears, we do not observe any indication that newer generations of DIMMs have worse error behavior.
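To give a feel for the scale of the rates quoted above, the following minimal sketch converts an error rate expressed in errors per billion device hours per Mbit into a mean number of errors per device-year. The function name `expected_errors_per_year` and the 1 GiB DIMM capacity are illustrative assumptions, not values taken from the paper. The result is a fleet-wide average only: since the abstract also reports that only somewhat more than 8% of DIMMs see errors in a year, such an average would be concentrated in a minority of DIMMs rather than spread evenly.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def expected_errors_per_year(rate_per_mbit: float, capacity_mbit: float) -> float:
    """Mean errors per device-year implied by a rate given in
    errors per billion device hours per Mbit (hypothetical helper)."""
    errors_per_hour = rate_per_mbit * capacity_mbit / 1e9
    return errors_per_hour * HOURS_PER_YEAR


# Assumption for illustration: a 1 GiB DIMM, i.e. 8 * 1024 Mbit.
capacity_mbit = 8 * 1024

low = expected_errors_per_year(25_000, capacity_mbit)
high = expected_errors_per_year(70_000, capacity_mbit)

print(f"mean errors per DIMM-year: {low:.0f} to {high:.0f}")
```

Under these assumptions the quoted range works out to roughly 1,800 to 5,000 errors per DIMM-year on average.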

Published in

ACM SIGMETRICS Performance Evaluation Review, Volume 37, Issue 1
SIGMETRICS '09
June 2009
320 pages
ISSN: 0163-5999
DOI: 10.1145/2492101

SIGMETRICS '09: Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
June 2009
336 pages
ISBN: 9781605585116
DOI: 10.1145/1555349

Copyright © 2009 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States
