Published in: International Journal of Parallel Programming 5/2016

01-10-2016

Asteroid: Scalable Online Memory Diagnostics for Multi-core, Multi-socket Servers

Authors: Musfiq Rahman, Bruce R. Childers

Abstract

Memory diagnostics are important for improving the resilience of DRAM main memory. As bit cell size reaches physical limits, DRAM will become more likely to suffer both transient and permanent errors. Memory diagnostics that operate online can be one component of a comprehensive strategy to mitigate these errors. This paper presents a novel approach, Asteroid, that integrates online memory diagnostics into workload execution. The approach supports diagnostics that adapt at runtime to workload behavior and resource availability, maximizing test quality while reducing performance overhead. We describe Asteroid's design and how it can be efficiently integrated with a hierarchical memory allocator in modern operating systems. We also present how the framework enables control policies to configure a diagnostic dynamically. Using an adaptive policy on a 16-core server, Asteroid incurs a modest overhead of 1–4% for workloads with low to high memory demand. For these workloads, Asteroid's adaptive policy achieves good error coverage and can thoroughly test memory.
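To make the idea of an adaptive online diagnostic more concrete, the following is a minimal user-space sketch, not Asteroid's implementation. It assumes a byte-wide March C- test as the diagnostic and a hypothetical policy function, `pages_to_test`, that scales the amount of memory tested with idle CPU and free pages. All names, parameters and the `StuckAtZero` fault simulator are illustrative assumptions. A real online diagnostic would operate on physical pages inside the operating system rather than on a Python buffer.

```python
from typing import List, MutableSequence

ZERO, ONE = 0x00, 0xFF  # data backgrounds for a byte-wide cell


def march_c_minus(mem: MutableSequence[int]) -> List[int]:
    """Run March C- over `mem`; return sorted indices that read back wrong."""
    n = len(mem)
    faulty = set()

    def element(order, expect, write):
        # One march element: visit each cell in `order`, optionally read and
        # compare against `expect`, then optionally write `write`.
        for i in order:
            if expect is not None and mem[i] != expect:
                faulty.add(i)
            if write is not None:
                mem[i] = write

    up, down = range(n), range(n - 1, -1, -1)
    element(up, None, ZERO)    # any order:  w0
    element(up, ZERO, ONE)     # ascending:  r0, w1
    element(up, ONE, ZERO)     # ascending:  r1, w0
    element(down, ZERO, ONE)   # descending: r0, w1
    element(down, ONE, ZERO)   # descending: r1, w0
    element(up, ZERO, None)    # any order:  r0
    return sorted(faulty)


def pages_to_test(free_pages: int, idle_fraction: float, max_pages: int = 1024) -> int:
    """Hypothetical policy: scale the batch with idle CPU, capped by free memory."""
    idle = min(max(idle_fraction, 0.0), 1.0)
    return min(free_pages, int(max_pages * idle))


class StuckAtZero(list):
    """Simulated memory in which one cell always reads back as 0 (assumption)."""

    def __init__(self, size: int, bad: int):
        super().__init__([0] * size)
        self.bad = bad

    def __getitem__(self, i):
        return 0 if i == self.bad else super().__getitem__(i)
```

In this sketch, a control loop would call `pages_to_test` periodically and run `march_c_minus` on that many free pages, so the diagnostic backs off when the workload needs CPU or memory and tests more aggressively when resources are idle.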

Footnotes
1. We do not show a sensitivity study of the parameters because the parameters should be tuned to a given target system with offline profiling. This tuning is orthogonal to our contribution and relatively uninteresting.

Metadata
Title
Asteroid: Scalable Online Memory Diagnostics for Multi-core, Multi-socket Servers
Authors
Musfiq Rahman
Bruce R. Childers
Publication date
01-10-2016
Publisher
Springer US
Published in
International Journal of Parallel Programming / Issue 5/2016
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-016-0400-2
