Published in: International Journal of Parallel Programming 5/2016

1 October 2016

Asteroid: Scalable Online Memory Diagnostics for Multi-core, Multi-socket Servers

Authors: Musfiq Rahman, Bruce R. Childers

Abstract

Memory diagnostics are important to improving the resilience of DRAM main memory. As bit cell size reaches physical limits, DRAM memory will be more likely to suffer both transient and permanent errors. Memory diagnostics that operate online can be a component of a comprehensive strategy to mitigate errors. This paper presents a novel approach, Asteroid, to integrate online memory diagnostics during workload execution. The approach supports diagnostics that adapt at runtime to workload behavior and resource availability to maximize test quality while reducing performance overhead. We describe Asteroid’s design and how it can be efficiently integrated with a hierarchical memory allocator in modern operating systems. We also present how the framework enables control policies to dynamically configure a diagnostic. Using an adaptive policy on a 16-core server, Asteroid incurs a modest overhead of 1–4% for workloads with low to high memory demand. For these workloads, Asteroid’s adaptive policy achieves good error coverage and can thoroughly test memory.

Footnotes
1
We do not show a sensitivity study of the parameters because the parameters should be tuned to a given target system with offline profiling. This tuning is orthogonal to our contribution and relatively uninteresting.
 
Metadata
Title
Asteroid: Scalable Online Memory Diagnostics for Multi-core, Multi-socket Servers
Authors
Musfiq Rahman
Bruce R. Childers
Publication date
1 October 2016
Publisher
Springer US
Published in
International Journal of Parallel Programming / Issue 5/2016
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-016-0400-2
