Published in: Journal of Electronic Testing 2/2013

01-04-2013

Self-Adaptive Fault Tolerance in Multi-/Many-Core Systems

Authors: Cristiana Bolchini, Matteo Carminati, Antonio Miele

Abstract

This paper presents a novel approach to the design of multi-/many-core systems with an adaptive level of reliability. The approach defines a layer at the operating system level that achieves fault detection/tolerance/diagnosis properties by means of thread replication and re-execution mechanisms. The layer applies the most convenient hardening mechanism to achieve the desired trade-off between reliability and performance by adapting at run-time to the changes of the working scenario. The proposed strategy has been applied in a set of experimental sessions considering a real-world parallel application, to evaluate its benefits on the final system with respect to various strategies selected at design time.
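The two hardening mechanisms named in the abstract, thread replication and re-execution, can be illustrated with a minimal sketch. This is not the authors' operating-system layer: the function names, the retry limit, the fault-rate threshold, and the selection policy in `choose_mechanism` are all assumptions made purely for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_replicated(task, arg, replicas, pool):
    """Run `replicas` copies of task(arg) in parallel and collect the results."""
    futures = [pool.submit(task, arg) for _ in range(replicas)]
    return [f.result() for f in futures]


def duplicate_with_reexecution(task, arg, pool, max_retries=3):
    """Fault detection: compare two replicas and re-execute on a mismatch."""
    for _ in range(max_retries + 1):
        first, second = run_replicated(task, arg, 2, pool)
        if first == second:
            return first
    raise RuntimeError("replicas kept disagreeing")


def triplicate_with_voting(task, arg, pool):
    """Fault tolerance: mask one faulty replica by majority vote over three."""
    results = run_replicated(task, arg, 3, pool)  # results must be hashable
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority among replicas")
    return value


def choose_mechanism(observed_fault_rate, threshold=0.1):
    """Hypothetical run-time policy (an assumption, not taken from the paper).

    When faults are rare, duplication occupies fewer cores and the occasional
    re-execution is cheap; when faults are frequent, triplication avoids
    paying the re-execution delay over and over.
    """
    if observed_fault_rate > threshold:
        return triplicate_with_voting
    return duplicate_with_reexecution
```

A caller would open a `ThreadPoolExecutor`, pick a mechanism with `choose_mechanism(rate)` based on whatever fault rate it currently observes, and invoke the returned function as `mechanism(task, arg, pool)`. Re-evaluating that choice as the observed rate changes is the kind of run-time adaptation the abstract describes, here reduced to a single threshold.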

Metadata

Title: Self-Adaptive Fault Tolerance in Multi-/Many-Core Systems
Authors: Cristiana Bolchini, Matteo Carminati, Antonio Miele
Publication date: 01-04-2013
Publisher: Springer US
Published in: Journal of Electronic Testing, Issue 2/2013
Print ISSN: 0923-8174
Electronic ISSN: 1573-0727
DOI: https://doi.org/10.1007/s10836-013-5367-y
