Published in: The Journal of Supercomputing, Issue 6/2022

Published: 7 January 2022

Evaluating low-level software-based hardening techniques for configurable GPU architectures

Authors: Marcio M. Goncalves, Josie E. Rodriguez Condia, Matteo Sonza Reorda, Luca Sterpone, Jose Rodrigo Azambuja


Abstract

The high processing power of GPUs makes them attractive for safety-critical applications, where transient effects are a major concern and resilience must be enforced without compromising performance. Configurable softcore GPUs are a recent technology that allows detailed reliability assessment, which can guide the design of reliable GPU applications. This work investigates the reliability of the register files and the pipeline of a softcore GPU under radiation-induced faults and proposes software-based fault tolerance techniques to mitigate errors. Faults are simulated at the register transfer level in four case-study algorithms, and the Architectural Vulnerability Factor (AVF) and Mean Workload to Failure (MWTF) are evaluated across different GPU configurations. Results indicate that software-based techniques efficiently reduce AVF. In terms of MWTF, the best cases depend on an optimized balance between GPU configuration, application runtime, and AVF.
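
To make the two metrics concrete, the following minimal sketch shows how they are commonly derived from a fault-injection campaign. It assumes the usual definitions: AVF as the fraction of injected faults that lead to an observable error, and MWTF as the inverse of raw error rate × AVF × execution time. The function names, outcome labels, and all numbers are hypothetical and are not taken from the paper.

```python
from collections import Counter


def avf(outcomes):
    """Estimate AVF as the fraction of injected faults that were not masked."""
    if not outcomes:
        raise ValueError("no fault-injection outcomes given")
    counts = Counter(outcomes)
    failures = len(outcomes) - counts["masked"]
    return failures / len(outcomes)


def mwtf(raw_error_rate, avf_value, exec_time):
    """Mean Workload to Failure: 1 / (raw error rate * AVF * execution time)."""
    return 1.0 / (raw_error_rate * avf_value * exec_time)


# Hypothetical campaigns of 100 injected faults each (illustrative labels).
baseline = ["masked"] * 60 + ["sdc"] * 30 + ["hang"] * 10
hardened = ["masked"] * 90 + ["sdc"] * 8 + ["hang"] * 2

avf_base = avf(baseline)
avf_hard = avf(hardened)

# Assume the hardened version runs 2.5x slower than the baseline.
# The raw error rate is the same for both, so it cancels in the ratio.
relative_mwtf = mwtf(1.0, avf_hard, 2.5) / mwtf(1.0, avf_base, 1.0)

print(f"AVF baseline: {avf_base:.2f}, AVF hardened: {avf_hard:.2f}")
print(f"Relative MWTF (hardened / baseline): {relative_mwtf:.2f}")
```

A relative MWTF above 1 means the hardened version completes more work per failure despite its runtime overhead. A value below 1 means the overhead outweighs the AVF reduction, which is the balance between runtime and AVF that the abstract refers to.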

Metadata
Title
Evaluating low-level software-based hardening techniques for configurable GPU architectures
Authors
Marcio M. Goncalves
Josie E. Rodriguez Condia
Matteo Sonza Reorda
Luca Sterpone
Jose Rodrigo Azambuja
Publication date
7 January 2022
Publisher
Springer US
Published in
The Journal of Supercomputing, Issue 6/2022
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-021-04154-z
