Published in: Journal of Electronic Testing, Issue 3/2022

8 June 2022

Deep Soft Error Propagation Modeling Using Graph Attention Network

Authors: Junchi Ma, Zongtao Duan, Lei Tang

Abstract

Soft errors are becoming more frequent in computer systems as feature sizes shrink. A soft error can produce an incorrect output without raising any warning in the system, a failure known as silent data corruption (SDC), which is therefore difficult to detect. To prevent SDC effectively, protection techniques require fine-grained profiling of SDC-prone instructions, which is often obtained by applying machine learning models. However, these models rely on handcrafted features and cannot reason about SDC propagation, which degrades their SDC prediction performance. We propose GATPS, a novel Graph Attention neTwork to Predict SDC-prone instructions. GATPS uses a heterogeneous graph representation whose edge types encode the various relations between instructions. By stacking layers in which each node attends over the features of its neighborhood, GATPS automatically captures the structural features that contribute to SDC propagation. The attention mechanism computes an importance value for each neighboring node, which quantifies the effect of a fault on that neighbor. Moreover, GATPS is inductive: it can be applied to unseen programs without retraining and requires no fault injection information from the target program. In experiments, GATPS achieved a 34% higher F1 score than the baseline method and a 40-fold speedup over the fault injection approach.
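
To make the attention step concrete, the following is a minimal sketch of a standard single-head graph attention layer in plain Python. It is not the GATPS implementation: the abstract does not give the model's equations, so the toy instruction graph, the feature vectors, the weight matrix `W`, the attention vector `a`, and all function names are illustrative assumptions, and the edge types of the heterogeneous graph are not modeled here. The sketch only shows how softmax-normalized importance values over a node's neighbors can be computed and used to aggregate neighbor features.

```python
import math

def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def attention_coefficients(i, neighbors, features, W, a):
    """Return {j: alpha_ij}: softmax-normalized importance of each neighbor j for node i."""
    wh_i = matvec(W, features[i])
    scores = {}
    for j in neighbors:
        wh_j = matvec(W, features[j])
        concat = wh_i + wh_j  # list concatenation: [W h_i || W h_j]
        scores[j] = leaky_relu(sum(ak * ck for ak, ck in zip(a, concat)))
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {j: math.exp(s - top) for j, s in scores.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}

def gat_layer(graph, features, W, a):
    """One attention layer: each node aggregates its neighbors' transformed features."""
    out = {}
    for i, neighbors in graph.items():
        alpha = attention_coefficients(i, neighbors, features, W, a)
        agg = [0.0] * len(W)
        for j, alpha_ij in alpha.items():
            wh_j = matvec(W, features[j])
            for k in range(len(agg)):
                agg[k] += alpha_ij * wh_j[k]
        out[i] = [max(0.0, v) for v in agg]  # ReLU nonlinearity
    return out

# Hypothetical toy graph: three instructions, each with a self-loop.
graph = {
    "load": ["load", "add"],
    "add": ["add", "load", "store"],
    "store": ["store", "add"],
}
# Hypothetical 2-dimensional instruction features and parameters.
features = {"load": [1.0, 0.0], "add": [0.0, 1.0], "store": [1.0, 1.0]}
W = [[0.5, -0.2], [0.1, 0.8]]
a = [0.3, -0.1, 0.2, 0.4]

alpha_add = attention_coefficients("add", graph["add"], features, W, a)
hidden = gat_layer(graph, features, W, a)
```

Here `alpha_add` holds one positive weight per neighbor of the `add` node, and the weights sum to 1. Stacking several such layers lets information from instructions more than one hop away reach a node, which is the sense in which a stacked model can pick up structural features of the graph.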

Metadata
Title
Deep Soft Error Propagation Modeling Using Graph Attention Network
Authors
Junchi Ma
Zongtao Duan
Lei Tang
Publication date
8 June 2022
Publisher
Springer US
Published in
Journal of Electronic Testing, Issue 3/2022
Print ISSN: 0923-8174
Electronic ISSN: 1573-0727
DOI
https://doi.org/10.1007/s10836-022-06005-y
