Published in: International Journal of Parallel Programming, Issue 3/2021

28 March 2021

Predicting the Soft Error Vulnerability of Parallel Applications Using Machine Learning

Authors: Işıl Öz, Sanem Arslan

Abstract

As multicore systems built from ever-smaller transistors become widespread, soft errors have become an important concern for parallel program execution. Fault injection is a prevalent method for quantifying the soft error rates of applications, but detailed fault injection experiments are very time-consuming. Prediction-based techniques have therefore been proposed to evaluate soft error vulnerability more quickly. In this work, we present a soft error vulnerability prediction approach for parallel applications that uses machine learning algorithms. We define a set of features covering thread communication, data sharing, parallel programming, and performance characteristics, and train our models with three ML algorithms. This study is the first to use parallel programming features, as well as the combination of all features, in vulnerability prediction for parallel programs. We propose two models for soft error vulnerability prediction: (1) a regression model with rigorous feature selection analysis that estimates correct execution rates, and (2) a novel classification model that predicts the vulnerability level of the target programs. We obtain a maximum prediction accuracy of 73.2% for the regression-based model and an F-score of 89% for the classification model.
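To make the two prediction targets and the reported metric concrete, the following minimal Python sketch shows one way a correct execution rate could be derived from fault injection outcomes, mapped to a vulnerability level, and scored with an F-score. It is an illustration under stated assumptions, not the authors' implementation: the outcome labels (`"correct"`, `"sdc"`, `"crash"`), the level names, and the thresholds in `vulnerability_level` are hypothetical, since the abstract does not define them.

```python
from collections import Counter


def correct_execution_rate(outcomes):
    """Fraction of fault injection runs that finished with correct output.

    `outcomes` is a list of per-run labels; the label names are hypothetical.
    """
    counts = Counter(outcomes)
    return counts["correct"] / len(outcomes)


def vulnerability_level(rate, thresholds=(0.5, 0.8)):
    """Map a correct execution rate to a coarse level.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    low, high = thresholds
    if rate >= high:
        return "low"
    if rate >= low:
        return "medium"
    return "high"


def f_score(actual, predicted, positive):
    """F1 score for one class: harmonic mean of precision and recall."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Ten hypothetical fault injection runs for one program.
runs = ["correct"] * 7 + ["sdc"] * 2 + ["crash"]
rate = correct_execution_rate(runs)   # regression target
level = vulnerability_level(rate)     # classification target

# Hypothetical true vs. predicted levels for five programs.
actual = ["high", "high", "low", "medium", "high"]
predicted = ["high", "low", "low", "medium", "high"]
score_high = f_score(actual, predicted, positive="high")
```

In this sketch the regression model would be trained to estimate `rate` directly from program features, while the classification model would predict `level`; a multi-class F-score would typically average the per-class value computed by `f_score` over all levels.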

Metadata
Title
Predicting the Soft Error Vulnerability of Parallel Applications Using Machine Learning
Authors
Işıl Öz
Sanem Arslan
Publication date
28 March 2021
Publisher
Springer US
Published in
International Journal of Parallel Programming, Issue 3/2021
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-021-00707-0
