Published in: The Journal of Supercomputing 17/2022

13 June 2022

Studying error propagation on application data structure and hardware

Authors: Zuhal Ozturk, Haluk Rahmi Topcuoglu, Mahmut Taylan Kandemir



Abstract

As technology scales, transistors shrink, and aggressive power optimization techniques are combined with high operating frequencies and performance-enhancing microarchitectural techniques to achieve ever higher performance and power efficiency. Unfortunately, these developments make modern systems more vulnerable to soft errors, which are becoming a critical issue in both the hardware and software domains. Motivated by this observation, we propose, implement, and evaluate two error propagation metrics that characterize error propagation at the software and hardware levels. The first metric measures error propagation on program data structures, whereas the second measures the fraction of corrupted locations in the cache memory structure over a given period of time. We evaluate the proposed metrics through an empirical study of two application programs using both single-threaded and multi-threaded executions, varying experimental parameters such as thread count, error rate, location of errors, and architectural parameters. Our extensive experimental analysis reveals that error propagation over program data structures is highly dependent on application behavior. Further, depending on the cache parameters used, the propagation of errors in the cache can exhibit different patterns. This paper also discusses how the observed error propagation trends in program data structures and data caches correlate with each other, focusing in particular on the differences in error propagation speed between application data structures and data caches.
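The abstract does not give a formal definition of either metric. As a rough illustration of the second one only, the sketch below assumes that the cache is sampled at several points during an observation period and that each sample records which cache locations hold corrupted data. The function names, the snapshot representation, and the 8-line cache are hypothetical choices for this example, not the authors' implementation.

```python
from typing import Iterable, List, Set


def corrupted_fraction(corrupted: Set[int], total_locations: int) -> float:
    """Fraction of locations that hold corrupted data in one snapshot."""
    if total_locations <= 0:
        raise ValueError("total_locations must be positive")
    return len(corrupted) / total_locations


def mean_corrupted_fraction(snapshots: Iterable[Set[int]],
                            total_locations: int) -> float:
    """Average corrupted fraction over all snapshots of one period."""
    fractions: List[float] = [
        corrupted_fraction(s, total_locations) for s in snapshots
    ]
    if not fractions:
        raise ValueError("need at least one snapshot")
    return sum(fractions) / len(fractions)


# Hypothetical 8-line cache sampled three times; each set lists the
# indices of corrupted lines at that sample (an assumption of this sketch).
snapshots = [{2}, {2, 5}, {2, 5, 6, 7}]
per_snapshot = [corrupted_fraction(s, 8) for s in snapshots]
period_mean = mean_corrupted_fraction(snapshots, 8)
```

Under these assumptions, a rising sequence in `per_snapshot` would indicate that errors are spreading through the cache during the period, while `period_mean` summarizes the period as a single number. The same helper could be applied to the elements of an application data structure instead of cache lines, although the paper may define that metric differently.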


Footnotes
1
Clearly, OS data structures can also be corrupted; in this work, however, we focus exclusively on application data structures.

2
This is the case, for example, in embedded and mobile systems.

Literatur
1.
Zurück zum Zitat Rebaudengo M, Reorda MS, Violante M. An accurate analysis of the effects of soft errors in the instruction and data caches of a pipelined microprocessor. In (2003) Design. Autom Test Europe Conf Exhib 2003:602–607 Rebaudengo M, Reorda MS, Violante M. An accurate analysis of the effects of soft errors in the instruction and data caches of a pipelined microprocessor. In (2003) Design. Autom Test Europe Conf Exhib 2003:602–607
2.
Zurück zum Zitat Gold BT, Smolens JC, Falsafi B, Hoe JC. The granularity of soft-error containment in shared-memory multiprocessors; 2006 Gold BT, Smolens JC, Falsafi B, Hoe JC. The granularity of soft-error containment in shared-memory multiprocessors; 2006
3.
Zurück zum Zitat Smolens JC, Gold BT, Kim J, Falsafi B, Hoe JC, Nowatzyk AG. Fingerprinting: bounding soft-error detection latency and bandwidth. In: ACM SIGPLAN Notices. vol. 39; 2004. p. 224–234 Smolens JC, Gold BT, Kim J, Falsafi B, Hoe JC, Nowatzyk AG. Fingerprinting: bounding soft-error detection latency and bandwidth. In: ACM SIGPLAN Notices. vol. 39; 2004. p. 224–234
4.
Zurück zum Zitat Medeiros GE, Bortolon FT, Reis R, Ost L. Evaluation of compiler optimization flags effects on soft error resiliency. In: 2018 31st Symposium on Integrated Circuits and Systems Design (SBCCI); 2018. p. 1–6 Medeiros GE, Bortolon FT, Reis R, Ost L. Evaluation of compiler optimization flags effects on soft error resiliency. In: 2018 31st Symposium on Integrated Circuits and Systems Design (SBCCI); 2018. p. 1–6
5.
Zurück zum Zitat Gava J, Bandiera V, Reis R, Ost L. Evaluation of compilers effects on OpenMP soft error resiliency. In: 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI); 2019. p. 259–264 Gava J, Bandiera V, Reis R, Ost L. Evaluation of compilers effects on OpenMP soft error resiliency. In: 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI); 2019. p. 259–264
6.
Zurück zum Zitat Lins FM, Tambara LA, Kastensmidt FL, Rech P (2017) Register file criticality and compiler optimization effects on embedded microprocessor reliability. IEEE Trans Nucl Sci 64(8):2179–2187 Lins FM, Tambara LA, Kastensmidt FL, Rech P (2017) Register file criticality and compiler optimization effects on embedded microprocessor reliability. IEEE Trans Nucl Sci 64(8):2179–2187
7.
Zurück zum Zitat Baumann RC. Radiation-induced soft errors in advanced semiconductor technologies. IEEE Transactions on Device and Materials Reliability. 2005 Sept;5(3):305–316 Baumann RC. Radiation-induced soft errors in advanced semiconductor technologies. IEEE Transactions on Device and Materials Reliability. 2005 Sept;5(3):305–316
8.
Zurück zum Zitat Cappello F, Al G, Gropp W, Kale S, Kramer B, Snir M (2014) Toward exascale resilience: 2014 update. Supercomput Frontiers Innovations: Int J 1(1):5–28 Cappello F, Al G, Gropp W, Kale S, Kramer B, Snir M (2014) Toward exascale resilience: 2014 update. Supercomput Frontiers Innovations: Int J 1(1):5–28
9.
Zurück zum Zitat Baumann RC (2001) Soft errors in advanced semiconductor devices-part I: the three radiation sources. IEEE Trans Device Mater Reliab 1(1):17–22CrossRef Baumann RC (2001) Soft errors in advanced semiconductor devices-part I: the three radiation sources. IEEE Trans Device Mater Reliab 1(1):17–22CrossRef
10.
Zurück zum Zitat Shivakumar P, Kistler M, Keckler SW, Burger D, Alvisi L. Modeling the effect of technology trends on the soft error rate of combinational logic. In: Proceedings International Conference on Dependable Systems and Networks; 2002. p. 389–398 Shivakumar P, Kistler M, Keckler SW, Burger D, Alvisi L. Modeling the effect of technology trends on the soft error rate of combinational logic. In: Proceedings International Conference on Dependable Systems and Networks; 2002. p. 389–398
11.
Zurück zum Zitat O’Gorman TJ, Ross JM, Taber AH, Ziegler JF, Muhlfeld HP, Montrose CJ et al (1996) Field testing for cosmic ray soft errors in semiconductor memories. IBM J Res Dev 40(1):41–50CrossRef O’Gorman TJ, Ross JM, Taber AH, Ziegler JF, Muhlfeld HP, Montrose CJ et al (1996) Field testing for cosmic ray soft errors in semiconductor memories. IBM J Res Dev 40(1):41–50CrossRef
12.
Zurück zum Zitat Borkar S (2005) Designing reliable systems from unreliable components: the challenges of transistor variability and degradation. IEEE Micro 25(6):10–16CrossRef Borkar S (2005) Designing reliable systems from unreliable components: the challenges of transistor variability and degradation. IEEE Micro 25(6):10–16CrossRef
13.
Zurück zum Zitat Asadi GH, Sridharan V, Tahoori MB, Kaeli D. Balancing performance and reliability in the memory hierarchy. In: IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.; 2005. p. 269–279 Asadi GH, Sridharan V, Tahoori MB, Kaeli D. Balancing performance and reliability in the memory hierarchy. In: IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.; 2005. p. 269–279
14.
Zurück zum Zitat Sikai L, Jun Y. A method of soft error propagation based on cellular automata. In: 2018 IEEE 3rd International Conference on Cloud Computing and Big Data Analysis (ICCCBDA); 2018. p. 617–622 Sikai L, Jun Y. A method of soft error propagation based on cellular automata. In: 2018 IEEE 3rd International Conference on Cloud Computing and Big Data Analysis (ICCCBDA); 2018. p. 617–622
15.
Zurück zum Zitat Mukherjee SS, Kontz M, Reinhardt SK. Detailed design and evaluation of redundant multi-threading alternatives. In: Proceedings 29th Annual International Symposium on Computer Architecture; 2002. p. 99–110 Mukherjee SS, Kontz M, Reinhardt SK. Detailed design and evaluation of redundant multi-threading alternatives. In: Proceedings 29th Annual International Symposium on Computer Architecture; 2002. p. 99–110
16.
Zurück zum Zitat Reinhardt SK, Mukherjee SS. Transient fault detection via simultaneous multithreading. In: Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No. RS00201); 2000. p. 25–36 Reinhardt SK, Mukherjee SS. Transient fault detection via simultaneous multithreading. In: Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No. RS00201); 2000. p. 25–36
17.
Zurück zum Zitat Rotta R, Ferreira RS, Nolte J. Real-time dynamic hardware reconfiguration for processors with redundant functional units. In: 2020 IEEE 23rd International Symposium on Real-Time Distributed Computing (ISORC); 2020. p. 154–155 Rotta R, Ferreira RS, Nolte J. Real-time dynamic hardware reconfiguration for processors with redundant functional units. In: 2020 IEEE 23rd International Symposium on Real-Time Distributed Computing (ISORC); 2020. p. 154–155
18.
Zurück zum Zitat Ainsworth S, Jones TM. Parallel error detection using heterogeneous cores. In: 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN); 2018. p. 338–349 Ainsworth S, Jones TM. Parallel error detection using heterogeneous cores. In: 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN); 2018. p. 338–349
19.
Zurück zum Zitat Györök G, Beszédes B. Duplicated control unit based embedded fault-masking systems. In: 2017 IEEE 15th International Symposium on Intelligent Systems and Informatics (SISY); 2017. p. 283–288 Györök G, Beszédes B. Duplicated control unit based embedded fault-masking systems. In: 2017 IEEE 15th International Symposium on Intelligent Systems and Informatics (SISY); 2017. p. 283–288
20.
Zurück zum Zitat Reis GA, Chang J, Vachharajani N, Rangan R, August DI. SWIFT: Software implemented fault tolerance. In: Proceedings of the International Symposium on Code Generation and Optimization; 2005. p. 243–254 Reis GA, Chang J, Vachharajani N, Rangan R, August DI. SWIFT: Software implemented fault tolerance. In: Proceedings of the International Symposium on Code Generation and Optimization; 2005. p. 243–254
21.
Zurück zum Zitat Asghari SA, Marvasti MB, Rahmani AM (2018) Enhancing transient fault tolerance in embedded systems through an OS task level redundancy approach. Futur Gener Comput Syst 87:58–65CrossRef Asghari SA, Marvasti MB, Rahmani AM (2018) Enhancing transient fault tolerance in embedded systems through an OS task level redundancy approach. Futur Gener Comput Syst 87:58–65CrossRef
22.
Zurück zum Zitat Mahmoud A, Hari SKS, Sullivan MB, Tsai T, Keckler SW. Optimizing software-directed instruction replication for gpu error detection. In: SC18: International Conference for High Performance Computing, Networking, Storage and Analysis; 2018. p. 842–853 Mahmoud A, Hari SKS, Sullivan MB, Tsai T, Keckler SW. Optimizing software-directed instruction replication for gpu error detection. In: SC18: International Conference for High Performance Computing, Networking, Storage and Analysis; 2018. p. 842–853
23.
Zurück zum Zitat Thati VB, Vankeirsbilck J, Penneman N, Pissoort D, Boydens J. An improved data error detection technique for dependable embedded software. In: 2018 IEEE 23rd Pacific Rim International Symposium on Dependable Computing (PRDC); 2018. p. 213–220 Thati VB, Vankeirsbilck J, Penneman N, Pissoort D, Boydens J. An improved data error detection technique for dependable embedded software. In: 2018 IEEE 23rd Pacific Rim International Symposium on Dependable Computing (PRDC); 2018. p. 213–220
24.
Zurück zum Zitat Chen YS, Chen PS. A software-based redundant execution programming model for transient fault detection and correction. In: 2016 45th International Conference on Parallel Processing Workshops (ICPPW); 2016. p. 66–71 Chen YS, Chen PS. A software-based redundant execution programming model for transient fault detection and correction. In: 2016 45th International Conference on Parallel Processing Workshops (ICPPW); 2016. p. 66–71
25.
Zurück zum Zitat Vallero A, Savino A, Chatzidimitriou A, Kaliorakis M, Kooli M, Riera M et al (2018) SyRA: Early system reliability analysis for cross-layer soft errors resilience in memory arrays of microprocessor systems. IEEE Trans Comput 68(5):765–783MathSciNetCrossRefMATH Vallero A, Savino A, Chatzidimitriou A, Kaliorakis M, Kooli M, Riera M et al (2018) SyRA: Early system reliability analysis for cross-layer soft errors resilience in memory arrays of microprocessor systems. IEEE Trans Comput 68(5):765–783MathSciNetCrossRefMATH
26.
Zurück zum Zitat Cheng E, Mirkhani S, Szafaryn LG, Cher CY, Cho H, Skadron K, et al. CLEAR: Cross-layer exploration for architecting resilience-combining hardware and software techniques to tolerate soft errors in processor cores. In: Proceedings of the 53rd Annual Design Automation Conference; 2016. p. 1–6 Cheng E, Mirkhani S, Szafaryn LG, Cher CY, Cho H, Skadron K, et al. CLEAR: Cross-layer exploration for architecting resilience-combining hardware and software techniques to tolerate soft errors in processor cores. In: Proceedings of the 53rd Annual Design Automation Conference; 2016. p. 1–6
27.
Zurück zum Zitat Vallero A, Savino A, Politano G, Di Carlo S, Chatzidimitriou A, Tselonis S, et al. Cross-layer system reliability assessment framework for hardware faults. In: 2016 IEEE International Test Conference (ITC); 2016. p. 1–10 Vallero A, Savino A, Politano G, Di Carlo S, Chatzidimitriou A, Tselonis S, et al. Cross-layer system reliability assessment framework for hardware faults. In: 2016 IEEE International Test Conference (ITC); 2016. p. 1–10
28.
Zurück zum Zitat Gupta M, Sridharan V, Roberts D, Prodromou A, Venkat A, Tullsen D, et al. Reliability-aware data placement for heterogeneous memory architecture. In: 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA); 2018. p. 583–595 Gupta M, Sridharan V, Roberts D, Prodromou A, Venkat A, Tullsen D, et al. Reliability-aware data placement for heterogeneous memory architecture. In: 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA); 2018. p. 583–595
29.
Zurück zum Zitat Jaulmes L, Moretó M, Valero M, Erez M, Casas M. Runtime-guided ECC protection using online estimation of memory vulnerability. In: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis; 2020. p. 1–14 Jaulmes L, Moretó M, Valero M, Erez M, Casas M. Runtime-guided ECC protection using online estimation of memory vulnerability. In: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis; 2020. p. 1–14
30.
Zurück zum Zitat Asadi G, Tahoori MB. An analytical approach for soft error rate estimation in digital circuits. In: 2005 IEEE International Symposium on Circuits and Systems; 2005. p. 2991–2994 Asadi G, Tahoori MB. An analytical approach for soft error rate estimation in digital circuits. In: 2005 IEEE International Symposium on Circuits and Systems; 2005. p. 2991–2994
31.
Zurück zum Zitat Mukherjee SS, Emer J, Reinhardt SK. The soft error problem: An architectural perspective. In: 11th International Symposium on High-Performance Computer Architecture; 2005. p. 243–247 Mukherjee SS, Emer J, Reinhardt SK. The soft error problem: An architectural perspective. In: 11th International Symposium on High-Performance Computer Architecture; 2005. p. 243–247
32.
Zurück zum Zitat Weaver C, Emer J, Mukherjee SS, Reinhardt SK (2004) Techniques to reduce the soft error rate of a high-performance microprocessor. ACM SIGARCH Comput Archit News 32(2):264CrossRef Weaver C, Emer J, Mukherjee SS, Reinhardt SK (2004) Techniques to reduce the soft error rate of a high-performance microprocessor. ACM SIGARCH Comput Archit News 32(2):264CrossRef
33.
Zurück zum Zitat Upasani G, Vera X, González A. Reducing due-fit of caches by exploiting acoustic wave detectors for error recovery. In: 2013 IEEE 19th International On-Line Testing Symposium (IOLTS); 2013. p. 85–91 Upasani G, Vera X, González A. Reducing due-fit of caches by exploiting acoustic wave detectors for error recovery. In: 2013 IEEE 19th International On-Line Testing Symposium (IOLTS); 2013. p. 85–91
34.
Zurück zum Zitat Fratin V, Oliveira D, Lunardi C, Santos F, Rodrigues G, Rech P. Code-dependent and architecture-dependent reliability behaviors. In: 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN); 2018. p. 13–26 Fratin V, Oliveira D, Lunardi C, Santos F, Rodrigues G, Rech P. Code-dependent and architecture-dependent reliability behaviors. In: 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN); 2018. p. 13–26
35.
Zurück zum Zitat Utrera G, Gil M, Martorell X. Analysis of the impact factors on data error propagation in HPC applications. In: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP); 2018. p. 546–549 Utrera G, Gil M, Martorell X. Analysis of the impact factors on data error propagation in HPC applications. In: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP); 2018. p. 546–549
36.
Zurück zum Zitat Ferreira RR, Da Rolt J, Nazar GL, Moreira AF, Carro L. Adaptive low-power architecture for high-performance and reliable embedded computing. In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks; 2014. p. 538–549 Ferreira RR, Da Rolt J, Nazar GL, Moreira AF, Carro L. Adaptive low-power architecture for high-performance and reliable embedded computing. In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks; 2014. p. 538–549
37.
Zurück zum Zitat Hu J, Wang S, Ziavras SG. On the exploitation of narrow-width values for improving register file reliability. IEEE Transactions on Very Large Scale Integration (VLSI) systems. 2009;17(7):953–963 Hu J, Wang S, Ziavras SG. On the exploitation of narrow-width values for improving register file reliability. IEEE Transactions on Very Large Scale Integration (VLSI) systems. 2009;17(7):953–963
38.
Zurück zum Zitat Subasi O, Arias J, Unsal O, Labarta J, Cristal A. Nanocheckpoints: A task-based asynchronous dataflow framework for efficient and scalable checkpoint/restart. In: 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing; 2015. p. 99–102 Subasi O, Arias J, Unsal O, Labarta J, Cristal A. Nanocheckpoints: A task-based asynchronous dataflow framework for efficient and scalable checkpoint/restart. In: 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing; 2015. p. 99–102
39.
Zurück zum Zitat Ashraf RA, Gioiosa R, Kestor G, DeMara RF. Exploring the effect of compiler optimizations on the reliability of HPC applications. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW); 2017. p. 1274–1283 Ashraf RA, Gioiosa R, Kestor G, DeMara RF. Exploring the effect of compiler optimizations on the reliability of HPC applications. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW); 2017. p. 1274–1283
40.
Zurück zum Zitat Mukherjee SS, Weaver C, Emer J, Reinhardt SK, Austin T. A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor. In: Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36.; 2003. p. 29–40 Mukherjee SS, Weaver C, Emer J, Reinhardt SK, Austin T. A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor. In: Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36.; 2003. p. 29–40
41.
Zurück zum Zitat Mukherjee SS, Weaver CT, Emer J, Reinhardt SK, Austin T (2003) Measuring architectural vulnerability factors. IEEE Micro 23(6):70–75CrossRef Mukherjee SS, Weaver CT, Emer J, Reinhardt SK, Austin T (2003) Measuring architectural vulnerability factors. IEEE Micro 23(6):70–75CrossRef
42.
Zurück zum Zitat Zhang W. Computing cache vulnerability to transient errors and its implication. In: 20th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT’05); 2005. p. 427–435 Zhang W. Computing cache vulnerability to transient errors and its implication. In: 20th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT’05); 2005. p. 427–435
43.
Zurück zum Zitat Yan J, Zhang W. Compiler-guided register reliability improvement against soft errors. In: Proceedings of the 5th ACM International Conference on Embedded Software; 2005. p. 203–209 Yan J, Zhang W. Compiler-guided register reliability improvement against soft errors. In: Proceedings of the 5th ACM International Conference on Embedded Software; 2005. p. 203–209
44.
Zurück zum Zitat Jaulmes L, Moreto M, Valero M, Casas M. A vulnerability factor for ECC-protected memory. In: 2019 IEEE 25th International Symposium on On-Line Testing and Robust System Design (IOLTS); 2019. p. 176–181 Jaulmes L, Moreto M, Valero M, Casas M. A vulnerability factor for ECC-protected memory. In: 2019 IEEE 25th International Symposium on On-Line Testing and Robust System Design (IOLTS); 2019. p. 176–181
45.
Zurück zum Zitat Sridharan V, Kaeli DR. Eliminating microarchitectural dependency from architectural vulnerability. In: 2009 IEEE 15th International Symposium on High Performance Computer Architecture; 2009. p. 117–128 Sridharan V, Kaeli DR. Eliminating microarchitectural dependency from architectural vulnerability. In: 2009 IEEE 15th International Symposium on High Performance Computer Architecture; 2009. p. 117–128
46.
Zurück zum Zitat Borodin D, Juurlink BH. Protective redundancy overhead reduction using instruction vulnerability factor. In: Proceedings of the 7th ACM International Conference on Computing Frontiers; 2010. p. 319–326 Borodin D, Juurlink BH. Protective redundancy overhead reduction using instruction vulnerability factor. In: Proceedings of the 7th ACM International Conference on Computing Frontiers; 2010. p. 319–326
47.
Zurück zum Zitat Yu L, Li D, Mittal S, Vetter JS. Quantitatively modeling application resilience with the data vulnerability factor. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis; 2014. p. 695–706 Yu L, Li D, Mittal S, Vetter JS. Quantitatively modeling application resilience with the data vulnerability factor. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis; 2014. p. 695–706
48.
Zurück zum Zitat Oz I, Topcuoglu HR, Kandemir M, Tosun O (2012) Thread vulnerability in parallel applications. J Parallel Distribut Comput 72(10):1171–1185CrossRef Oz I, Topcuoglu HR, Kandemir M, Tosun O (2012) Thread vulnerability in parallel applications. J Parallel Distribut Comput 72(10):1171–1185CrossRef
49.
Zurück zum Zitat Hiller M, Jhumka A, Suri N. On the placement of software mechanisms for detection of data errors. In: Proceedings International Conference on Dependable Systems and Networks; 2002. p. 135–144 Hiller M, Jhumka A, Suri N. On the placement of software mechanisms for detection of data errors. In: Proceedings International Conference on Dependable Systems and Networks; 2002. p. 135–144
50.
Zurück zum Zitat Leeke M, Jhumka A. Towards understanding the importance of variables in dependable software. In: 2010 European Dependable Computing Conference; 2010. p. 85–94 Leeke M, Jhumka A. Towards understanding the importance of variables in dependable software. In: 2010 European Dependable Computing Conference; 2010. p. 85–94
51.
Zurück zum Zitat Utrera G, Gil M, Martorell X. Analyzing data-error propagation effects in high-performance computing. In: 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP); 2016. p. 418–421 Utrera G, Gil M, Martorell X. Analyzing data-error propagation effects in high-performance computing. In: 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP); 2016. p. 418–421
52.
Zurück zum Zitat Ashraf RA, Gioiosa R, Kestor G, DeMara RF, Cher CY, Bose P. Understanding the propagation of transient errors in HPC applications. In: SC’15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis; 2015. p. 1–12 Ashraf RA, Gioiosa R, Kestor G, DeMara RF, Cher CY, Bose P. Understanding the propagation of transient errors in HPC applications. In: SC’15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis; 2015. p. 1–12
53.
Zurück zum Zitat Guo L, Li D. Moard: Modeling application resilience to transient faults on data objects. In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS); 2019. p. 878–889 Guo L, Li D. Moard: Modeling application resilience to transient faults on data objects. In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS); 2019. p. 878–889
54.
Zurück zum Zitat Shantharam M, Srinivasmurthy S, Raghavan P. Characterizing the impact of soft errors on iterative methods in scientific computing. In: Proceedings of the International Conference on Supercomputing; 2011. p. 152–161 Shantharam M, Srinivasmurthy S, Raghavan P. Characterizing the impact of soft errors on iterative methods in scientific computing. In: Proceedings of the International Conference on Supercomputing; 2011. p. 152–161
55.
Zurück zum Zitat Moríñigo JA, Bustos A, Mayo-García R (2022) Error resilience of three GMRES implementations under fault injection. J Supercomput 78(5):7158–7185CrossRef Moríñigo JA, Bustos A, Mayo-García R (2022) Error resilience of three GMRES implementations under fault injection. J Supercomput 78(5):7158–7185CrossRef
56.
Zurück zum Zitat Guo L, Li D, Laguna I, Schulz M. Fliptracker: Understanding natural error resilience in hpc applications. In: SC18: International Conference for High Performance Computing, Networking, Storage and Analysis; 2018. p. 94–107 Guo L, Li D, Laguna I, Schulz M. Fliptracker: Understanding natural error resilience in hpc applications. In: SC18: International Conference for High Performance Computing, Networking, Storage and Analysis; 2018. p. 94–107
57.
Zurück zum Zitat Fiala D, Mueller F, Engelmann C, Riesen R, Ferreira K, Brightwell R. Detection and correction of silent data corruption for large-scale high-performance computing. In: SC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis; 2012. p. 1–12 Fiala D, Mueller F, Engelmann C, Riesen R, Ferreira K, Brightwell R. Detection and correction of silent data corruption for large-scale high-performance computing. In: SC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis; 2012. p. 1–12
58.
Zurück zum Zitat Guan Q, Hu X, Grove T, Fang B, Jiang H, Yin H, et al. Chaser: An enhanced fault injection tool for tracing soft errors in mpi applications. In: 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN); 2020. p. 355–363 Guan Q, Hu X, Grove T, Fang B, Jiang H, Yin H, et al. Chaser: An enhanced fault injection tool for tracing soft errors in mpi applications. In: 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN); 2020. p. 355–363
59.
Zurück zum Zitat DeFreez D, Bhowmick A, Laguna I, Rubio-González C. Detecting and reproducing error-code propagation bugs in MPI implementations. In: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming; 2020. p. 187–201 DeFreez D, Bhowmick A, Laguna I, Rubio-González C. Detecting and reproducing error-code propagation bugs in MPI implementations. In: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming; 2020. p. 187–201
60.
Zurück zum Zitat Somani AK, Trivedi KS. A cache error propagation model. In: Proceedings Pacific Rim International Symposium on Fault-Tolerant Systems; 1997. p. 15–21 Somani AK, Trivedi KS. A cache error propagation model. In: Proceedings Pacific Rim International Symposium on Fault-Tolerant Systems; 1997. p. 15–21
61.
Zurück zum Zitat Li ML, Ramachandran P, Sahoo SK, Adve SV, Adve VS, Zhou Y (2008) Understanding the propagation of hard errors to software and implications for resilient system design. ACM Sigplan Notice 43(3):265–276CrossRef Li ML, Ramachandran P, Sahoo SK, Adve SV, Adve VS, Zhou Y (2008) Understanding the propagation of hard errors to software and implications for resilient system design. ACM Sigplan Notice 43(3):265–276CrossRef
62.
Zurück zum Zitat Gu J, Zheng W, Zhuang Y, Zhang Q (2019) Vulnerability analysis of instructions for SDC-causing error detection. IEEE Access 7:168885–168898CrossRef Gu J, Zheng W, Zhuang Y, Zhang Q (2019) Vulnerability analysis of instructions for SDC-causing error detection. IEEE Access 7:168885–168898CrossRef
63.
Zurück zum Zitat Li Z, Menon H, Mohror K, Bremer PT, Livant Y, Pascucci V. Understanding a program’s resiliency through error propagation. In: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming; 2021. p. 362–373 Li Z, Menon H, Mohror K, Bremer PT, Livant Y, Pascucci V. Understanding a program’s resiliency through error propagation. In: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming; 2021. p. 362–373
64.
Zurück zum Zitat Li G, Pattabiraman K, Hari SKS, Sullivan M, Tsai T. Modeling soft-error propagation in programs. In: 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN); 2018. p. 27–38 Li G, Pattabiraman K, Hari SKS, Sullivan M, Tsai T. Modeling soft-error propagation in programs. In: 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN); 2018. p. 27–38
65.
Zurück zum Zitat Li G, Pattabiraman K. Modeling input-dependent error propagation in programs. In: 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN); 2018. p. 279–290 Li G, Pattabiraman K. Modeling input-dependent error propagation in programs. In: 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN); 2018. p. 279–290
66.
Anwer AR, Li G, Pattabiraman K, Sullivan M, Tsai T, Hari SKS (2020) GPU-Trident: efficient modeling of error propagation in GPU programs. In: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp 1–15
67.
Li Z, Menon H, Maljovec D, Livnat Y, Liu S, Mohror K et al (2020) SpotSDC: revealing the silent data corruption propagation in high-performance computing systems. IEEE Trans Visual Comput Graph 27(10):3938–3952
68.
Previlon F, Kalra C, Tiwari D, Kaeli D (2022) Characterizing and exploiting soft error vulnerability phase behavior in GPU applications. IEEE Trans Dependable Secure Comput 19(1):288–300
69.
Ko Y, Jeyapaul R, Kim Y, Lee K, Shrivastava A (2017) Protecting caches from soft errors: a microarchitect’s perspective. ACM Trans Embed Comput Sys (TECS) 16(4):1–28
70.
Mittal S, Vetter JS (2016) Reducing soft-error vulnerability of caches using data compression. In: Proceedings of the 26th Edition on Great Lakes Symposium on VLSI, pp 197–202
71.
Houssany S, Guibbaud N, Bougerol A, Leveugle R, Miller F, Buard N (2012) Microprocessor soft error rate prediction based on cache memory analysis. IEEE Trans Nucl Sci 59(4):980–987
72.
Vijayan A, Koneru A, Ebrahimi M, Chakrabarty K, Tahoori MB (2016) Online soft-error vulnerability estimation for memory arrays. In: 2016 IEEE 34th VLSI Test Symposium (VTS), pp 1–6
73.
Mamoutova OV, Antonov AP, Filippov AS (2017) On design of cache with efficient soft error protection. In: 2017 IEEE 37th International Conference on Electronics and Nanotechnology (ELNANO), pp 57–60
74.
Parasyris K, Tziantzoulis G, Antonopoulos CD, Bellas N (2014) GemFI: a fault injection tool for studying the behavior of applications on unreliable substrates. In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp 622–629
75.
Sangchoolie B, Pattabiraman K, Karlsson J (2017) One bit is (not) enough: an empirical study of the impact of single and multiple bit-flip errors. In: 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp 97–108
76.
Lu Q, Farahani M, Wei J, Thomas A, Pattabiraman K (2015) LLFI: an intermediate code-level fault injection tool for hardware faults. In: 2015 IEEE International Conference on Software Quality, Reliability and Security, pp 11–16
77.
Davis TA, Hu Y (2011) The University of Florida sparse matrix collection. ACM Trans Mathemat Software (TOMS) 38(1):1–25
Metadata
Title
Studying error propagation on application data structure and hardware
Authors
Zuhal Ozturk
Haluk Rahmi Topcuoglu
Mahmut Taylan Kandemir
Publication date
13 June 2022
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 17/2022
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-022-04625-x
