Published in: The Journal of Supercomputing 12/2016

1 December 2016

Rolex: resilience-oriented language extensions for extreme-scale systems

Authors: Saurabh Hukerikar, Robert F. Lucas


Abstract

Future exascale high-performance computing (HPC) systems will be constructed from VLSI devices that will be less reliable than those used today, and faults will become the norm, not the exception. This will pose significant problems for system designers and programmers, who for half a century have enjoyed an execution model that assumed correct behaviour by the underlying computing system. The mean time to failure of the system scales inversely with the number of components in the system; therefore, faults and the resulting system-level failures will increase as systems scale in the number of processor cores and memory modules used. However, not every detected error need cause catastrophic failure. Many HPC applications are inherently fault resilient. Yet application programmers, who have this knowledge, lack mechanisms to convey it to the system. In this paper, we present new Resilience Oriented Language Extensions (Rolex), which facilitate the incorporation of fault resilience as an intrinsic property of the application code. We describe the syntax and semantics of the language extensions as well as the implementation of the supporting compiler infrastructure and runtime system. Our experiments show that an approach that leverages the programmer’s insight to reason about the context and significance of faults to the application outcome significantly improves the probability that an application runs to a successful conclusion.
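The inverse scaling of mean time to failure with component count mentioned in the abstract can be illustrated with a minimal sketch. It assumes a simple series model in which all components are identical, fail independently, and have exponentially distributed lifetimes. Neither that model nor the function name `system_mttf_hours` nor the numbers below come from the paper; they are hypothetical and chosen only for illustration.

```python
def system_mttf_hours(component_mttf_hours: float, n_components: int) -> float:
    """Return the MTTF of a system that fails when any one component fails.

    Assumption (not from the paper): components are identical and independent
    with exponentially distributed lifetimes, so their failure rates add and
    the system MTTF is the component MTTF divided by the component count.
    """
    if n_components < 1:
        raise ValueError("n_components must be at least 1")
    return component_mttf_hours / n_components


# Hypothetical per-component MTTF, for illustration only.
component_mttf = 1_000_000.0  # hours

for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} components -> system MTTF {system_mttf_hours(component_mttf, n)} hours")
```

Under these assumptions, a thousandfold increase in component count shortens the expected time between system failures by the same factor, which is why per-component reliability alone may not be enough at extreme scale.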


Metadata
Title
Rolex: resilience-oriented language extensions for extreme-scale systems
Authors
Saurabh Hukerikar
Robert F. Lucas
Publication date
1 December 2016
Publisher
Springer US
Published in
The Journal of Supercomputing, Issue 12/2016
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-016-1752-5
