Skip to main content
Erschienen in: International Journal of Parallel Programming 2/2018

11.02.2017

RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading

verfasst von: Saurabh Hukerikar, Keita Teranishi, Pedro C. Diniz, Robert F. Lucas

Erschienen in: International Journal of Parallel Programming | Ausgabe 2/2018

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

In the presence of accelerated fault rates, which are projected to be the norm on future exascale systems, it will become increasingly difficult for high-performance computing (HPC) applications to accomplish useful computation. Due to the fault-oblivious nature of current HPC programming paradigms and execution environments, HPC applications are insufficiently equipped to deal with errors. We believe that HPC applications should be enabled with capabilities to actively search for and correct errors in their computations. The redundant multithreading (RMT) approach offers lightweight replicated execution streams of program instructions within the context of a single application process. However, the use of complete redundancy incurs significant overhead to the application performance.
In this paper we present RedThreads, an interface that provides application-level fault detection and correction based on RMT, but applies the thread-level redundancy adaptively. We describe the RedThreads syntax and semantics, and the supporting compiler infrastructure and runtime system. Our approach enables application programmers to scope the extent of redundant computation. Additionally, the runtime system permits the use of RMT to be dynamically enabled, or disabled, based on the resiliency needs of the application and the state of the system. Our experimental results demonstrate how adaptive RMT exploits programmer insight and runtime inference to dynamically navigate the trade-off space between an application’s resilience coverage and the associated performance overhead of redundant computation.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literatur
2.
Zurück zum Zitat Austin, T.M.: Diva: A reliable substrate for deep submicron microarchitecture design. In: Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, pp. 196–207 (1999) Austin, T.M.: Diva: A reliable substrate for deep submicron microarchitecture design. In: Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, pp. 196–207 (1999)
3.
Zurück zum Zitat Bernick, D., Bruckert, B., Vigna, P.D., Garcia, D., Jardine, R., Klecka, J., Smullen, J.: Nonstopadvanced architecture. In: Proceedings of the 2005 International Conference on Dependable Systems and Networks, DSN ’05, pp. 12–21 (2005) Bernick, D., Bruckert, B., Vigna, P.D., Garcia, D., Jardine, R., Klecka, J., Smullen, J.: Nonstopadvanced architecture. In: Proceedings of the 2005 International Conference on Dependable Systems and Networks, DSN ’05, pp. 12–21 (2005)
4.
Zurück zum Zitat Borkar, S.: Designing reliable systems from unreliable components: the challenges of transistor variability and degradation. IEEE Micro 25(6), 10–16 (2005)CrossRef Borkar, S.: Designing reliable systems from unreliable components: the challenges of transistor variability and degradation. IEEE Micro 25(6), 10–16 (2005)CrossRef
5.
Zurück zum Zitat Cheng, E., Mirkhani, S., Szafaryn, L.G., Cher, C.Y., Cho, H., Skadron, K., Stan, M.R., Lilja, K., Abraham, J.A., Bose, P., Mitra, S.: Clear: cross-layer exploration for architecting resilience—combining hardware and software techniques to tolerate soft errors in processor cores. In: Proceedings of the 53rd Annual Design Automation Conference, DAC ’16, pp. 68:1–68:6 (2016) Cheng, E., Mirkhani, S., Szafaryn, L.G., Cher, C.Y., Cho, H., Skadron, K., Stan, M.R., Lilja, K., Abraham, J.A., Bose, P., Mitra, S.: Clear: cross-layer exploration for architecting resilience—combining hardware and software techniques to tolerate soft errors in processor cores. In: Proceedings of the 53rd Annual Design Automation Conference, DAC ’16, pp. 68:1–68:6 (2016)
6.
Zurück zum Zitat Dongarra, J., Beckman, P., Moore, T., et al.: The international exascale software project roadmap. Int. J. High Perform. Comput. Appl. 3–60 (2011) Dongarra, J., Beckman, P., Moore, T., et al.: The international exascale software project roadmap. Int. J. High Perform. Comput. Appl. 3–60 (2011)
7.
Zurück zum Zitat Elnozahy, E., Bianchini, R., El-Ghazawi, T., et al.: System resilience at extreme scale. White Paper. Tech. rep, DARPA (2009) Elnozahy, E., Bianchini, R., El-Ghazawi, T., et al.: System resilience at extreme scale. White Paper. Tech. rep, DARPA (2009)
8.
Zurück zum Zitat Engelmann, C., Ong, H.H., Scott, S.L.: The case for modular redundancy in large-scale high performance computing systems. In: Proceedings of the 27th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), pp. 189–194 (2009) Engelmann, C., Ong, H.H., Scott, S.L.: The case for modular redundancy in large-scale high performance computing systems. In: Proceedings of the 27th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), pp. 189–194 (2009)
9.
Zurück zum Zitat Ferreira, K., Stearley, J., Laros III, J.H., et al.: Evaluating the viability of process replication reliability for exascale systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2011) Ferreira, K., Stearley, J., Laros III, J.H., et al.: Evaluating the viability of process replication reliability for exascale systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2011)
10.
Zurück zum Zitat Gomaa, M.A., Vijaykumar, T.N.: Opportunistic transient-fault detection. In: SIGARCH Computer Architecture News, pp. 172–183 (2005) Gomaa, M.A., Vijaykumar, T.N.: Opportunistic transient-fault detection. In: SIGARCH Computer Architecture News, pp. 172–183 (2005)
11.
Zurück zum Zitat Hoemmen, M., Heroux, M.A.: Fault-tolerant iterative methods via selective reliability. In: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE Computer Society, vol. 3, p. 9 (2011) Hoemmen, M., Heroux, M.A.: Fault-tolerant iterative methods via selective reliability. In: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE Computer Society, vol. 3, p. 9 (2011)
12.
Zurück zum Zitat Hukerikar, S., Diniz, P.C., Lucas, R.F., Teranishi, K.: Opportunistic application-level fault detection through adaptive redundant multithreading. In: International Conference on High Performance Computing Simulation (HPCS), pp. 243–250 (2014). doi:10.1109/HPCSim.2014.6903692 Hukerikar, S., Diniz, P.C., Lucas, R.F., Teranishi, K.: Opportunistic application-level fault detection through adaptive redundant multithreading. In: International Conference on High Performance Computing Simulation (HPCS), pp. 243–250 (2014). doi:10.​1109/​HPCSim.​2014.​6903692
14.
Zurück zum Zitat Hukerikar, S., Teranishi, K., Diniz, P.C., Lucas, R.F.: An evaluation of lazy fault detection based on adaptive redundant multithreading. In: IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6 (2014) doi:10.1109/HPEC.2014.7040999 Hukerikar, S., Teranishi, K., Diniz, P.C., Lucas, R.F.: An evaluation of lazy fault detection based on adaptive redundant multithreading. In: IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6 (2014) doi:10.​1109/​HPEC.​2014.​7040999
15.
Zurück zum Zitat Kogge, P., Bergman, K., Borkar, S., et al.: Exascale computing study: technology challenges in achieving exascale systems. Tech. rep, DARPA (2008) Kogge, P., Bergman, K., Borkar, S., et al.: Exascale computing study: technology challenges in achieving exascale systems. Tech. rep, DARPA (2008)
16.
Zurück zum Zitat Liao, C., Quinlan, D.J., Vuduc, R., Panas, T.: Effective source-to-source outlining to support whole program empirical optimization pp. 308–322 (2010) Liao, C., Quinlan, D.J., Vuduc, R., Panas, T.: Effective source-to-source outlining to support whole program empirical optimization pp. 308–322 (2010)
17.
Zurück zum Zitat Lidman, J., Quinlan, D.J., Liao, C., McKee, S.A.: ROSE::FTTransform—a source-to-source translation framework for exascale fault-tolerance research. In: Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on, pp. 1–6 (2012). doi:10.1109/DSNW.2012.6264672 Lidman, J., Quinlan, D.J., Liao, C., McKee, S.A.: ROSE::FTTransform—a source-to-source translation framework for exascale fault-tolerance research. In: Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on, pp. 1–6 (2012). doi:10.​1109/​DSNW.​2012.​6264672
18.
Zurück zum Zitat Moon, T.K.: Error correction coding: mathematical methods and algorithms. Wiley, New York (2005)CrossRefMATH Moon, T.K.: Error correction coding: mathematical methods and algorithms. Wiley, New York (2005)CrossRefMATH
19.
Zurück zum Zitat Mukherjee, S.S., Kontz, M., Reinhardt, S.K.: Detailed design and evaluation of redundant multithreading alternatives. In: SIGARCH Computer Architecture News, pp. 99–110. Wiley-Interscience, Hoboken, N.J. (2002) Mukherjee, S.S., Kontz, M., Reinhardt, S.K.: Detailed design and evaluation of redundant multithreading alternatives. In: SIGARCH Computer Architecture News, pp. 99–110. Wiley-Interscience, Hoboken, N.J. (2002)
20.
Zurück zum Zitat Oh, N., Shirvani, P.P., McCluskey, E.J.: Error detection by duplicated instructions in super-scalar processors. IEEE Trans. Reliab. pp. 63–75 (2002) Oh, N., Shirvani, P.P., McCluskey, E.J.: Error detection by duplicated instructions in super-scalar processors. IEEE Trans. Reliab. pp. 63–75 (2002)
21.
Zurück zum Zitat Parashar, A., Sivasubramaniam, A., Gurumurthi, S.: Slick: Slice-based locality exploitation for efficient redundant multithreading. SIGOPS Oper. Syst. Rev. 5, 95–105 (2006)CrossRef Parashar, A., Sivasubramaniam, A., Gurumurthi, S.: Slick: Slice-based locality exploitation for efficient redundant multithreading. SIGOPS Oper. Syst. Rev. 5, 95–105 (2006)CrossRef
23.
Zurück zum Zitat Reinhardt, S.K., Mukherjee, S.S.: Transient fault detection via simultaneous multithreading. In: Proceedings of the 27th Annual International Symposium on Computer Architecture, pp. 25–36 (2000) Reinhardt, S.K., Mukherjee, S.S.: Transient fault detection via simultaneous multithreading. In: Proceedings of the 27th Annual International Symposium on Computer Architecture, pp. 25–36 (2000)
24.
Zurück zum Zitat Reis, G., Chang, J., Vachharajani, N., et al.: SWIFT: software implemented fault tolerance. In: International Symposium on Code Generation and Optimization, pp. 243–254 (2005) Reis, G., Chang, J., Vachharajani, N., et al.: SWIFT: software implemented fault tolerance. In: International Symposium on Code Generation and Optimization, pp. 243–254 (2005)
25.
Zurück zum Zitat Sao, P., Vuduc, R.: Self-stabilizing iterative solvers. In: Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA ’13, pp. 4:1–4:8 (2013) Sao, P., Vuduc, R.: Self-stabilizing iterative solvers. In: Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA ’13, pp. 4:1–4:8 (2013)
26.
Zurück zum Zitat Shye, A., Blomstedt, J., Moseley, T., Reddi, V.J., Connors, D.A.: Plr: a software approach to transient fault tolerance for multicore architectures. IEEE Trans. Dependable Secure Comput. 6(2), 135–148 (2009)CrossRef Shye, A., Blomstedt, J., Moseley, T., Reddi, V.J., Connors, D.A.: Plr: a software approach to transient fault tolerance for multicore architectures. IEEE Trans. Dependable Secure Comput. 6(2), 135–148 (2009)CrossRef
27.
Zurück zum Zitat Siddiqua, T., Gurumurthi, S.: Balancing soft error coverage with lifetime reliability in redundantly multithreaded processors. In: 2009 IEEE International Symposium on Modeling, Analysis Simulation of Computer and Telecommunication Systems, pp. 1–12 (2009) Siddiqua, T., Gurumurthi, S.: Balancing soft error coverage with lifetime reliability in redundantly multithreaded processors. In: 2009 IEEE International Symposium on Modeling, Analysis Simulation of Computer and Telecommunication Systems, pp. 1–12 (2009)
28.
Zurück zum Zitat Slegel, T., Averill R.M., I., Check, M., et. al: IBM’s S/390 G5 Microprocessor Design. In: IEEE Micro, pp. 12–23 (1999) Slegel, T., Averill R.M., I., Check, M., et. al: IBM’s S/390 G5 Microprocessor Design. In: IEEE Micro, pp. 12–23 (1999)
29.
Zurück zum Zitat Somers, J.: Stratus ftserver–intel fault tolerant platform. Intel Developer Forum (2002) Somers, J.: Stratus ftserver–intel fault tolerant platform. Intel Developer Forum (2002)
30.
Zurück zum Zitat Stearley, J., Ferreira, K., Robinson, D., et al.: Does partial replication pay off? In: IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W) (2012) Stearley, J., Ferreira, K., Robinson, D., et al.: Does partial replication pay off? In: IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W) (2012)
31.
Zurück zum Zitat The Opportunities and Challenges of Exascale Computing. Tech. rep., Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee (2010) The Opportunities and Challenges of Exascale Computing. Tech. rep., Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee (2010)
33.
Zurück zum Zitat Vadlamani, R., Zhao, J., Burleson, W., Tessier, R.: Multicore soft error rate stabilization using adaptive dual modular redundancy. In: Proceedings of the Conference on Design, Automation and Test in Europe, DATE ’10, pp. 27–32 (2010) Vadlamani, R., Zhao, J., Burleson, W., Tessier, R.: Multicore soft error rate stabilization using adaptive dual modular redundancy. In: Proceedings of the Conference on Design, Automation and Test in Europe, DATE ’10, pp. 27–32 (2010)
34.
Zurück zum Zitat Vijaykumar, T., Pomeranz, I., Cheng, K.: Transient-fault recovery using simultaneous multithreading. In: 29th Annual International Symposium on Computer Architecture, pp. 87–98 (2002) Vijaykumar, T., Pomeranz, I., Cheng, K.: Transient-fault recovery using simultaneous multithreading. In: 29th Annual International Symposium on Computer Architecture, pp. 87–98 (2002)
35.
Zurück zum Zitat von Neumann, J.: Probabilistic logics and the synthesis of reliable organisms from unreliable components. In Automata Studies, pp. 43–98. ACM, New York, NY (1956) von Neumann, J.: Probabilistic logics and the synthesis of reliable organisms from unreliable components. In Automata Studies, pp. 43–98. ACM, New York, NY (1956)
36.
Zurück zum Zitat Wang, C., Kim, H., Wu, Y., Ying, V.: Compiler-managed software-based redundant multi-threading for transient fault detection. In: International Symposium on Code Generation and Optimization, pp. 244–258 (2007). doi:10.1109/CGO.2007.7 Wang, C., Kim, H., Wu, Y., Ying, V.: Compiler-managed software-based redundant multi-threading for transient fault detection. In: International Symposium on Code Generation and Optimization, pp. 244–258 (2007). doi:10.​1109/​CGO.​2007.​7
37.
Zurück zum Zitat Zhang, Y., Lee, J.W., Johnson, N.P., August, D.I.: DAFT: Decoupled acyclic fault tolerance. In: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT ’10, pp. 87–98 (2010) Zhang, Y., Lee, J.W., Johnson, N.P., August, D.I.: DAFT: Decoupled acyclic fault tolerance. In: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT ’10, pp. 87–98 (2010)
Metadaten
Titel
RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading
verfasst von
Saurabh Hukerikar
Keita Teranishi
Pedro C. Diniz
Robert F. Lucas
Publikationsdatum
11.02.2017
Verlag
Springer US
Erschienen in
International Journal of Parallel Programming / Ausgabe 2/2018
Print ISSN: 0885-7458
Elektronische ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-017-0492-3

Weitere Artikel der Ausgabe 2/2018

International Journal of Parallel Programming 2/2018 Zur Ausgabe