International Journal of Parallel Programming

Published: 11 February 2017

RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading

Authors: Saurabh Hukerikar, Keita Teranishi, Pedro C. Diniz, Robert F. Lucas

Published in: International Journal of Parallel Programming | Issue 2/2018

Abstract

In the presence of accelerated fault rates, which are projected to be the norm on future exascale systems, it will become increasingly difficult for high-performance computing (HPC) applications to accomplish useful computation. Due to the fault-oblivious nature of current HPC programming paradigms and execution environments, HPC applications are insufficiently equipped to deal with errors. We believe that HPC applications should be enabled with capabilities to actively search for and correct errors in their computations. The redundant multithreading (RMT) approach offers lightweight replicated execution streams of program instructions within the context of a single application process. However, the use of complete redundancy incurs significant overhead to the application performance.

In this paper we present RedThreads, an interface that provides application-level fault detection and correction based on RMT, but applies the thread-level redundancy adaptively. We describe the RedThreads syntax and semantics, and the supporting compiler infrastructure and runtime system. Our approach enables application programmers to scope the extent of redundant computation. Additionally, the runtime system permits the use of RMT to be dynamically enabled, or disabled, based on the resiliency needs of the application and the state of the system. Our experimental results demonstrate how adaptive RMT exploits programmer insight and runtime inference to dynamically navigate the trade-off space between an application’s resilience coverage and the associated performance overhead of redundant computation.
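The idea of adaptive redundancy described above can be illustrated with a minimal sketch. This is not the RedThreads syntax or runtime, neither of which is given in the abstract; the names `run_redundant`, `adaptive_call`, `FlakyDot`, and the `redundancy_enabled` policy callback are hypothetical, and Python threads are used only to show the structure. A computation is replicated across threads, the results are compared by majority vote, and a runtime policy decides whether the redundant path is taken at all.

```python
import threading
from collections import Counter


def run_redundant(fn, args, replicas=3):
    """Run fn(*args) in `replicas` threads and majority-vote on the results.

    Returns (value, fault_detected). Results must be hashable.
    Hypothetical helper, not part of any RedThreads API.
    """
    results = [None] * replicas

    def worker(slot):
        results[slot] = fn(*args)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(replicas)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    value, votes = Counter(results).most_common(1)[0]
    if votes <= replicas // 2:
        # Replicas disagree and no strict majority exists: detected, not correctable.
        raise RuntimeError("fault detected, but no majority result")
    return value, votes < replicas


def adaptive_call(fn, args, redundancy_enabled):
    """Use redundant execution only when the (assumed) policy callback says so."""
    if not redundancy_enabled():
        return fn(*args), False  # single execution: no detection, no overhead
    return run_redundant(fn, args)


class FlakyDot:
    """Dot product whose first invocation is corrupted (simulated soft error)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._calls = 0

    def __call__(self, xs, ys):
        with self._lock:
            self._calls += 1
            corrupt = self._calls == 1
        total = sum(x * y for x, y in zip(xs, ys))
        return total + 1 if corrupt else total
```

With redundancy enabled, the corrupted replica is outvoted and the fault is reported; with it disabled, the same fault passes silently but only one execution is paid for. That is the coverage-versus-overhead trade-off the abstract refers to.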

Title: RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading
Authors: Saurabh Hukerikar, Keita Teranishi, Pedro C. Diniz, Robert F. Lucas
Publication date: 11 February 2017
Publisher: Springer US
Published in: International Journal of Parallel Programming / Issue 2/2018
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI: https://doi.org/10.1007/s10766-017-0492-3
