International Journal of Parallel Programming

Published: 30 September 2017

A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems

Authors: Jian Gao, Hongmei Wei, Kang Yu, Peng Qing

Published in: International Journal of Parallel Programming | Issue 4/2018

Abstract

Fault localization has become an increasingly challenging issue in high-performance computing (HPC) systems. Various techniques have been applied to HPC systems; however, as these systems scale out, the effectiveness of existing techniques deteriorates rapidly. In this context, we propose a message-passing based fault localization framework, namely MPFL, which provides a lightweight distributed service using tree-based fault detection (TFD) and tree-based fault analysis (TFA) algorithms. In essence, MPFL serves as a fault localization engine within message-passing libraries, enabling system middleware such as the job scheduler to supply information about abnormal behavior. We present the details of the MPFL framework, including the implementation of TFD and TFA. Further, we develop a prototype of the fault localization engine within MVAPICH2. The experimental evaluation is performed on a typical HPC cluster with 10 computing nodes; the results demonstrate the capability of MPFL and show that the MPFL service does not affect application performance in practice.
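The abstract names tree-based fault detection but does not describe how it works. The following is only a minimal sketch of the general idea of arranging processes in a tree so that each parent probes its own children, which keeps the per-node probing cost bounded by the fan-out. It is not the authors' TFD algorithm. The function names `children` and `detect`, the `fanout` parameter, and the `alive` set (a stand-in for a real liveness probe such as a heartbeat) are all assumptions made for illustration.

```python
from typing import List, Set


def children(rank: int, size: int, fanout: int = 2) -> List[int]:
    """Ranks of the children of `rank` in a complete k-ary tree over 0..size-1."""
    first = rank * fanout + 1
    return [c for c in range(first, first + fanout) if c < size]


def detect(rank: int, size: int, alive: Set[int], fanout: int = 2) -> List[int]:
    """Return the failed ranks found below `rank` (hypothetical sketch).

    `alive` simulates a liveness probe; a real system would send a
    heartbeat or similar message instead of consulting a set.
    """
    failed: List[int] = []
    for child in children(rank, size, fanout):
        if child not in alive:
            # The probe of this child timed out: report it as failed.
            failed.append(child)
        # Descend in both cases. For a live child this models the child
        # probing its own subtree; for a dead child it models the parent
        # taking over so the orphaned subtree is still covered.
        failed.extend(detect(child, size, alive, fanout))
    return failed


if __name__ == "__main__":
    # Ten simulated nodes, matching the cluster size in the abstract;
    # ranks 4 and 7 are assumed to have failed. Rank 0 is assumed alive.
    nodes = 10
    healthy = set(range(nodes)) - {4, 7}
    print(sorted(detect(0, nodes, healthy)))
```

With a fan-out of 2 and 10 ranks, the root probes ranks 1 and 2 directly and learns about deeper failures only through the reports aggregated up the tree.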

Title: A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems
Authors: Jian Gao, Hongmei Wei, Kang Yu, Peng Qing
Publication date: 30 September 2017
Publisher: Springer US
Published in: International Journal of Parallel Programming, Issue 4/2018
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI: https://doi.org/10.1007/s10766-017-0526-x
