Published in: International Journal of Parallel Programming, Issue 4/2018

30.09.2017

A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems

Authors: Jian Gao, Hongmei Wei, Kang Yu, Peng Qing

Abstract

Fault localization has become an increasingly challenging issue in high-performance computing (HPC) systems. Various techniques have been applied to HPC systems; however, as these systems scale out, the effectiveness of existing techniques deteriorates rapidly. In this context, we propose a message-passing based fault localization framework, MPFL, which provides a lightweight distributed service using tree-based fault detection (TFD) and tree-based fault analysis (TFA) algorithms. In essence, MPFL serves as a fault localization engine within message-passing libraries by enabling system middleware, such as the job scheduler, to provide information about abnormal behavior. We present the details of the MPFL framework, including the implementation of TFD and TFA, and we develop a prototype of the fault localization engine within MVAPICH2. The experimental evaluation, performed on a typical HPC cluster with 10 computing nodes, demonstrates the capability of MPFL and shows that the MPFL service does not affect application performance in practice.

Metadata
Title: A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems
Authors: Jian Gao, Hongmei Wei, Kang Yu, Peng Qing
Publication date: 30.09.2017
Publisher: Springer US
Published in: International Journal of Parallel Programming, Issue 4/2018
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI: https://doi.org/10.1007/s10766-017-0526-x