The Journal of Supercomputing

Publication date: 24-06-2023

Analyzing and predicting job failures from HPC system log

Authors: Ju-Won Park, Xin Huang, Chul-Ho Lee

Published in: The Journal of Supercomputing | Issue 1/2024

Abstract

In this paper, we analyze the scheduler log of a production supercomputer that contains complete job information, in contrast to many existing publicly available HPC logs, which contain only limited job information. We not only provide an in-depth statistical analysis of failed jobs from the scheduler log, but also demonstrate how such a detailed log can be leveraged to predict job failures. For the latter, we first conduct a feature analysis based on the framework of ‘weight of evidence’ and ‘information value’ to uncover the impact of each workload attribute (feature) on the failure or success of a job, thereby enabling us to identify key features. We then conduct a comparative performance study of six data-driven machine learning models for predicting job failures in an HPC system based on the scheduler log. Our experimental results show that tree-based models exhibit superior performance in terms of both prediction accuracy and computational cost. We also demonstrate that our feature analysis improves the computational efficiency of each machine learning model without degrading its prediction performance.

Tachyon is the fourth supercomputer at the National Supercomputing Center of the Korea Institute of Science and Technology Information; it provided computing resources for large-scale national research until 2017. While the fifth supercomputer, Nurion, is currently in operation, its logs have not yet been made public for security reasons. Thus, we focus on the log of Tachyon in this work.

Note that a job can run on multiple nodes simultaneously, so it can be associated with multiple hostnames. Thus, when computing the IV of hostname, we use the hostname of the first node associated with each job.

In general, the values of IV have the following implications [4]: \(\text {IV} < 0.03\) (poor predictor), \(0.03< \text {IV} < 0.1\) (weak predictor), \(0.1< \text {IV} < 0.3\) (average predictor), \(0.3< \text {IV} < 0.5\) (strong predictor), and \(0.5 < \text {IV}\) (very strong predictor).
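As a rough illustration of how these thresholds might be applied, the sketch below computes the IV of one categorical feature and maps it to the labels above. It uses the common credit-scoring definitions of weight of evidence and information value, which are assumed here rather than taken from the paper. The function names, the additive `smoothing` constant, the handling of values that fall exactly on a threshold, and the toy job records are all hypothetical.

```python
import math
from collections import Counter


def information_value(values, failed, smoothing=0.5):
    """Return (IV, per-category WoE) for one categorical feature.

    values: one feature value per job; failed: True if that job failed.
    smoothing is an assumed additive constant that keeps WoE finite when
    a category contains only failed or only successful jobs.
    """
    fail_counts = Counter(v for v, f in zip(values, failed) if f)
    ok_counts = Counter(v for v, f in zip(values, failed) if not f)
    categories = set(fail_counts) | set(ok_counts)
    total_fail = sum(fail_counts.values()) + smoothing * len(categories)
    total_ok = sum(ok_counts.values()) + smoothing * len(categories)

    iv = 0.0
    woe = {}
    for c in categories:
        p_ok = (ok_counts[c] + smoothing) / total_ok        # share of successful jobs
        p_fail = (fail_counts[c] + smoothing) / total_fail  # share of failed jobs
        woe[c] = math.log(p_ok / p_fail)  # sign convention is an assumption
        iv += (p_ok - p_fail) * woe[c]
    return iv, woe


def iv_strength(iv):
    """Map an IV to the labels in the text; boundary handling is assumed."""
    if iv < 0.03:
        return "poor"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "average"
    if iv < 0.5:
        return "strong"
    return "very strong"


# Hypothetical jobs: (hostnames of the allocated nodes, job failed?).
jobs = [
    (["node01", "node02"], True),
    (["node01"], True),
    (["node02", "node03"], False),
    (["node02"], False),
    (["node03"], False),
    (["node01", "node03"], False),
]
# As described above, a multi-node job is represented by its first hostname.
first_host = [hosts[0] for hosts, _ in jobs]
failed = [f for _, f in jobs]
iv, woe = information_value(first_host, failed)
print(round(iv, 3), iv_strength(iv))
```

Each term of the sum is non-negative, so IV is zero only when a feature's value distribution is the same for failed and successful jobs. The smoothing constant shifts the result slightly on small samples and would matter less on a full scheduler log.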

1. Abdou HA, Pointon J (2011) Credit scoring, statistical techniques and evaluation criteria: a review of the literature. Intell Syst Accounting Financ Manag 18(2–3):59–88
2. Abeyratne N, Chen HM, Oh B et al (2016) Checkpointing exascale memory systems with existing memory technologies. In: International Symposium on Memory Systems (MEMSYS’16), ACM, pp 18–29
3. Alharthi KA, Jhumka A, Di S et al (2022) Clairvoyant: a log-based transformer-decoder for failure prediction in large-scale systems. In: Proceedings of the 36th ACM International Conference on Supercomputing, pp 1–14
4. Bailey M (2001) Credit scoring: the principles and practicalities. White Box Publishing, Bristol
5. Benoit A, Le Fèvre V, Raghavan P et al (2020) Design and comparison of resilient scheduling heuristics for parallel jobs. In: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, pp 567–576
6. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press, Oxford
7. Bishop CM, Nasrabadi NM (2006) Pattern recognition and machine learning. Springer, Berlin
8. Borges G, David M, Gomes J et al (2007) Sun Grid Engine, a new scheduler for EGEE middleware. In: IBERGRID–Iberian Grid Infrastructure Conference
9. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
10. Burkov A (2019) The hundred-page machine learning book. Andriy Burkov, Quebec City
11. Chen T, Guestrin C (2016) Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 785–794
12. Cirne W, Berman F (2001) A comprehensive model of the supercomputer workload. In: IEEE International Workshop on Workload Characterization, pp 140–148
13. Das A, Mueller F, Rountree B (2020) Aarohi: making real-time node failure prediction feasible. In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), IEEE, pp 1092–1101
14. Di S, Gupta R, Snir M et al (2017) Logaider: a tool for mining potential correlations of HPC log events. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’17), IEEE, pp 442–451
15. Dongarra J, Herault T, Robert Y (2015) Fault tolerance techniques for high-performance computing. Springer, Cham, pp 3–85
16. Egwutuoha IP, Levy D, Selic B et al (2013) A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. J Supercomput 65(3):1302–1326
17. Feitelson D (2022) Parallel workloads archive and standard workload format. http://www.cs.huji.ac.il/labs/parallel/workload, Accessed Nov. 25, 2022
18. Feitelson DG, Tsafrir D, Krakov D (2014) Experience with using the parallel workloads archive. J Parallel Distrib Comput 74(10):2967–2982
19. Foss S, Korshunov D, Zachary S (2013) An introduction to heavy-tailed and subexponential distributions. Springer series in operations research and financial engineering, 2nd edn. Springer, New York
20. Gainaru A, Cappello F, Snir M et al (2012) Fault prediction under the microscope: a closer look into HPC systems. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC’12), IEEE, pp 1–11
21. Gotoda S, Ito M, Shibata N (2012) Task scheduling algorithm for multicore processor system for minimizing recovery time in case of single node fault. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’12), IEEE, pp 260–267
22. Gupta S, Tiwari D, Jantzi C et al (2015) Understanding and exploiting spatial properties of system failures on extreme-scale HPC systems. In: IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’15), IEEE, pp 37–44
23. He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
24. Heien E, LaPine D, Kondo D et al (2011) Modeling and tolerating heterogeneous failures in large parallel systems. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC’11)
25. Hothorn T, Zeileis A (2015) partykit: a modular toolkit for recursive partytioning in R. J Mach Learn Res 16(1):3905–3909
26. Huang S, Liu Y, Fung C et al (2020) Hitanomaly: hierarchical transformers for anomaly detection in system log. IEEE Trans Netw Serv Manage 17(4):2064–2076
27. Jin H, Ke T, Chen Y et al (2012) Checkpointing orchestration: toward a scalable HPC fault-tolerant environment. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’12), IEEE, pp 276–283
28. Lai CD, Xie M, Barlow RE (2006) Stochastic ageing and dependence for reliability. Springer-Verlag, New York
29. León B, Franco D, Rexachs D et al (2021) Analysis of parallel application checkpoint storage for system configuration. J Supercomput 77(5):4582–4617
30. León B, Méndez S, Franco D et al (2022) A model of checkpoint behavior for applications that have i/o. J Supercomput 78(13):15404–15436
31. Li H, Groep D, Wolters L (2004) Workload characteristics of a multi-cluster supercomputer. In: Workshop on Job Scheduling Strategies for Parallel Processing, pp 176–193
32. Li H, Groep D, Wolters L et al (2006) Job failure analysis and its implications in a large-scale production grid. In: IEEE International Conference on E-Science and Grid Computing (E-Science’06), IEEE, pp 27–27
33. Loh WY (2011) Classification and regression trees. Wiley Interdiscip Rev Data Min Knowl Discov 1(1):14–23
34. Meng W, Liu Y, Zhang S et al (2021) Logclass: anomalous log identification and classification with partial labels. IEEE Trans Netw Serv Manage 18(2):1870–1884
35. Min JH, Lee YC (2008) A practical approach to credit scoring. Expert Syst Appl 35(4):1762–1770
36. Naksinehaboon N, Liu Y, Leangsuksun C et al (2008) Reliability-aware approach: An incremental checkpoint/restart model in HPC environments. In: IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid’08), IEEE, pp 783–788
37. Nanni L, Lumini A (2009) An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring. Expert Syst Appl 36(2):3028–3033
38. Nguyen AT, Reiter S, Rigo P (2014) A review on simulation-based optimization methods applied to building performance analysis. Appl Energy 113:1043–1058
39. Oliner A, Stearley J (2007) What supercomputers say: a study of five system logs. In: IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’07), pp 575–584
40. Parasyris K, Keller K, Bautista-Gomez L et al (2020) Checkpoint restart support for heterogeneous hpc applications. In: 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pp 242–251
41. Park JW (2019) Queue waiting time prediction for large-scale high-performance computing system. In: 2019 International Conference on High Performance Computing & Simulation (HPCS), IEEE, pp 850–855
42. Park JW, Kim E (2017) Runtime prediction of parallel applications with workload-aware clustering. J Supercomput 73(11):4635–4651
43. Park JW, Kim E (2018) Exploiting the behavior of the failed job in high performance computing system. In: 2018 18th International Conference on Computational Science and Applications (ICCSA), IEEE, pp 1–3
44. Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830
45. Rodrigo Álvarez GP, Östberg PO, Elmroth E et al (2015) HPC system lifetime story: Workload characterization and evolutionary analyses on NERSC systems. In: ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC’15), pp 57–60
46. Roux NL, Schmidt M, Bach F (2012) A stochastic gradient method with an exponential convergence rate for finite training sets. In: NIPS’12 - 26th Annual Conference on Neural Information Processing Systems
47. Schneider D (2022) The exascale era is upon us: the frontier supercomputer may be the first to reach 1,000,000,000,000,000,000 operations per second. IEEE Spectr 59(1):34–35
48. Schroeder B, Gibson G (2010) A large-scale study of failures in high-performance computing systems. IEEE Trans Depend Secur Comput 7(4):337–350
49. Tiwari D, Gupta S, Vazhkudai SS (2014) Lazy checkpointing: Exploiting temporal locality in failures to mitigate checkpointing overheads on extreme-scale systems. In: Proceedings of IEEE/IFIP DSN, pp 25–36
50. Wu M, Sun XH, Jin H (2007) Performance under failures of high-end computing. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC’07), ACM, p 48
51. Yoon J, Hong T, Park C et al (2015) Stable HPC cluster management scheme through performance evaluation. In: Park JJJH, Stojmenovic I, Jeong HY et al (eds) Computer science and its applications. Springer, Berlin, pp 1017–1023
52. You H, Zhang H (2012) Comprehensive workload analysis and modeling of a petascale supercomputer. In: Workshop on Job Scheduling Strategies for Parallel Processing, pp 253–271
53. Yuan Y, Wu Y, Wang Q et al (2012) Job failures in high performance computing systems: a large-scale empirical study. Comput Math Appl 63(2):365–377
54. Zheng Z, Yu L, Tang W et al (2011) Co-analysis of RAS log and job log on Blue Gene/P. In: IEEE International Parallel & Distributed Processing Symposium (IPDPS’11), pp 840–851

Title: Analyzing and predicting job failures from HPC system log
Authors: Ju-Won Park
Xin Huang
Chul-Ho Lee
Publication date: 24-06-2023
Publisher: Springer US
Published in: The Journal of Supercomputing / Issue 1/2024
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-023-05482-y
