Published in: The Journal of Supercomputing 1/2024

24-06-2023

Analyzing and predicting job failures from HPC system log

Authors: Ju-Won Park, Xin Huang, Chul-Ho Lee

Abstract

In this paper, we analyze the scheduler log of a production supercomputer that contains complete job information, in contrast to many existing (publicly available) HPC logs that contain only limited job information. We not only provide an in-depth statistical analysis of failed jobs from the scheduler log, but also demonstrate how the scheduler log, which is available in a detailed form, can be leveraged to predict job failures. For the latter, we first conduct a feature analysis based on the framework of ‘weight of evidence’ and ‘information value’ to uncover the impact of each workload attribute (feature) on the failure or success of a job, thereby enabling us to identify key features. We then conduct a comparative performance study of six data-driven machine learning models for predicting job failures in an HPC system based on the scheduler log. Our experimental results show that tree-based models exhibit superior performance in terms of both prediction accuracy and computational cost. We also demonstrate that our feature analysis improves the computational efficiency of each machine learning model without degrading its prediction performance.

Footnotes
1
Tachyon is the fourth supercomputer at the National Supercomputing Center of the Korea Institute of Science and Technology Information; it provided computing resources to support large-scale national research until 2017. While the fifth supercomputer, Nurion, is currently in operation, its logs are not yet open to the public for security reasons. Thus, we focus on the log of Tachyon in this work.
 
2
Note that a job can run on multiple nodes simultaneously, so it can be associated with multiple hostnames. Thus, when computing the IV of hostname, we use the hostname of the first node associated with each job.
 
3
In general, the values of IV have the following implications [4]: \(\text {IV} < 0.03\) (poor predictor), \(0.03< \text {IV} < 0.1\) (weak predictor), \(0.1< \text {IV} < 0.3\) (average predictor), \(0.3< \text {IV} < 0.5\) (strong predictor), and \(0.5 < \text {IV}\) (very strong predictor).
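
The thresholds in footnote 3 can be illustrated with a minimal sketch of the usual credit-scoring definitions of weight of evidence (WoE) and information value (IV) for one categorical feature. Everything in it is an assumption made for illustration rather than the authors' procedure: the function names `information_value` and `iv_label`, the additive smoothing `eps` for empty cells, the orientation of WoE (log of success share over failure share), and the choice to assign boundary values such as exactly 0.03 to the upper band, which the footnote leaves unspecified.

```python
import math
from collections import Counter


def information_value(values, failed, eps=0.5):
    """Return (iv, woe_by_category) for one categorical feature.

    values: category label of the feature, one per job
    failed: True if the corresponding job failed, else False
    eps:    additive smoothing so empty cells do not produce log(0)
            (an assumption of this sketch, not taken from the paper)
    """
    values = list(values)
    failed = list(failed)
    fail_counts = Counter(v for v, f in zip(values, failed) if f)
    ok_counts = Counter(v for v, f in zip(values, failed) if not f)
    categories = set(values)

    # Smoothed totals, so the per-category shares still sum to 1.
    n_fail = sum(fail_counts.values()) + eps * len(categories)
    n_ok = sum(ok_counts.values()) + eps * len(categories)

    iv = 0.0
    woe = {}
    for c in categories:
        p_ok = (ok_counts[c] + eps) / n_ok        # share of successful jobs in c
        p_fail = (fail_counts[c] + eps) / n_fail  # share of failed jobs in c
        woe[c] = math.log(p_ok / p_fail)
        iv += (p_ok - p_fail) * woe[c]
    return iv, woe


def iv_label(iv):
    """Map an IV to the bands of footnote 3 (boundary handling is assumed)."""
    if iv < 0.03:
        return "poor"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "average"
    if iv < 0.5:
        return "strong"
    return "very strong"


if __name__ == "__main__":
    # Hypothetical toy data: queue "a" mostly fails, queue "b" mostly succeeds.
    queue = ["a"] * 10 + ["b"] * 10
    failed = [True] * 9 + [False] + [False] * 9 + [True]
    iv, woe = information_value(queue, failed)
    print(iv_label(iv), woe)
```

Because each term of IV is a product of two factors with the same sign, IV is non-negative and does not depend on which class is treated as the "good" one; only the sign of the individual WoE values flips.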
 
Literature
1.
Abdou HA, Pointon J (2011) Credit scoring, statistical techniques and evaluation criteria: a review of the literature. Intell Syst Accounting Financ Manag 18(2–3):59–88
2.
Abeyratne N, Chen HM, Oh B, et al (2016) Checkpointing exascale memory systems with existing memory technologies. In: International Symposium on Memory Systems (MEMSYS’16), ACM, pp 18–29
3.
Alharthi KA, Jhumka A, Di S, et al (2022) Clairvoyant: a log-based transformer-decoder for failure prediction in large-scale systems. In: Proceedings of the 36th ACM International Conference on Supercomputing, pp 1–14
4.
Bailey M (2001) Credit scoring: the principles and practicalities. White Box Publishing, Bristol
5.
Benoit A, Le Fèvre V, Raghavan P, et al (2020) Design and comparison of resilient scheduling heuristics for parallel jobs. In: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, pp 567–576
6.
Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press, Oxford
7.
Bishop CM, Nasrabadi NM (2006) Pattern recognition and machine learning. Springer, Berlin
8.
Borges G, David M, Gomes J, et al (2007) Sun Grid Engine, a new scheduler for EGEE middleware. In: IBERGRID–Iberian Grid Infrastructure Conference
10.
Burkov A (2019) The hundred-page machine learning book. Andriy Burkov, Quebec City
11.
Chen T, Guestrin C (2016) Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 785–794
12.
Cirne W, Berman F (2001) A comprehensive model of the supercomputer workload. In: IEEE International Workshop on Workload Characterization, pp 140–148
13.
Das A, Mueller F, Rountree B (2020) Aarohi: making real-time node failure prediction feasible. In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), IEEE, pp 1092–1101
14.
Di S, Gupta R, Snir M, et al (2017) Logaider: a tool for mining potential correlations of HPC log events. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’17), IEEE, pp 442–451
15.
Dongarra J, Herault T, Robert Y (2015) Fault tolerance techniques for high-performance computing. Springer, Cham, pp 3–85
16.
Egwutuoha IP, Levy D, Selic B et al (2013) A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. J Supercomput 65(3):1302–1326
18.
Feitelson DG, Tsafrir D, Krakov D (2014) Experience with using the parallel workloads archive. J Parallel Distrib Comput 74(10):2967–2982
19.
Foss S, Korshunov D, Zachary S (2013) An introduction to heavy-tailed and subexponential distributions. Springer series in operations research and financial engineering, 2nd edn. Springer, New York
20.
Gainaru A, Cappello F, Snir M, et al (2012) Fault prediction under the microscope: a closer look into HPC systems. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC’12), IEEE, pp 1–11
21.
Gotoda S, Ito M, Shibata N (2012) Task scheduling algorithm for multicore processor system for minimizing recovery time in case of single node fault. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’12), IEEE, pp 260–267
22.
Gupta S, Tiwari D, Jantzi C, et al (2015) Understanding and exploiting spatial properties of system failures on extreme-scale HPC systems. In: IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’15), IEEE, pp 37–44
23.
He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
24.
Heien E, LaPine D, Kondo D, et al (2011) Modeling and tolerating heterogeneous failures in large parallel systems. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC’11)
25.
Hothorn T, Zeileis A (2015) partykit: a modular toolkit for recursive partytioning in R. J Mach Learn Res 16(1):3905–3909
26.
Huang S, Liu Y, Fung C et al (2020) Hitanomaly: hierarchical transformers for anomaly detection in system log. IEEE Trans Netw Serv Manage 17(4):2064–2076
27.
Jin H, Ke T, Chen Y, et al (2012) Checkpointing orchestration: toward a scalable HPC fault-tolerant environment. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’12), IEEE, pp 276–283
28.
Lai CD, Xie M, Barlow RE (2006) Stochastic ageing and dependence for reliability. Springer-Verlag, New York
29.
León B, Franco D, Rexachs D et al (2021) Analysis of parallel application checkpoint storage for system configuration. J Supercomput 77(5):4582–4617
30.
León B, Méndez S, Franco D et al (2022) A model of checkpoint behavior for applications that have i/o. J Supercomput 78(13):15404–15436
31.
Li H, Groep D, Wolters L (2004) Workload characteristics of a multi-cluster supercomputer. In: Workshop on Job Scheduling Strategies for Parallel Processing, pp 176–193
32.
Li H, Groep D, Wolters L, et al (2006) Job failure analysis and its implications in a large-scale production grid. In: IEEE International Conference on E-Science and Grid Computing (E-Science’06), IEEE, pp 27–27
33.
Loh WY (2011) Classification and regression trees. Wiley Interdiscip Rev Data Min Knowl Discov 1(1):14–23
34.
Meng W, Liu Y, Zhang S et al (2021) Logclass: anomalous log identification and classification with partial labels. IEEE Trans Netw Serv Manage 18(2):1870–1884
35.
Min JH, Lee YC (2008) A practical approach to credit scoring. Expert Syst Appl 35(4):1762–1770
36.
Naksinehaboon N, Liu Y, Leangsuksun C, et al (2008) Reliability-aware approach: An incremental checkpoint/restart model in HPC environments. In: IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid’08), IEEE, pp 783–788
37.
Nanni L, Lumini A (2009) An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring. Expert Syst Appl 36(2):3028–3033
38.
Nguyen AT, Reiter S, Rigo P (2014) A review on simulation-based optimization methods applied to building performance analysis. Appl Energy 113:1043–1058
39.
Oliner A, Stearley J (2007) What supercomputers say: a study of five system logs. In: IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’07), pp 575–584
40.
Parasyris K, Keller K, Bautista-Gomez L, et al (2020) Checkpoint restart support for heterogeneous hpc applications. In: 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pp 242–251
41.
Park JW (2019) Queue waiting time prediction for large-scale high-performance computing system. In: 2019 International Conference on High Performance Computing & Simulation (HPCS), IEEE, pp 850–855
42.
Park JW, Kim E (2017) Runtime prediction of parallel applications with workload-aware clustering. J Supercomput 73(11):4635–4651
43.
Park JW, Kim E (2018) Exploiting the behavior of the failed job in high performance computing system. In: 2018 18th International Conference on Computational Science and Applications (ICCSA), IEEE, pp 1–3
44.
Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830
45.
Rodrigo Álvarez GP, Östberg PO, Elmroth E, et al (2015) HPC system lifetime story: Workload characterization and evolutionary analyses on NERSC systems. In: ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC’15), pp 57–60
46.
Roux NL, Schmidt M, Bach F (2012) A stochastic gradient method with an exponential convergence rate for finite training sets. In: NIPS’12 - 26th Annual Conference on Neural Information Processing Systems
47.
Schneider D (2022) The exascale era is upon us: the frontier supercomputer may be the first to reach 1,000,000,000,000,000,000 operations per second. IEEE Spectr 59(1):34–35
48.
Schroeder B, Gibson G (2010) A large-scale study of failures in high-performance computing systems. IEEE Trans Depend Secur Comput 7(4):337–350
49.
Tiwari D, Gupta S, Vazhkudai SS (2014) Lazy checkpointing: Exploiting temporal locality in failures to mitigate checkpointing overheads on extreme-scale systems. In: Proceedings of IEEE/IFIP DSN, pp 25–36
50.
Wu M, Sun XH, Jin H (2007) Performance under failures of high-end computing. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC’07), ACM, p 48
51.
Yoon J, Hong T, Park C et al (2015) Stable HPC cluster management scheme through performance evaluation. In: Park JJJH, Stojmenovic I, Jeong HY et al (eds) Computer science and its applications. Springer, Berlin, pp 1017–1023
52.
You H, Zhang H (2012) Comprehensive workload analysis and modeling of a petascale supercomputer. In: Workshop on Job Scheduling Strategies for Parallel Processing, pp 253–271
53.
Yuan Y, Wu Y, Wang Q et al (2012) Job failures in high performance computing systems: a large-scale empirical study. Comput Math Appl 63(2):365–377
54.
Zheng Z, Yu L, Tang W, et al (2011) Co-analysis of RAS log and job log on Blue Gene/P. In: IEEE International Parallel & Distributed Processing Symposium (IPDPS’11), pp 840–851
Metadata
Title
Analyzing and predicting job failures from HPC system log
Authors
Ju-Won Park
Xin Huang
Chul-Ho Lee
Publication date
24-06-2023
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 1/2024
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-023-05482-y
