Published in: The Journal of Supercomputing 16/2023

15-05-2023

An empirical study of major page faults for failure diagnosis in cluster systems

Authors: Edward Chuah, Arshad Jhumka, Sai Narasimhamurthy

Abstract

High-performance computing systems extensively record resource usage data and system logs, and parsing these data is an often-advocated basis for failure diagnosis. Major page faults are known to be one of the most common causes of performance problems in large cluster systems. We conduct an empirical study of major page faults on two large cluster systems using three regression algorithms: LASSO, Ridge and Elastic Net. To the best of our knowledge, no prior work has studied different regression models for diagnosing major page faults in a large cluster system. In this paper, we (a) propose an approach for diagnosing major page faults, and (b) evaluate the LASSO, Ridge and Elastic Net regression algorithms on real resource use data and system logs. As part of our contributions, we (a) compare the accuracy of the three regression algorithms, (b) identify the resource use counters that are correlated with major page faults and the system events that are correlated with page fault events, and (c) provide insights into major page faults and page fault events. Our work highlights empirical observations that could facilitate better handling of node failures in cluster systems.
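
The three regression techniques named in the abstract differ only in the penalty added to the least-squares objective. As a rough illustration, and not the authors' implementation, the sketch below fits all three with a single coordinate-descent routine on synthetic data. The counter names, the synthetic target, and the values of `alpha` and `l1_ratio` are assumptions made for this example; the paper's actual counters, data and tuning are not given on this page.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the update induced by the L1 penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, alpha, l1_ratio, n_iter=200):
    """Coordinate descent for the penalized least-squares objective

        (1/(2n)) * ||y - Xw||^2 + alpha * l1_ratio * ||w||_1
                                + 0.5 * alpha * (1 - l1_ratio) * ||w||^2

    l1_ratio=1.0 gives LASSO, l1_ratio=0.0 gives Ridge, and values in
    between give Elastic Net. Assumes y and the columns of X are centered
    (no intercept is fitted) and that no column of X is constant.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw, and w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (
                col_sq[j] + alpha * (1.0 - l1_ratio)
            )
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


def standardize(column):
    """Center a column to zero mean and scale it to unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / std for v in column]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical resource use counters; the names are illustrative only.
    names = ["swap_in", "mem_used", "cpu_user", "net_rx"]
    n = 200
    raw = {name: [rng.gauss(0, 1) for _ in range(n)] for name in names}
    # Synthetic "major page fault" count driven by the first two counters plus noise.
    faults = [
        3.0 * raw["swap_in"][i] + 1.0 * raw["mem_used"][i] + rng.gauss(0, 0.5)
        for i in range(n)
    ]
    cols = [standardize(raw[name]) for name in names]
    X = [[col[i] for col in cols] for i in range(n)]
    mean_faults = sum(faults) / n
    y = [v - mean_faults for v in faults]
    for label, l1_ratio in [("LASSO", 1.0), ("Elastic Net", 0.5), ("Ridge", 0.0)]:
        w = fit_elastic_net(X, y, alpha=0.3, l1_ratio=l1_ratio)
        print(label, {name: round(c, 3) for name, c in zip(names, w)})
```

In this formulation the L1 term can set a coefficient exactly to zero, which is why LASSO and Elastic Net are often used to select a small set of correlated counters, while Ridge only shrinks coefficients toward zero.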

Metadata
Title
An empirical study of major page faults for failure diagnosis in cluster systems
Authors
Edward Chuah
Arshad Jhumka
Sai Narasimhamurthy
Publication date
15-05-2023
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 16/2023
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-023-05366-1
