Cluster Computing

Published: 21 March 2019

Failure prediction using machine learning in a virtualised HPC system and application

Authors: Bashir Mohammed, Irfan Awan, Hassan Ugail, Muhammad Younas

Published in: Cluster Computing | Issue 2/2019

Abstract

Failure is an increasingly important issue in high performance computing and cloud systems. As large-scale systems continue to grow in scale and complexity, mitigating the impact of failure and providing accurate predictions with sufficient lead time remains a challenging research problem. Traditional fault-tolerance strategies such as regular checkpointing and replication are no longer adequate because of the emerging complexities of high performance computing systems. This calls for an effective, proactive failure management approach aimed at minimizing the effect of failure within the system. Machine learning techniques can learn from past information to predict future patterns of behaviour, which makes it possible to predict potential system failures more accurately. In this paper, we explore the predictive abilities of machine learning by applying a number of algorithms to improve the accuracy of failure prediction. We have developed a failure prediction model using time series and machine learning, and performed comparative tests of prediction accuracy. The primary algorithms we considered are the support vector machine (SVM), random forest (RF), k-nearest neighbors (KNN), classification and regression trees (CART) and linear discriminant analysis (LDA). Experimental results indicate that our model achieves an average prediction accuracy of 90% when using SVM, which is more accurate and effective than the other algorithms. This finding implies that our method can effectively predict future system and application failures within the system.
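The abstract does not give implementation details, so the following is only a minimal sketch of the general idea: turning a monitored time series into fixed-width windows and classifying each window as "failure ahead" or "no failure ahead". It uses a plain k-nearest-neighbours vote, one of the five algorithm families named above, written with the Python standard library only. The synthetic load trace, the function names (`synthetic_trace`, `make_windows`, `knn_predict`, `accuracy`) and all parameter values are assumptions made for illustration; they are not the authors' data, features or code.

```python
import random
from collections import Counter


def synthetic_trace(n, seed):
    """Hypothetical load metric that ramps up over the 5 steps before each failure."""
    rng = random.Random(seed)
    series, labels = [], []
    countdown = rng.randint(20, 40)  # steps until the next failure
    for _ in range(n):
        ramp = 0.12 * max(0, 6 - countdown)  # zero until 5 steps before a failure
        series.append(0.3 + ramp + rng.gauss(0, 0.03))
        labels.append(1 if countdown == 0 else 0)  # 1 marks a failure step
        countdown = rng.randint(20, 40) if countdown == 0 else countdown - 1
    return series, labels


def make_windows(series, labels, width, lead=1):
    """Pair each window of `width` past values with the label `lead` steps after it."""
    samples = []
    for end in range(width, len(series) - lead + 1):
        samples.append((series[end - width:end], labels[end + lead - 1]))
    return samples


def knn_predict(train, query, k):
    """Majority vote among the k training windows closest to `query`."""
    def sq_dist(sample):
        return sum((a - b) ** 2 for a, b in zip(sample[0], query))

    nearest = sorted(train, key=sq_dist)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k):
    """Fraction of test windows whose predicted label matches the true label."""
    hits = sum(knn_predict(train, window, k) == label for window, label in test)
    return hits / len(test)


if __name__ == "__main__":
    # Separate seeds give independent training and evaluation traces.
    train = make_windows(*synthetic_trace(600, seed=1), width=5)
    test = make_windows(*synthetic_trace(300, seed=2), width=5)
    print(f"KNN accuracy on the synthetic trace: {accuracy(train, test, k=3):.3f}")
```

Because failures are rare in such a trace, plain accuracy can look high even for a model that never predicts a failure, so a fuller comparison would also report recall on the failure class. The `lead` parameter is one simple way to express the lead time mentioned in the abstract: a larger value asks the classifier to flag a failure further in advance.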

Title: Failure prediction using machine learning in a virtualised HPC system and application
Authors: Bashir Mohammed, Irfan Awan, Hassan Ugail, Muhammad Younas
Publication date: 21 March 2019
Publisher: Springer US
Published in: Cluster Computing, Issue 2/2019
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI: https://doi.org/10.1007/s10586-019-02917-1
