2023 | OriginalPaper | Chapter

A Critical Review of Faults in Cloud Computing: Types, Detection, and Mitigation Schemes

Authors: Ramandeep Kaur, V. Revathi

Published in: Intelligent Systems and Machine Learning

Publisher: Springer Nature Switzerland

Abstract

The continuous rise in demand for services in large-scale distributed systems led to the development of cloud computing (CC). Because it provides a combination of various software resources, CC is considered dynamically scalable. However, due to the cloud’s dynamic environment, a variety of unanticipated problems and faults occur that hinder CC performance. Fault tolerance refers to a platform’s capacity to respond smoothly to unanticipated hardware or software failures. Failures must be analyzed and handled efficiently in cloud computing in order to achieve high accuracy and reliability. Over the years, a significant number of techniques and approaches have been proposed for detecting faults in CC and for increasing its fault tolerance. In this review paper, we first provide a brief overview of cloud computing systems, their architecture, and their working mechanism. The services provided by cloud computing and the issues it faces are also highlighted. We then discuss the taxonomy of faults that occur in the CC environment along with their mitigation techniques. Our analysis indicates that traditional fault detection methods did not generate effective results, which led to poor performance in cloud environments. Therefore, many authors proposed Machine Learning (ML) based models for fault detection in CC. Nonetheless, ML algorithms were not able to handle large volumes of data, so Deep Learning (DL) was introduced into fault detection approaches. It has also been observed that the performance of DL methods can be enhanced significantly by combining them with optimization algorithms. Some of the recently proposed fault detection and fault tolerance systems based on ML, DL, and optimization are reviewed in this paper.

Metadata
Title
A Critical Review of Faults in Cloud Computing: Types, Detection, and Mitigation Schemes
Authors
Ramandeep Kaur
V. Revathi
Copyright Year
2023
DOI
https://doi.org/10.1007/978-3-031-35081-8_17
