
Open Access 17-10-2024

In-depth analysis of fault tolerant approaches integrated with load balancing and task scheduling

Authors: Sheikh Umar Mushtaq, Sophiya Sheikh, Sheikh Mohammad Idrees, Parvaz Ahmad Malla

Published in: Peer-to-Peer Networking and Applications | Issue 6/2024


Abstract

The article delves into the evolution of cloud computing over the past decade, emphasizing the critical role of fault tolerance in maintaining service reliability. It discusses the three main layers of cloud services and the necessity for fault tolerance mechanisms to address various types of faults, including network, physical, and process faults. The paper also explores the integration of fault tolerance with load balancing and task scheduling, highlighting the importance of these techniques in enhancing cloud performance. Additionally, it presents a general problem formulation for fault tolerance using replication and discusses the selection and elimination criteria for the reviewed articles. The study concludes with a comparative analysis of recent surveys on fault tolerance, load balancing, and scheduling, emphasizing the unique contributions of the current research.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Abbreviations
QoS
Quality of Services
VM
Virtual Machine
FT
Fault Tolerance

1 Introduction

Over the last 10 years, the use of the cloud has grown substantially. More facilities are being incorporated into the cloud environment and made accessible globally. Cloud computing companies such as IBM, Yahoo, Amazon, and Google provide customers with global access to services [1]. Moreover, these are metered services, commonly termed subscriptions, and are frequently offered under the Software as a Service (SaaS) delivery model [2].
The cloud environment consists of two components, i.e., the front end and the back end. The front end is the main interface on the consumer side and is accessed through different networks over the internet [3]. The back end is the side of the CSP (Cloud Service Provider), which provides services by utilizing data center resources. These data centers house physical machines known as servers. Multiple virtual copies of these physical machines can be created through virtualization, which handles the many incoming requests for a particular application or service from across the globe. The shareable resources can be applications, software, hardware, etc.
In cloud architecture, there are three main service layers [4]: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS) [5, 6]. Faults may occur in any of these three layers while services are being delivered to users. Therefore, detecting and removing faults is necessary for obtaining the best possible reliability, as presented in [7, 8]. Moreover, deficiencies in the cloud infrastructure have a direct impact on resource reliability and availability [4]. These deficiencies need to be critically analyzed and treated to boost reliability and robustness. A DNN (Deep Neural Network), a powerful deep learning tool, is a promising solution for this [9]. Fault tolerance is a significant technique that can detect, locate, and recover from faults and failures in the cloud environment. It makes the cloud more robust and enhances the efficiency of the environment [10]. Fault tolerance falls mainly into two sub-areas, i.e., Hardware Fault Tolerance and Software Fault Tolerance [11].
On the other hand, scheduling tasks appropriately is vital for delivering critical and essential cloud services. Ineffective task scheduling increases task execution time and waiting time. Besides, inadequate load balancing results in under- and over-utilization of resources: under-utilization wastes resources, while over-utilization degrades the performance of cloud systems. Hence, proficient load distribution is essential to boost the performance of cloud-based applications.
There is a fundamental need to incorporate load balancing and scheduling into efficient fault-handling mechanisms due to architectural challenges in the cloud system. Therefore, this paper conducts a hybrid review covering fault tolerance combined with scheduling, load balancing, and the optimization of QoS (Quality of Service) parameters. This comprehensive review primarily centers on three core classes of fault tolerance techniques, namely Reactive, Proactive, and Resilient Approaches. The Reactive Procedures are the conventional techniques of fault tolerance and include replication, detection, checkpointing/restarting, and recovery. The Proactive Methods, which include monitoring, prediction, and pre-emption, prevent the system from reaching a defective state: actions are taken to minimize defects, and the failure condition is thereby avoided. The Resilient Methods have shown a recent take-off in the literature and indicate a potential trajectory for the future of fault tolerance in cloud environments, because these methods are grounded in artificial intelligence and machine learning (ML) [10]. Besides, simulation toolkits play an important role in evaluating cloud computing setups. These toolkits allow us to simulate and evaluate cloud setups cost-efficiently without requiring massive infrastructure. Some of the most effective and powerful simulators have been discussed in [12]. A comparative analysis of various simulators with respect to several parameters has been performed in [13] to determine the features and functions of each toolkit.

1.1 Research methodology and data analysis

This section describes the methods used to perform the qualitative assessment of the literature and the sources of the state-of-the-art works considered. It also presents the methodology adopted for this research. At the end, we specify our significant contributions to this review.
The selection and elimination of published articles were based on several criteria. Related articles were first selected by analyzing their abstracts, and a critical review was performed afterward. Papers were selected based on the standing of the indexing database and of the article itself. Furthermore, inclusion was based on the following conditions.
a. Searching strategy
A systematic survey of fault tolerance combined with efficient scheduling and load distribution techniques proposed in the literature was conducted using well-known sources.
The search keywords used in this study include Cloud Resources, Fault-tolerance, Task Scheduling, Load Balancing, QoS Parameters, Resource Optimization, failure in a cloud, essential cloud services, cloud architecture, scheduling techniques, etc.
 
b. Duration and validity of study
  • This review mostly incorporates articles published from 2009 to 2023 in reputable journals, books, and conferences.
  • The distribution of the considered publications by year, from 2009 to 2023, is depicted in Fig. 1.
  • Very few studies are included from 2007 and 2008.
  • This duration was chosen to capture a comprehensive range of data, such as technical advances, economic developments, and policy changes, and to confirm the availability of data pertinent to our study that reflects the evolution, progression, and trends relevant to our objectives.
Fig. 1
Percentage of the included papers (2009 to 2023)
 
c. Language and selection/inclusion criteria
  • English was specified as the language criterion, because English is considered the primary language for scholarly publications, particularly in the fields of computer science and distributed computing. Restricting the criteria to English articles ensured that we selected high-quality and broadly recognized studies, facilitating a thorough and appropriate review.
  • The primary priority was given to hybrid fault tolerance approaches including either scheduling or load balancing.
  • Hybrid fault tolerance approaches that optimize other QoS parameters were considered as well. Figure 2 presents the detailed inclusion and exclusion of the studies.
Fig. 2
Showing the Methodology of Inclusion and exclusion criteria of the studies
 
d. Data processing and analysis
  • The data was initially organized into Excel and prepared for analysis.
  • Data categorization was made based on different QoS parameters, the environment used for simulation, types of faults considered, and other thematic considerations. This categorization helps us to analyze the literature more clearly and precisely.
  • The qualitative information was obtained by considering diverse Quality of Service (QoS) metrics, types of faults addressed, and the range of simulation environments utilized across a timeframe.
  • Furthermore, the analysis also highlights the various fault tolerance methods employed in the existing literature.
 
e. Synthesis of the analysis
  • For meaningful conclusions and insights, the data was observed based on the objectives of the study.
  • The patterns and relationships among the various studies were discussed for comparison and assessment.
 
f. Quality assessment and validation procedure
The methodology adopted for this study can be summarized in four stages:
  • First, related articles were searched for using the related keywords.
  • Some articles were selected based on title, standards, and optimization parameters.
  • The abstracts of the selected articles were read, and further inclusion and exclusion were performed.
  • Finally, the included articles were extensively reviewed, analyzed, and incorporated into this survey.
 

1.2 Motivation

Faults can lead to malfunctions that degrade a system's overall performance. Failures result in the breakdown or shutdown of a system, but occasionally faults cause performance to decline rather than shutting the system down entirely. Various fault tolerance solutions can be employed to address different types of faults, such as network, physical, and process faults. However, this is difficult to achieve without understanding where the fault resides in the architecture and the damage it has caused. The cloud is made up of layers, each of which takes services from the layers below it. A failure at any layer can therefore contaminate the layer right above it, and faults at any one layer may affect the services that the other layers provide. Thus, for high-performance computing, an appropriate fault tolerance system is needed to handle these faults effectively. Faults should be managed critically and dynamically to make the cloud environment more efficient and intelligent. Besides, in the cloud, efficient task scheduling leads to maximum utilization of virtual machines and reduced operational costs, thereby improving the QoS parameters and, eventually, overall performance. Moreover, it is essential to thoroughly examine load-balancing techniques across various settings, including static, dynamic, and nature-inspired environments.
Various methods have been suggested in the academic literature to address this concern, and numerous reviews are available for future researchers. While studying the existing surveys, we observed that they are not sufficiently thorough or wide-ranging in certain respects. Although the authors in [14] have presented a comprehensive survey of fault tolerance, it does not focus on other aspects of the cloud such as efficient load balancing and scheduling. The extensive survey in [15] focused on scheduling but lacked emphasis on fault handling and load distribution. Besides, [16] presented a vast survey focusing on load balancing across cloud resources but lacking in fault handling and cloud optimization. Similarly, [17] provides a survey emphasizing fault tolerance frameworks but does not significantly address performance enhancement of the cloud environment. The survey in [18] considers only fault-tolerant approaches and does not give prominence to major cloud aspects such as scheduling and load balancing. Similarly, the most recent survey, presented in [19], focused on both scheduling and fault tolerance but offered no means of optimal load distribution. Additionally, the observations presented in [20] were limited to a few aspects of fault-handling techniques, only crash and byzantine fault models were considered, and QoS parameters were not considered at all. Similarly, the recent survey presented in [21] was found to be limited to reliability. In other words, these reviews did not focus significantly on the discussed cloud issues of fault tolerance combined with scheduling or load balancing. After this comprehensive analysis, we observed that none of the mentioned surveys offers extensive consideration of the above-mentioned scenarios of cloud computing.
The QoS and other important aspects related to fault tolerance in the cloud are addressed by researchers in the existing surveys, but only to a very limited extent. This renders the current reviews inadequate for analyzing the state of the art in cloud systems. Hence, there is a pressing need for a survey focusing on the reliability-related aspects of the cloud, which motivated us to present this systematic and hybrid review. In this survey, we explore the landscape of hybrid fault tolerance models that focus not only on traditional fault tolerance techniques but also integrate other important cloud aspects such as scheduling and load balancing. This integration helps us to highlight the likely applications, challenges, and emerging trends.
Hybrid models that combine fault tolerance with load balancing and scheduling offer several advantages over single-scheme approaches.
  • Illustrative example:
  • Consider a scenario where the CSP hosts several services and applications for its clients and relies solely on fault tolerance mechanisms (a single-model scheme). Fault tolerance frequently redistributes the workload from faulty virtual machines (VMs) to unaffected VMs. This redistribution often upsets the load equilibrium between VMs, which leads to an unequal workload distribution and a deterioration in overall service performance. However, if the CSP implements a hybrid model that integrates multiple reliability measures, it can enhance reliability and provide robust services to its clients. In our example, a CSP employing a hybrid model that performs load balancing after the fault tolerance measures can simultaneously minimize the risk of non-uniform load distribution and other overheads associated with fault tolerance, and improve the QoS.
  • Besides, to make this emerging domain more visible to future researchers, there is a need to analyze the up-to-date methods concerning these factors [10]. This review is also inspired by peer surveys of the existing literature along with their limitations. Moreover, it presents an analysis of some important aspects of the existing literature, such as QoS, static/dynamic behavior, environmental setup used, fault tolerance approaches, and fault models, and presents the results in graphical form. The analysis offers a comprehensive perspective on the research efforts that have been the focal point of existing studies. The overall comparison of the top-cited surveys with the proposed survey is also illustrated in the subsequent sections.
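The illustrative example above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a method taken from any surveyed paper: VMs are modeled as lists of task lengths, and the names `fail_over`, `rebalance`, and `spread` are hypothetical. The fault tolerance step moves all tasks of the failed VM onto a single surviving VM, and a separate greedy step then evens out the load.

```python
from typing import Dict, List


def load(tasks: List[float]) -> float:
    """Total load of a VM, taken here as the sum of its task lengths."""
    return sum(tasks)


def spread(vms: Dict[str, List[float]]) -> float:
    """Gap between the busiest and the idlest VM."""
    loads = [load(tasks) for tasks in vms.values()]
    return max(loads) - min(loads)


def fail_over(vms: Dict[str, List[float]], failed: str) -> Dict[str, List[float]]:
    """Fault tolerance only: move every task of the failed VM to one survivor."""
    survivors = {name: list(tasks) for name, tasks in vms.items() if name != failed}
    target = min(survivors, key=lambda name: load(survivors[name]))
    survivors[target].extend(vms[failed])
    return survivors


def rebalance(vms: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Load balancing: move tasks from the busiest VM to the idlest VM
    for as long as a move narrows the gap between the two."""
    vms = {name: list(tasks) for name, tasks in vms.items()}
    while True:
        busiest = max(vms, key=lambda name: load(vms[name]))
        idlest = min(vms, key=lambda name: load(vms[name]))
        gap = load(vms[busiest]) - load(vms[idlest])
        # Only a task shorter than the gap reduces the imbalance when moved.
        movable = [t for t in vms[busiest] if 0 < t < gap]
        if not movable:
            return vms
        # Prefer the task whose length is closest to half of the gap.
        task = min(movable, key=lambda t: abs(gap / 2 - t))
        vms[busiest].remove(task)
        vms[idlest].append(task)


cluster = {"vm1": [4.0, 3.0], "vm2": [5.0, 2.0], "vm3": [6.0, 1.0]}
after_fault = fail_over(cluster, "vm3")   # fault tolerance alone
balanced = rebalance(after_fault)         # hybrid: balance after recovery
print(spread(after_fault), spread(balanced))
```

In this toy cluster, recovery alone leaves one VM with twice the load of the other, while the follow-up balancing step narrows the gap without losing any task.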

1.3 Our contribution and features of the study

The primary contributions of this survey include:
  • This article presents an in-depth examination of the cloud environment. The main faults and fault taxonomy in cloud systems are also discussed in detail.
  • Various researchers have already addressed fault tolerance and load balancing mechanisms; however, much of their work has focused on either fault tolerance or load balancing separately. The presented survey reviews fault tolerance together with two other related aspects, i.e., load balancing and scheduling, which is a pressing need at present and was found missing in the current surveys.
  • Moreover, Tables 1 and 2 present a comparative analysis of our contribution with the recent and current top-cited studies respectively.
  • The survey has been presented in two categories i.e., Fault tolerance with Scheduling and Fault tolerance with Load balancing.
  • The generalized problem formulation of fault tolerance has also been presented to understand the workings of fault tolerance using the replication technique.
  • We further outline the difficulties associated with ensuring fault tolerance integrated with scheduling and load balancing in cloud computing systems and include a thorough examination of the common problems faced. This will help future researchers to promptly recognize and understand the problems related to the study.
  • The study also presents feasible graphical observations about the literature such as parameters optimized, faults model addressed, the environmental tool used, etc. These detailed observations are presented separately for both categories and were not found in the existing surveys to the best of our knowledge. A dedicated discussion and observation section is designed for that purpose.
  • This hybrid review aids in investigating the potential challenges of hybrid fault-tolerant models and provides a detailed roadmap for future research directions. The aim is to enhance migration methods, thereby mitigating failures among nodes.
  • Moreover, the overall study provides a platform for future researchers to analyze the current state of the art regarding considered issues and find the appropriate future research problems.
  • In the end, there is a dedicated section highlighting the future research directions of the problem.
Table 1
Comparative analysis related to the contribution of the top-cited study and the proposed study
Authors
Year
Fault taxonomy
Fault tolerance
Fault tolerance approaches
Load balancing
Load balancing approaches
Scheduling
Hybrid review of scheduling and fault tolerance
Hybrid review of load balancing and fault tolerance
Fault tolerance problem formulation
Graphical representation of results
[14]
2021
 × 
 × 
 × 
 × 
 × 
 × 
[15]
2021
 × 
 × 
 × 
 × 
 × 
 × 
 × 
 × 
 × 
[16]
2021
 × 
 × 
 × 
 × 
 × 
 × 
 × 
 × 
 × 
[17]
2018
 × 
 × 
 × 
 × 
 × 
 × 
 × 
 × 
[18]
2021
 × 
 × 
 × 
 × 
 × 
 × 
 × 
 × 
[19]
2022
 × 
 × 
 × 
 × 
 × 
 × 
[20]
2020
 × 
 × 
 × 
 × 
 × 
 × 
 × 
[22]
2019
 × 
 × 
 × 
 × 
 × 
 × 
 × 
Presented Survey
_
Table 2
Description of reactive fault-tolerant techniques
| Classification/Category | Description |
|---|---|
| Check-pointing [23, 24] | The system state is saved periodically, and in case of a breakdown the job is restarted from the last checkpoint rather than from the beginning, i.e., the job is resumed from the most recent saved state |
| Retry [25] | In case of premature task termination or failure, the task is repeated on the same resource until it succeeds, without considering the reason for the fault |
| Replication [26, 27] | Replicas of tasks are created and stored at diverse places. As long as at least one replica survives, the execution of the task continues even in the presence of malfunctions and failures |
| Task Resubmission [23, 26] | The affected task is resubmitted to the same or an alternative resource [10]. Resources are lost in this technique because the unsuccessful task is re-executed repeatedly [28] |
| Job Migration [25] | The failed job is migrated from its machine to another machine |
| Rescue Workflow [25] | The system continues working in the presence of a fault until the fault prevents it from progressing further |
| Load Balancing [29–31] | The total load is distributed efficiently across machines so that no machine is under- or overloaded [32, 33]. Load balancing helps to reduce the hardware required, its cost, and the time cost, thereby improving overall task execution and efficiency [34–36] |
| N-Version and Recovery Block [37] | These are the most used fault tolerance techniques in software. N-version programming has N independent groups/developers develop N different versions of the software modules [37]. A recovery block is used in case task replication is required |
| Custom Exception Handling [10] | The developers add code or scripts to the software to handle certain errors at run time [10] |
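Two of the reactive techniques in Table 2, retry and check-pointing, are easy to illustrate. The following is a minimal sketch under stated assumptions: the "job" is simply summing a list of integers, the crash is simulated with a `RuntimeError`, the checkpoint lives in an in-memory dictionary rather than on stable storage, and the names `retry` and `process` are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def retry(task: Callable[[], int], max_attempts: int = 3) -> int:
    """Retry: re-run the task on the same resource until it succeeds
    or the attempt budget is exhausted."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return task()
        except RuntimeError as err:
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def process(items: List[int], checkpoint: Dict[str, int], interval: int,
            crash_at: Optional[int] = None) -> int:
    """Check-pointing: sum the items starting from the saved state and
    save the state again after every `interval` items."""
    index, total = checkpoint["index"], checkpoint["total"]
    while index < len(items):
        if crash_at is not None and index == crash_at:
            raise RuntimeError(f"simulated crash at item {index}")
        total += items[index]
        index += 1
        if index % interval == 0:
            checkpoint["index"], checkpoint["total"] = index, total
    return total


items = list(range(1, 11))
state = {"index": 0, "total": 0}
try:
    process(items, state, interval=3, crash_at=7)
except RuntimeError:
    pass  # the work done up to the last checkpoint survives in `state`
print(state, process(items, state, interval=3))
```

After the simulated crash, the second call resumes from the saved index instead of item 0, so only the work done since the last checkpoint is repeated.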

1.4 Organization of the paper

The overall structure of this research review article is as follows. Section 1 presents the detailed introduction of the study, with subsections 1.1 to 1.7. These subsections encompass the Research Methodology and Data Analysis, the Motivation with an illustrative example elucidating how hybrid frameworks can benefit CSPs, and the Authors' Contribution. Moreover, Section 1 also focuses on the significance of fault tolerance in the cloud, encompassing a taxonomy of faults, errors, and failures, along with the challenges associated with fault tolerance, in dedicated subsections. It also describes the specifics of scheduling, load balancing, and fault tolerance in the pursuit of reliable cloud services. Additionally, it formulates the problem associated with fault tolerance in this context. The detailed literature survey with the comparative analysis is elaborated on in Section 2. Section 3 presents the discussions and observations from the reviewed literature, together with the overall analysis of fault tolerance with both scheduling and load balancing in dedicated sections. The future directions, with open issues and future work in the related research, are highlighted in Section 4, which also includes a methodical roadmap for open challenges. Finally, Section 5 concludes the study. The organization of the study is presented in Fig. 3.
Fig. 3
Showing the organization of the study

1.5 Fault tolerance in cloud computing

Faults in any resource may affect the task execution time and QoS parameters of the cloud, which will eventually reduce the performance of the system. An efficient fault tolerance policy helps to identify and overcome errors in the cloud architecture, thereby boosting the performance metrics. The fault tolerance capability should be considered together with other techniques like scheduling and load balancing for the effective performance of the system. Moreover, the load balancing and scheduling approaches should carry out their respective functions alongside fault tolerance. In case of a crash or connection error, the system should be capable of providing an alternative VM to handle the failure for smooth and uninterrupted task execution, because a crash in any node affects the efficiency of the entire system. Therefore, handling faults enhances the ability of the system to accomplish tasks precisely and accurately by resolving internal defects as they occur [38]. Combining fault tolerance with other reliability-related techniques like scheduling and load balancing will make the cloud environment more efficient, specifically for the real-time and dynamic processing of tasks [39]. Hence, fault tolerance is a major aspect that ensures robustness, reliability, and other performance metrics in the cloud environment [40, 41].

1.5.1 Fault, error, and failure taxonomies

A fault is the condition in which the system loses the ability to produce the expected output due to an unexpected condition or defect in any of its internal or external components. The main faults within the cloud environment are enumerated as follows [42]:
  • The Network Faults: These defects arise due to network interruption in any connection, nodes, cluster, etc., [43, 44].
  • The Physical Faults: When any of the hardware resources like CPU, memory, storage, etc., fails, these types of faults will occur. The power failure also gives rise to these types of faults [42].
  • The Process Faults: These are the common faults in a cloud environment that occur because of the unavailability of any resource, software, etc., [43].
  • The Service Expiry Fault: This type of fault arises if the service time of the resource runs out while the application is in use [43].
  • The Media Fault: Any crash in the media of the cloud will lead to these types of faults [39].
  • The Processor Faults: This type of fault mainly occurs because of malfunctioning in the operating system [45].
  • The Restrictions Faults: This type of fault occurs when any fault arises and is unnoticed or ignored by the controlling or any other responsible agent [17].
  • The Parametric Faults: If the optimizing parameters are ambiguous or do not differ and remain unexplained, this type of fault occurs [17].
  • The Time Restriction Faults: These faults occur when the particular application is not completed by the predefined deadline [17].
The fault tolerance mechanism makes the cloud environment efficient by providing the necessary services even when one or multiple components fail [46, 47]. Any kind of fault in the system leads to an error, and the error, in turn, culminates in a failure.
  • Fault: The abnormal state of a system in which assigned tasks cannot be performed. Usually, the fundamental cause of this state is the presence of bugs in one or more components of the system [26, 29, 30, 48–50]. Faults are categorized into various groups, as depicted in Fig. 4.
  • Error: A system experiencing faults may transition into an error state. Compromised performance due to errors can subsequently result in partial or complete failure of the system. Errors have been classified into the categories shown in Fig. 5.
  • Failure: The presence of an error can take the system to the failure state, which has a direct effect on the user. Moreover, the user recognizes the failure by seeing the incorrect output of the system [25, 26, 30]. Failures have been classified into the categories exhibited in Fig. 6.
Fig. 4
Showing different fault categories
Fig. 5
Showing different error categories
Fig. 6
Showing different failure categories

1.6 General fault-tolerance challenges in cloud computing

Ensuring a fault-tolerant cloud environment involves evaluating numerous challenges. Some of these challenges are discussed below:
  • Task and failure heterogeneity: The cloud uses different hardware and operating systems simultaneously and must account for the underlying heterogeneous frameworks [51]. As a result, the faults to be handled are also heterogeneous, which increases the complexity of overcoming them.
  • Automation: The use of VMs in the cloud environment is increasing exponentially, and managing these platforms in real time is becoming more difficult. Therefore, there is a strong need to automate fault tolerance strategies for complex networks [15].
  • Cloud halts: The main aim of fault tolerance is to provide uninterrupted service in case of any service interruption or malfunction of a host server or network system. The Service Level Agreements [26] for all companies should be prepared accordingly.
  • Recovery Point Objective (RPO) and Recovery Time Objective (RTO) targeting: The recovery point is established to preserve the set of records that may be at risk of loss in the event of a server error [14]. The recovery time, on the other hand, is the time the system requires to get back up and running after a failure [52]. The main aim is to keep both the RPO and the RTO as low as possible [10].
  • Cloud Workload: Cloud workloads are the specific application-related tasks/services or specific amounts of work executed on a cloud resource. The workloads can be of two types, i.e., Enabled and Native workloads. Native workloads are also labeled “born on the web” and are entirely cloud-developed applications. An Enabled workload, on the other hand, pertains to the computational tasks generated by cloud applications. Moreover, the Proactive and Resilient approaches seem relevant [53] for meeting the fault tolerance requirements of both Enabled and Native workloads [10].
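To make the RPO and RTO targets in the list above concrete, the following minimal sketch checks one failure against both. It assumes that state is saved at known checkpoint times, that all times share one unit (minutes here), and that the checkpoint list is sorted; the function names `recovery_metrics` and `meets_objectives` are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple


def recovery_metrics(checkpoints: List[float], failure_time: float,
                     restored_time: float) -> Tuple[float, float]:
    """Return (data_loss_window, downtime) for a single failure.

    `checkpoints` must be sorted in ascending order.
    """
    position = bisect_right(checkpoints, failure_time)
    if position == 0:
        raise ValueError("no checkpoint precedes the failure")
    # Work done since the last checkpoint before the failure is lost.
    data_loss_window = failure_time - checkpoints[position - 1]
    # The service is unavailable from the failure until it is restored.
    downtime = restored_time - failure_time
    return data_loss_window, downtime


def meets_objectives(data_loss_window: float, downtime: float,
                     rpo: float, rto: float) -> bool:
    """True when the observed loss and downtime stay within RPO and RTO."""
    return data_loss_window <= rpo and downtime <= rto


loss, down = recovery_metrics([0, 15, 30, 45], failure_time=40, restored_time=48)
print(loss, down, meets_objectives(loss, down, rpo=15, rto=10))
```

Shortening the checkpoint interval lowers the worst-case data loss window, while faster restart procedures lower the downtime, which is why the two objectives are usually tuned separately.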

1.7 Measures for effective cloud reliability- a need for the hybrid framework

The demand for the cloud computing paradigm has grown sharply in the past few years, as it allows computing resources to be acquired and released dynamically, in a device-independent and cost-effective manner, with little effort or interaction from the service provider. Despite many enhancements, the cloud is still prone to system failures, which raises growing concern about the reliability of public cloud services. Reliability is a way of measuring the efficiency of the system, and its value can be adjusted after performing computation, where the default reliability is 100% [54]. The reliability requirements must be met for stable and efficient processing in the cloud. Reliability is also one of the critical Quality of Service constraints. Moreover, optimized QoS parameters play an important role in effective and adequate resource allocation and have been extensively studied in cloud computing. These parameters are used to assess the efficiency of various scheduling, load balancing, or fault tolerance techniques in the cloud.
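As a worked illustration of reliability as a QoS measure, the sketch below computes the reliability of a task that is replicated on several VMs. It relies on a standard textbook assumption that is not taken from the surveyed papers: replicas fail independently, so the task fails only if every replica fails. The function names are hypothetical.

```python
from math import prod
from typing import Sequence


def replicated_reliability(replica_reliabilities: Sequence[float]) -> float:
    """Probability that at least one replica completes the task,
    assuming independent replica failures."""
    for r in replica_reliabilities:
        if not 0.0 <= r <= 1.0:
            raise ValueError("each reliability must lie in [0, 1]")
    # The task fails only if every single replica fails.
    return 1.0 - prod(1.0 - r for r in replica_reliabilities)


def replicas_needed(single_reliability: float, target: float) -> int:
    """Smallest number of identical replicas that reaches the target."""
    if not 0.0 < single_reliability < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("both values must lie strictly between 0 and 1")
    count = 1
    while replicated_reliability([single_reliability] * count) < target:
        count += 1
    return count


print(replicated_reliability([0.9, 0.9]), replicas_needed(0.9, 0.995))
```

Each extra replica raises reliability but also consumes a VM, which is one reason replication interacts with scheduling and load balancing.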

1.7.1 Cloud scheduling approach

Cloud scheduling is performed by mapping each incoming task to the most suitable available VM. Scheduling is the process of determining the sequence in which tasks should be executed in the cloud while simultaneously analyzing the required QoS parameters. Cloud scheduling mainly includes the following:
  • Prediction of future incoming workloads and normalization of the QoS parameters.
  • Selection of the most suitable VM and execution of the particular task via heuristic/meta-heuristic algorithms.
Generally, the VM/task scheduling is done in two ways:
  • On-Demand Scheduling: This scheduling considers the dynamic cloud workloads on demand, and VMs are provided quickly by cloud service providers as required. However, it may lead to the problem of workload dispersal. In other words, multiple tasks may be processed by a single VM at a time (Over-provisioning Problem), degrading the performance of the system.
  • Long-Term Reservation: This scheduling reserves the resources for the long term. However, providing many VMs can lead to Under-provisioning problems in some situations.
These under- and over-provisioning problems may waste VMs and task execution time, and thereby increase the overall cost of services. Hence, a well-organized and effective provisioning technique that examines and schedules the cloud workloads efficiently is essential. Figure 7 explains the process of VM Provisioning and Scheduling (VPS) [55].
Fig. 7
VM provisioning and scheduling (VPS)
The main aims of VM provisioning are:
  • Fulfill the User’s demand without SLA violation.
  • Prior prediction of user requirements based on incoming workload size.
In cloud provisioning, the SLA is settled between the end users and the cloud service providers after the incoming workloads have been fully analyzed. Before the incoming workload (applications/tasks) is scheduled (mapped) to a particular VM or resource, the running VMs are monitored regularly for load estimation [56]. If a VM is found to be overutilized, it is temporarily disabled for future assignments and is not allocated again immediately after mapping. Afterward, the task-executing capability of the VM is also tested before any further allocation. This study also reviews various research papers focusing on the principles of load balancing and scheduling. In the cloud, efficient scheduling of jobs is the main factor in ensuring high-performance applications. However, scheduling must not only deal with the dynamism and the distributed nature of the cloud but also consider the optimization of other important parameters. Mapping refers to matching tasks to the corresponding machines and scheduling the order in which these tasks execute. Efficient mapping minimizes the total execution time of the meta-task. A meta-task is a collection of independent tasks with no inter-task dependencies. The mapping of such meta-tasks is performed statically (i.e., offline or in an analytical manner). The general problem of optimally mapping tasks to machines is NP-complete [57]. Task scheduling [58] is the fundamental step of VM management in the cloud. Task scheduling can be of two types: static and dynamic scheduling.
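Because optimal mapping is NP-complete, heuristics are used in practice. The following is a minimal sketch of static meta-task mapping with the Minimum Completion Time (MCT) heuristic, not the method of any cited paper. The function name `mct_schedule`, the task lengths (in million instructions), and the VM speeds (in MIPS) are illustrative assumptions.

```python
from typing import List, Tuple


def mct_schedule(task_lengths: List[float], vm_speeds: List[float]) -> Tuple[List[int], float]:
    """Map independent tasks to VMs with the Minimum Completion Time heuristic.

    task_lengths: task sizes (assumed unit: million instructions).
    vm_speeds:    VM processing speeds (assumed unit: MIPS).
    Returns the VM index chosen for each task and the resulting makespan.
    """
    ready_time = [0.0] * len(vm_speeds)  # time at which each VM becomes free
    mapping: List[int] = []
    for length in task_lengths:
        # Completion time on VM k = time VM k becomes free + execution time there.
        completion = [ready_time[k] + length / vm_speeds[k] for k in range(len(vm_speeds))]
        best_vm = min(range(len(vm_speeds)), key=lambda k: completion[k])
        ready_time[best_vm] = completion[best_vm]
        mapping.append(best_vm)
    # The makespan of the meta-task is the time at which the last VM finishes.
    return mapping, max(ready_time)


if __name__ == "__main__":
    mapping, makespan = mct_schedule([400.0, 200.0, 100.0], [10.0, 20.0])
    print(mapping, makespan)
```

MCT processes tasks in arrival order and is greedy, so it does not guarantee the optimal makespan; it only illustrates how a mapping decision weighs VM readiness against execution speed.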

1.7.2 Load balancing approaches

Load balancing is among the chief requirements of a cloud environment. It usually shifts load from heavily loaded VMs to lightly loaded VMs to ensure a uniform distribution of load among VMs. It aims to share the workload among computational resources to maintain load equilibrium and allow each resource to function within its designated efficiency threshold. An uneven distribution of load among VMs degrades the response time, communication overhead, throughput, and resource utilization of the system [31]. Furthermore, load balancing improves VM availability and maintains reliability. Besides, the load can be balanced by introducing resource redundancy, which supports scalability. Researchers have proposed numerous strategies to attain the best possible load balancing. Some of the advantages that motivate the implementation of load balancing in the cloud are as follows:
  • Efficient VM Utilization in a Cloud Environment: In the cloud, VMs may be unevenly loaded, which affects the overall performance of the system. A selected VM may be highly utilized while another VM remains idle throughout the process, waiting for a task. This scenario results in higher processing and waiting times. To overcome such inconsistencies, VM utilization needs to be made efficient by optimally balancing the load among resources.
  • Adequate Load Distribution: Adequate load distribution is necessary to attain the best possible system performance. It makes it possible to use the maximum computing capability of each VM and to execute tasks in parallel. Likewise, it ensures that every VM is allocated a load appropriate to its capacity under all conditions. The workload must be distributed among all VMs in proportion to their processing capacities to reduce the task execution time to the lowest possible value.
  • Minimization of Response Time: Inappropriate load distribution leads to several disparities that result in higher response time, which eventually leaves the system in an inconsistent state. Thus, it is crucial to achieve optimal load balancing to minimize the response time and enhance system throughput.
Besides, in the cloud, VMs can work independently or collectively, as required by the nature of the task. Each VM processes workload according to its processing capabilities. The prime target of load balancing is to achieve a balanced distribution of workloads among the available VMs. Typically, load-balancing algorithms comprise two elementary policies, i.e., the transfer policy and the location policy [59]. The transfer policy identifies whether a VM is overloaded. It also addresses dynamic aspects of the system and decides whether load migration needs to be initiated. Based on workload information, this policy determines when a node is ready to act as a sender, i.e., to transfer a task to another VM, and when a node acts as a receiver and retrieves a task from another VM.
The location policy, in contrast, selects a suitable under-loaded or over-loaded VM. It locates corresponding VMs and allows them to send or receive workload between them to improve the overall performance of the system. Location policies are further categorized as receiver-initiated, sender-initiated, or symmetrically initiated. The location policy chooses a partner VM for the task migration: if a VM is identified as a qualified receiver, the policy searches for a qualified sender VM, and vice versa. Once a VM qualifies as a sender or receiver, a selection policy is applied to determine which job in the queue should be moved first [31]. Based on the information and implementation used by these two policies, load balancing mechanisms are also classified as follows [60]:
  • Static Load Balancing
  • Dynamic Load Balancing
  • Adaptive Load Balancing
  • Periodic Load Balancing
  • Non-Periodic Load Balancing
  • Advance Load Balancing
Generally speaking, load-balancing algorithms can also be categorized as hierarchical, decentralized, or centralized depending on where migration decisions are made [61, 62].
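The interplay of the transfer and location policies described above can be sketched as follows for the sender-initiated case. This is a hedged illustration only: the utilization thresholds (0.8 and 0.3), the function names, and the representation of a VM as a `(load, capacity)` pair are assumptions and are not taken from the cited works.

```python
from typing import Dict, Optional, Tuple

# Hypothetical thresholds; real systems tune or adapt these values.
HIGH_THRESHOLD = 0.8
LOW_THRESHOLD = 0.3


def transfer_policy(load: float, capacity: float) -> str:
    """Classify a VM as 'sender', 'receiver', or 'neutral' from its utilization."""
    utilization = load / capacity
    if utilization > HIGH_THRESHOLD:
        return "sender"      # overloaded: should hand tasks to another VM
    if utilization < LOW_THRESHOLD:
        return "receiver"    # under-loaded: can accept tasks from another VM
    return "neutral"


def location_policy(vms: Dict[str, Tuple[float, float]], sender: str) -> Optional[str]:
    """Sender-initiated search: return the least-utilized receiver, or None."""
    candidates = [
        name
        for name, (load, capacity) in vms.items()
        if name != sender and transfer_policy(load, capacity) == "receiver"
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda name: vms[name][0] / vms[name][1])


if __name__ == "__main__":
    vms = {"vm1": (90.0, 100.0), "vm2": (20.0, 100.0), "vm3": (10.0, 100.0)}
    print(transfer_policy(*vms["vm1"]), location_policy(vms, "vm1"))
```

A receiver-initiated variant would invert the search, with an under-loaded VM looking for the most heavily loaded sender. A selection policy, not shown here, would then choose which queued task to move.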

1.7.3 Fault-tolerant approaches

The cloud is a dynamic system that supports many distributed, heterogeneous resources (VMs) that complete millions of user tasks. Moreover, these VMs can join or leave the system at any time. Thus, achieving fault tolerance is a critical issue in such dynamic systems [63]. Additionally, implementing a fault-tolerant system also helps optimize various QoS parameters and cloud characteristics, so significant benefits can be attained. It also ensures that tasks execute on time in unexpected scenarios such as failure, resource disconnection from the system, task migration, or any other unanticipated user operation. Moreover, while numerous previous studies have tackled fault tolerance and task allocation, only a limited number have examined issues at the processor level. In recent literature, a handful of works have carried out extensive research on scheduling and load balancing while incorporating fault tolerance [17]. Cloud services are abstracted into different layers, i.e., Infrastructure as a Service, Platform as a Service, and Software as a Service. Appropriate fault tolerance techniques for fault diagnosis are needed to identify the faults at these service levels. This research article includes various fault diagnosis methods corresponding to these service layers, along with fault categories. Because the layers are interrelated, a defect in any layer can affect the layer above it [17].
Moreover, to reach higher levels of robustness in cloud computing, failures need to be assessed and handled effectively [26, 29, 48, 64]. Extensive work has been proposed in the literature to make the cloud fault-tolerant. The approaches proposed in the literature can be categorized as shown in Fig. 8.
Fig. 8
Showing the categories of fault tolerance techniques under different approaches
Reactive fault tolerance
Reactive fault tolerance is applied once a fault has occurred. Using this approach, we can decrease the impact of the fault in the cloud and thereby increase the system's robustness and reliability [46, 48]. The focus is on recovering the system when a failure occurs inside it [10]. Furthermore, data replication and data transfer are used for restoration [65]. These approaches address Byzantine faults, crash faults, hardware faults, and host failures. Different fault-tolerant techniques that use a reactive approach are listed in Table 2.
Proactive fault-tolerance
This strategy provides pre-planned alternative solutions for handling faults; faults are predicted in advance rather than handled after the fact. Moreover, the faulty component is replaced with an alternative component at runtime to avoid recovery from errors and faults [4, 46, 47, 66]. This approach offers cost-effectiveness with maximum efficiency and reliability of the system [27] and addresses software and parametric faults. Some of the proactive fault-tolerant techniques proposed in the literature are listed in Table 3.
Table 3
Description of proactive fault-tolerant techniques
| Classification/Category | Description |
|---|---|
| Software Rejuvenation [25, 26] | In this strategy, the system is rebooted periodically and starts from a fresh state. It is mainly used to address the issue of aged/old devices [67] |
| Pre-emptive Migration [66, 68] | This strategy constantly observes an application to track crucial resources such as CPU and RAM [69] |
| Prediction [70] | This approach requires basic knowledge of system defects [70] |
| Monitoring [71] | This strategy more actively participates in carrying innovative resources such as planning, expanding, and migration [71] |
| Self-Healing [23, 48, 72] | This strategy mainly uses the divide-and-conquer technique to enhance the working of the system. It allows the system to classify, identify, and heal problems without the intervention of an administrator |
| SGuard [37] | The SGuard strategy primarily depends on the recovery and rollback process and is mainly proposed for sharing video services [73] |
Resilient fault-tolerance
These techniques have some similarities with the proactive approach. Faults are forecast, and their effects are prevented or mitigated by applying suitable methods. The forecasting uses intelligent learning, which distinguishes resilient techniques from proactive ones. These approaches are adopted for general faults. In this strategy, the system is continuously monitored for faults, which makes it an adaptive form of fault tolerance [10]. Some of the resilient fault-tolerant techniques proposed in the literature are presented in Table 4.
Table 4
Description of resilient fault-tolerant techniques
| Classification | Description |
|---|---|
| Machine Learning [10] | Machine learning techniques, mainly reinforcement learning [10], are used to analyze the features and characteristics of a machine. Such strategies help the system manage its faults according to its surroundings |
| Fault Induction [10] | This is a recent strategy used in cloud environments [10]. Failures are managed by making assumptions based on the reaction of the system |
In general, the reactive strategy does not require any mechanism to be enforced in the system until a fault occurs. After a fault is detected, efforts are made to mitigate its harmful effects [74]. In a proactive strategy, the system is tracked continuously to analyze faults and eliminate them before they appear; the system state is monitored continuously to predict fault occurrence in advance. In resilient strategies, the system keeps operating even in the presence of faults, and the faults are removed within a given timeframe. The pros and cons of these strategies are presented in Tables 5 and 6, respectively.
Table 5
Pros of fault-tolerant strategies
Reactive strategy
Proactive strategy
Resilient strategy
Can handle rare faults [10]
Restoration from faults restricts the susceptibility of the system
These strategies seem the future of fault tolerance
Methods like checkpointing, and restarting work well for a lengthy application [10]
The forecasting makes the system more effective [10]
This strategy is more appropriate for real-time applications [10]
The faults are discovered and eliminated continuously
This reduces the resource requirement as the system handles faults efficiently [10]
Table 6
Cons of fault-tolerant strategies
| Reactive strategy | Proactive strategy | Resilient strategy |
|---|---|---|
| These strategies cannot be applied to real-time applications | Prediction is required, and wrong predictions will degrade the performance of the strategy [10] | Frequent modification is required as the cloud itself is a highly dynamic environment |
| Restoration from failure will increase the response time significantly [10] | | Learning time is required for the agent [10] |

1.7.4 General problem formulation for fault tolerance using replication

Problem Statement: The formulation focuses on the importance of fault tolerance in cloud environments.
Problem Scope: Fault tolerance in the cloud is addressed so that service delivery continues even in the event of failures or breakdowns.
Objectives: The main goal is to reduce fault-related service interruptions and downtime in order to maximize cloud service availability. Increasing resource utilization, reducing data loss, and maintaining SLA thresholds are also included in the formulation.
Problem Constraints: Services must continue to be delivered with the required efficiency. The fault tolerance techniques should add as little overhead as possible. Moreover, the solution should be applicable to the related computational resources.
Parameters: The parameters used during fault tolerance include MTTF (Mean Time To Failure), MTBF (Mean Time Between Failures), and MTTR (Mean Time To Repair). The parameters that are optimized include average resource utilization, makespan, recovery rate, failure rate, and success rate. Fault tolerance can also involve decision parameters such as the selection of alternative resources, the fault detection algorithm, and the recovery mechanism.
Problem Formulation: For fault tolerance in real-time systems, two important sets can be considered, i.e., the task set (T) and the VM set (V). T: {t1, t2, …, tn} denotes the n real-time tasks present at any instant in the cloud environment. Each real-time task {ti | ti ∈ T} has a set of associated attributes such as arrival time, dimensions, expected execution time, anticipated finish time, anticipated harvest time, and deadline. Deadline and harvest time are related as follows:
$$\text{Exp HT} = D - \text{Min PT}$$
V: {v1, v2, …, vm} denotes the m accessible VMs in the cloud environment.
Each accessible VM {vi | vi ∈ V} has a set of associated attributes such as vm_id, capacity, and cluster.
Fault tolerance can be achieved by using any of the fault-tolerant approaches. Here, the replication technique is used: the scheduler should be able to generate the required number of replicas separately for every real-time task.
For each {ti | ti ∈ T}:
  Enable the scheduler to generate replicas.
  Allocate a VM to each replica.
  Calculate the expected finish time Fi,j,k of each replica using the following equation:
$$F_{i,j,k} = A(t_i) + w(r_i) + e(r_{i,j,k})$$
where i, j, and k denote the index of the original real-time task, the index of the current replica, and the index of the allotted VM, respectively. A is the arrival time of the real-time task, w is the waiting time of the replica, and e is the expected execution time of the replica on the allotted VM.
Further, e(ri,j,k) is computed by the following equation:
$$e(r_{i,j,k}) = \frac{\text{task dimensions}}{\text{computational power of allotted VM}}$$
After e(ri,j,k) expires, the following condition is evaluated for every real-time task:
If replica(ti) = failed
  Mark ti “failed”
Else
  Mark ti “succeeded”
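The replication steps above can be sketched in code as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the `Replica` class and function names are hypothetical, and the rule that a task is marked failed only when every one of its replicas has failed is an assumed reading of the condition above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    task_id: int            # i: index of the original real-time task
    replica_id: int         # j: index of this replica
    vm_id: int              # k: index of the allotted VM
    waiting_time: float     # w: waiting time of the replica
    failed: bool = False    # outcome observed after execution


def expected_execution_time(task_dimensions: float, vm_power: float) -> float:
    """e(r_ijk) = task dimensions / computational power of the allotted VM."""
    return task_dimensions / vm_power


def expected_finish_time(arrival: float, waiting: float,
                         task_dimensions: float, vm_power: float) -> float:
    """F_ijk = A(t_i) + w + e(r_ijk)."""
    return arrival + waiting + expected_execution_time(task_dimensions, vm_power)


def task_status(replicas: List[Replica]) -> str:
    """Assumption: the task fails only if all of its replicas failed.

    An empty replica list is also reported as 'failed'.
    """
    return "failed" if all(r.failed for r in replicas) else "succeeded"


if __name__ == "__main__":
    replicas = [
        Replica(task_id=0, replica_id=0, vm_id=0, waiting_time=1.0, failed=True),
        Replica(task_id=0, replica_id=1, vm_id=1, waiting_time=0.5, failed=False),
    ]
    print(expected_finish_time(2.0, 1.0, 100.0, 50.0), task_status(replicas))
```

Under this reading, placing replicas on different VMs is what provides tolerance: a single VM failure affects only the replicas hosted on it.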
Additionally, a reservation mechanism can also be used to achieve fault tolerance: a VM is reserved in advance and allocated if a fault occurs.
Estimation Metrics: These comprise estimates of optimization parameters such as recovery time, achieved reliability, and effectiveness of resource use under both faulty and regular operating conditions.
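As a worked illustration of the reliability parameters named in the formulation, the sketch below applies the standard textbook relations for a repairable system: MTBF = MTTF + MTTR, and steady-state availability = MTTF / (MTTF + MTTR). MTTR is taken here in its usual sense of mean time to repair. The function names and sample figures are assumptions chosen for illustration and do not come from the surveyed papers.

```python
def mean_time_between_failures(mttf: float, mttr: float) -> float:
    """MTBF for a repairable system: uptime before failure plus repair time."""
    return mttf + mttr


def steady_state_availability(mttf: float, mttr: float) -> float:
    """Fraction of time the system is operational in the long run."""
    return mttf / (mttf + mttr)


if __name__ == "__main__":
    # Hypothetical VM: fails on average after 99 hours and takes 1 hour to repair.
    print(mean_time_between_failures(99.0, 1.0))
    print(steady_state_availability(99.0, 1.0))
```

Shortening the recovery time (MTTR) raises availability even when the failure behaviour (MTTF) is unchanged, which is why recovery time appears among the estimation metrics.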
The advancement of cloud computing technology has transformed the way computing resources are provisioned, utilized, and managed. Cloud computing offers a vast array of services that are flexible, scalable, and cost-effective. To improve the utilization of cloud resources, various dynamic resource allocation algorithms have been proposed in the literature. However, ensuring fault-tolerant scheduling and load balancing is a critical challenge that needs to be addressed to provide uninterrupted services in the cloud. Virtual machine reservation is one of the promising approaches that can mitigate these challenges by allocating reserved resources for fault tolerance and load balancing.

2.1 Scheduling with fault-tolerance

Efficient scheduling in the cloud optimizes various Quality of Service parameters, especially task completion time. Besides, scalability, availability, security, and fault tolerance are key features of cloud services. Rather than causing a complete breakdown of the system, faults in the cloud lead only to performance degradation. Without fault-tolerant scheduling, when one or more components of the system fail, the task execution time, waiting time, response time, etc. may increase, which also reduces throughput. Fault tolerance, however, provides an alternative path to process completion even if some of the resources do not work properly [46, 64]. A few works in the literature have proposed fault-tolerant scheduling algorithms with optimized parameters. Recently, in [75], the Dynamic Clustering Cuckoo Whale Optimization Algorithm (DCCWOA) was suggested for supporting effective fault-tolerant scheduling in the cloud. The algorithm was tested with between 100 and 1000 tasks on 8 virtual machines. The problem of fault tolerance was also investigated in [76], where a greedy-based best fit decreasing (GBFD) algorithm was proposed to increase the success rate of task execution while optimizing other parameters. The model was evaluated with various loads from the PUMA datasets. Additionally, the computational complexity was claimed to be O(nm), where n is the number of VMs in the data center and m is the number of computing nodes. In [77], the authors proposed Grey Wolf Optimization (GWO)-based task scheduling, evaluated on the 1000MI task dataset. In that work, fault handling is carried out alongside efficient task scheduling by employing the task resubmission technique. Extending this line of work to address dependability relationships, learning automata were used in [78] to propose a self-adapting scheduling strategy, ADATSA. The model was experimentally evaluated on 53 servers with 3 master nodes and 50 slaves.
The complexity was reported to be O(NK) + O(MS), where N is the number of cluster nodes, K is the number of resource categories, M is the average number of tasks on a node, and S is the average number of state transitions. In [79], a Fault-Tolerant Hybrid Resource Allocation Model (FTHRM) was proposed that ensures fault tolerance and minimizes turnaround time (TAT). The model employs a prior reservation process to distribute resources to the respective tasks, guaranteeing their execution. Resources are also reserved for time slots, organized as needed by the task set and adjusted for VM heterogeneity. In case of resource failure, alternative resources are supplied, with preference given to the resource that has the least prior workload and the smallest execution time. The authors in [80] presented a framework for adaptive scheduling and fragmentation of tasks, namely Workflow Scheduling Applying Adaptable and Dynamic Fragmentation (WSADF), which first creates fragments according to the number of VMs in the fragmentation phase and then, in the scheduling phase, selects VMs so as to reduce bandwidth usage. WSADF was evaluated on workloads ranging from 25 to 1000 and VMs ranging from 5 to 25. To make task scheduling adaptable to both heterogeneous and homogeneous environments, CPSO and FIPS were proposed in [81]. The proposed task scheduling was evaluated on 30 servers over 1000 iterations. To integrate localized edge clouds with publicly accessible clouds and enhance scheduling effectiveness and scalability, a hierarchy-based edge cloud concept was introduced in [82]. Additionally, FTDS, a failure recovery technique, was suggested to address the issues that arise while mobile apps are being executed. For evaluation, the number of workflow applications was varied from 10 to 70 and the workflow length from 10 to 60.
Besides, some SLA (Service Level Agreement) parameters such as CPU requirement, system bandwidth, and memory need to be considered for appropriate scheduling. In this regard, a pre-emption-based algorithm was proposed in [83] that pre-empts resources from a low-priority task in favor of a high-priority task when resources are unavailable and provides reservation of resources reflecting numerous SLA parameters for service deployment. The evaluations were carried out via 4 cloud simulations by performing 10 consecutive runs with 60 requests having 10 to 15 subtasks. The cost and deadline of the tasks are considered when defining task priority. Moreover, the algorithm provides dynamic resource provisioning and an effective fault tolerance process. Along the same lines, a fault-tolerance-aware task scheduling scheme, the Checkpointed League Championship Algorithm (CPLCA), was proposed in [84]. This algorithm provides fault tolerance using the checkpointing strategy along with task migration and was evaluated using a workload in the Standard Workload Format accessible via the San Diego Supercomputer Center (SDSC). Together, efficient scheduling and fault handling can ensure task execution and thereby meet the real-time requirements of the cloud. However, heterogeneous systems and their complexity are growing dramatically, leading to failures. These failures can be mitigated by implementing efficient scheduling approaches. Therefore, the task scheduling problem on heterogeneous systems was addressed in [85]. Because the problem is NP-hard, a heuristic, the Deadline Based Scheduling Algorithm (DBSA), was proposed to solve it. The DBSA approach dynamically estimates the number of permanent failures that can be tolerated by first calculating the makespan while the system tolerates a fixed number of failures and then repeatedly comparing the makespan with the specified deadline to obtain the next number of tolerable failures.
The model was evaluated on workloads ranging from 20 to 100 with 4 and 8 VMs. Gaussian Elimination, Fast Fourier Transformation, and Molecular Dynamics Code were used as application graphs for testing. Finally, each task is mapped to the appropriate processor without violating precedence constraints. Further, in [86], the cost-effective NNCA_PSO was proposed by modifying Particle Swarm Optimization (PSO). During evaluations, the workload was varied from 70 to 560 and the number of VMs from 4 to 8. Further, the Advance Reservation Fault Tolerance Model (ARFTM) was proposed in [87], which maps the tasks using MCT and tolerates faults using the advance reservation technique. ARFTM was evaluated by varying the workload from 1 to 300.
However, [88] considered the fact that “the network bandwidth is limited” and that scheduling policies should decrease bandwidth usage in cloud computing. The authors proposed a data locality-based task scheduling approach, the Balance Reduce Algorithm (BAR), which reduces network access and thereby bandwidth usage and job completion time; the type and nature of the workload used for evaluation were not specified. Furthermore, an improved Balance Reduce Algorithm was proposed with better handling of machine failures. Later, in [41], a fault tolerance-based scheduling approach, the Dynamic Clustering League Championship Algorithm (DCLCA), was proposed to reduce the premature failure of tasks. The model was evaluated in two scenarios. In the first, a parallel workload archive containing 73,496 tasks in the Standard Workload Format accessible via the San Diego Supercomputer Center (SDSC) was used. In the second, workloads were generated from CloudSim's PlanetLab workload. All the surveyed methods are summarized in Table 7.
Table 7
Comparative analysis of recent scheduling-based fault tolerance algorithms
| Method | Year | Parameters | Comparison approaches | Outcomes | Limitations | Platform/Environment |
|---|---|---|---|---|---|---|
| HFSLM [56] | 2024 | Makespan, average resource utilization | Maxmin, Minmin, FTHRM, OLB, ELISA, MELISA | Efficient resource utilization and makespan | No security was considered | Self simulator |
| ARFTM [87] | 2023 | Reliability | MCT | Highly reliable | Inadequate load distribution | Self simulator |
| RFRTS [89] | 2024 | Reliability | FCWS, FR-MOS | Reliability with varied load | No security | Self simulator |
| DCCWOA [75] | 2023 | Makespan, failure ratio, and failure slowdown | ACO, GA, and LCA | 58.19%, 19.88%, and 29.32% for makespan, failure ratio, and failure slowdown, respectively | Limited optimization of QoS parameters | Cloudsim toolkit |
| Modified Sequential Minimal Optimization (MSMO) classifier with Delta-Checkpoint [90] | 2023 | Accuracy and prediction of faults with reliability | Related ML-based classifiers | Enhanced credibility for reliability | Reliability was not proved while comparing with MSMO | Cloud simulation 3.0.3 |
| GBFD [76] | 2022 | SERoV, average expenditure, average completion time | FCFS algorithm, Cost-Greedy Dynamic Price Scheduling (CGDPS) algorithm [4], Modified Best Fit Decreasing (MBFD) algorithm | Optimizes performance of the cloud systems | Lacks dynamic resource utilization and uniform load distribution | Cloudsim toolkit |
| GWO-based Task Scheduling [77] | 2022 | Makespan, execution time, communication delay | ANGEL, TTSA (Temporal Task Scheduling Algorithm), MapReduce Scheduling, and Dynamic Slot Scheduling | Effective task scheduling with fault tolerance is achieved with optimized parameters | Evaluations were carried out only on four tasks | CloudSim, JDK7.0 and Eclipse |
| ADATSA (Self-adapting Task Scheduling algorithm) [78] | 2022 | Adaptability in circumstances, optimization of resources, and QoS | LAEAS, PSOS, and K8S scheduling engine | Better adaptability and QoS | Lack of heterogeneity in VMs | Amazon EC2 and Apache JMeter (v 5.4.0) |
| Fault-Tolerant Hybrid Resource allocation Model (FTHRM) [79] | 2021 | Turnaround time, flow time, resource utilization | MCT | FTHRM improves TAT by 32 to 40%, lowers flow time by 26 to 45%, and provides 15 to 27% better average resource utilization than traditional MCT | The proposed system was not fully dynamic concerning the nature of tasks | Simulation via C-Language |
| WSADF [80] | 2019 | Adaptability, response time, throughput | FPD in the fragmentation phase; CTC, SLV, and QDA in the scheduling phase | Adaptable to the environment, improvements in response time and throughput | Increased delays and average response time, which eventually reduce throughput and efficiency | CloudSim simulator |
| Canonical Particle Swarm Optimization (CPSO), Fully Informed Particle Swarm Optimization (FIPS) [81] | 2019 | Throughput, utilization, adaptability | CPSO in h-DDSS (Heterogeneous Dynamic Dedicated Server Scheduling) and DDSS (Dynamic Dedicated Server Scheduling); FIPS in h-DDSS and DDSS | Scheduling is adaptable to both heterogeneous and homogeneous environments | May not manage the real-time data | Not specified |
| Fault-Tolerant Dynamic Scheduling (FTDS) [82] | 2019 | Scalability, success rate, competitive ratio | UES, IC-PCP in LIGO and Epigenomics | Improvement in scalability and performance; achieves the trade-off between cost and system delay | May consume energy | Amazon T2, RWP Model |
| Dynamic Clustering League Championship algorithm (DCLCA) [41] | 2018 | Makespan | MTCT, MAXMIN, Ant Colony Optimization, and Genetic Algorithm-based NSGA-II | In the case of 5 cloud users with 5 and 2 brokers and data centers respectively, DCLCA lowers makespan with an improvement of 57.8, 53.6, 24.3, and 13.4%, and in the case of 10 and 5 cloud users and data centers, DCLCA shows improvement of 60.0, 38.9, 31.5 and 31.2% | A limited number of cloud users, brokers, and data centers were considered | CloudSim 3.0.3 toolkit with Eclipse Luna 4.4.0 |
| Deadline Based Scheduling Algorithm (DBSA) [85] | 2018 | Makespan, reliability, and PSS (possibility of Scheduling Success) | HEFT and FTSA | DBSA can successfully endure crashes and enhance reliability within time constraints | Limited optimization of QoS parameters | Not specified |
| Nearest Neighbour Cost-Aware Particle Swarm Optimization (NNCA_PSO) [86] | 2018 | Scalability, makespan, and monetary cost | PSO and CA_PSO | High scalability, low makespan, and monetary cost | The model is less reliable | CloudSim toolkit |
| Checkpointed League Championship Algorithm (CPLCA) [84] | 2017 | Makespan and response time | Ant Colony Optimization (ACO), Genetic Algorithm (GA), and the basic League Championship Algorithm (LCA) | CPLCA produces an improvement of 41%, 33%, and 23% in makespan and 54%, 57%, and 30% in response time over ACO, GA, and LCA, respectively | Insufficient load balancing for a dynamic system | CloudSim 3.0.3 toolkit with a modified CloudAnalyst GUI |
| Improved BAR (Balance Reduce Algorithm) [88] | 2012 | Task completion time | BAR (Balance Reduce Algorithm) | Minimizes makespan even in case of failure by fault tolerance | Not suitable for heterogeneous environment | Cloudsim |
Scheduling and fault tolerance frameworks
Various scheduling and fault tolerance frameworks have been proposed in the literature. In this section, these frameworks are surveyed, and a comparative analysis is presented in Table 8.
Table 8
Comparative analysis of various Fault tolerance and scheduling frameworks
| Framework | Approach | Used techniques | Parameters | Features | Limitation |
|---|---|---|---|---|---|
| Self-healing (SHelp) [91] | Proactive | Self-healing, restarting, checkpointing | Response time, throughput, availability | Speedy functionality, fewer overheads than ASSURE | Not suitable for software faults |
| PFHC [92] | Proactive | Replication | Execution time, reliability | Lower cost, more suitable for HPC | Computational cost is very high |
| WSRC [93] | Proactive | Rejuvenation technique | Resource availability and other overheads | Improved availability and improved overheads using variable time rejuvenation | Restricted suitability |
| SRFC [94] | Proactive | Software rejuvenation | Scalability, throughput, reliability | Improved availability, multiple VM rejuvenation | Limited to software rejuvenation |
| Fault tolerance scheduling (FTDG) [95] | Proactive | Pre-emptive migration | Reliability, response time, and throughput | Minimum response time | Restrictive applications |
| AFTRC [96] | Reactive | Replication and checkpointing | Accuracy, reliability, and availability | Applicable for real-time applications | Reduced availability of resources when the load is high |
| BFTCloud [97] | Reactive | Replication | Scalability, throughput, reliability | Highly reliable and capable of tolerating all Byzantine faults | Low resource utilization |
| Fault-tolerant scheduling mechanism for real-time tasks in virtualized clouds (FESTAL) [98] | Reactive | Replication | Throughput, reliability, availability, usability | Energy-efficient resource utilization framework | Execution can crash if both central and backup fail concurrently |
| Efficient fault tolerance technique (EFTT) [99] | Resilient | Machine learning | Throughput, availability, reliability, response time | High response time, high availability and reliability, adaptive in nature | Insufficient resource utilization |

2.1.1 Proactive-based scheduling and fault tolerance frameworks

In this approach, faults are anticipated before they disrupt the system: the state of the system is monitored continuously for signs of breakdown and failure. Some of the proactive scheduling and fault tolerance frameworks found in the literature are described below:
  • SHelp [91]: SHelp improves on the existing ASSURE framework [100], which performs recovery at rescue points. ASSURE finds a rescue point by traversing the rescue-trace graph, whereas SHelp assigns each rescue point a zero-initialized weight. The weight of a rescue point increases in proportion to the number of times it is applied. Whenever a fault occurs, the rescue points are searched in decreasing order of weight.
  • PFHC [92]: This proactive fault tolerance framework was proposed for High-Performance Computing (HPC) applications in the cloud. It is built on three main modules. The Node Monitoring Module is equipped with lm-sensors [101, 102] to periodically monitor health parameters such as fan speed and CPU temperature. The Fault Tolerance Module comes into action when a fault occurs; it asks the resource provider for an alternative resource and migrates the requests to the new resource. The Controller Module, installed at every node, implements the fault-tolerant policy and is also responsible for the real-time migration of VMs.
  • WSRC [93]: This framework contains a failure detector module that periodically checks the Virtual Machine Manager (VMM) for anomalies such as delayed response times or memory mismanagement. If an anomaly is found, the running status of the VM is saved and the VMM is repaired using the rejuvenation technique. Rejuvenation generally incurs high overheads; however, WSRC uses variable-time rejuvenation to keep them under control.
  • SRFSC [94]: This framework uses the software rejuvenation technique and works in three phases. In the first phase, Aging Failure Detection, a packet carrying information about the CPU and memory usage of the VM is received. The second phase, Aging Degree Evaluation, assesses the VM for signs of failure on two main aspects: CPU/memory usage and packet arrival, i.e., whether the packet arrived earlier or later than expected or was lost. In the third and final phase, the rejuvenation decision is taken: the VMs are migrated to another native VM and the original VM is rebooted.
  • FTDG [95]: FTDG is a fault-tolerant framework that performs pre-emptive relocation. Its architecture comprises four functional spaces. In User Space, users submit their data flows. Graph Space transforms the submitted data flows into Directed Acyclic Graphs (DAGs), which are then analyzed for critical and non-critical paths. In Storm Space, the scheduling and fault tolerance mechanisms are applied. Hardware Space contains the various data center resources.
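The weight-based rescue-point search described for SHelp can be illustrated with a minimal Python sketch. It is not the SHelp implementation: the class names, the rescue-point labels, and the choice to raise the weight by one per use are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RescuePoint:
    """A candidate recovery location; the name is a hypothetical label."""
    name: str
    weight: int = 0  # zero-initialized, as described for SHelp


@dataclass
class RescuePointTable:
    points: List[RescuePoint] = field(default_factory=list)

    def add(self, name: str) -> None:
        self.points.append(RescuePoint(name))

    def record_use(self, name: str) -> None:
        """Raise the weight of a rescue point each time it is applied.

        Assumption: a fixed increment of one per successful use.
        """
        for point in self.points:
            if point.name == name:
                point.weight += 1
                return
        raise KeyError(name)

    def search_order(self) -> List[str]:
        """Return rescue-point names in decreasing order of weight."""
        # sorted() is stable, so points with equal weight keep insertion order.
        ranked = sorted(self.points, key=lambda p: p.weight, reverse=True)
        return [p.name for p in ranked]


table = RescuePointTable()
for label in ("parse_request", "open_socket", "read_config"):
    table.add(label)

# Suppose "read_config" recovered two faults and "open_socket" one.
table.record_use("read_config")
table.record_use("read_config")
table.record_use("open_socket")

order = table.search_order()
print(order)
```

With this ordering, the rescue point that has recovered the most faults so far is tried first the next time a fault occurs.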

2.1.2 Reactive-based scheduling and fault tolerance frameworks

In such frameworks, faults are handled once they occur. Unlike proactive approaches, continuous monitoring of system behavior is not required. Some of the reactive scheduling and fault tolerance frameworks found in the literature are described below.
  • AFTRC [96]: In AFTRC (Adaptive Fault Tolerance in Real-time Cloud Computing), received tasks are held in an input buffer and executed on a First Come First Serve basis. The model also contains further modules. The Acceptance Test (AT) verifies the result of each embedded algorithm for accuracy. The Time Checker (TC) checks whether the result was obtained within the deadline; if not, the task is sent back to the input buffer. The Reliability Assessor (RA) adjusts the reliability of each VM based on the results obtained. The Decision Mechanism (DM) selects the output of the node with the highest reliability.
  • BFTCloud [97]: This framework uses replication and completes user requests on time even in the presence of faults. The number of replicas/nodes is determined from the failure probability of the nodes: the failure probability of the replica group should always remain below the top-level failure probability. BFTCloud works in five phases. Primary Selection: the primary node is chosen by rating each node, where the rating adds the priority weight and the QoS value assigned to the node; the node with the highest rating becomes the primary. Replica Selection: the number of replicas is selected by observing the QoS of every node from the viewpoint of both the primary node and the cloud module; the new QoS is calculated and the nodes are rated again. Request Execution: the nodes complete the request and respond to the cloud module, which checks the consistency of the results based on different cases [17]. If the results are consistent, the primary replica is assigned the next request. Primary Updating: if the primary replica is faulty, this phase informs all other replicas so that an alternative is selected. Replica Updating: this phase removes the faulty replica and adds new nodes to decrease the failure probability.
  • FESTAL [98]: It is a fault-tolerant scheduling framework in which the primary-backup technique is used to handle faults. User tasks are queued in an input buffer and assigned to the scheduler, which has three controllers: the Resource Controller, the Backup Copy Controller, and the Real-time Controller.
The Backup Copy Controller creates the backup copy. Afterward, the Resource Controller searches for two VMs that can complete the task within the deadline. Based on the search results, one of two decisions is made.
  • If two such VMs are found, both task instances are scheduled on the respective VMs.
  • If no such VMs are found, the task is rejected.
In this framework, if the anticipated end time is less than or equal to the task deadline, a passive backup is used; otherwise, an active backup is employed.
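The FESTAL admission and backup decision can be sketched as follows. This is an illustrative approximation rather than the published algorithm: the `VM` fields, the earliest-finish selection of the two VMs, and the reading of "anticipated end time" as the finish time of a backup that starts only after the primary should have finished are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class VM:
    name: str
    ready_time: float  # time at which the VM becomes free (hypothetical field)
    speed: float       # work units processed per time unit (hypothetical field)

    def finish_time(self, length: float) -> float:
        return self.ready_time + length / self.speed


def schedule(length: float, deadline: float,
             vms: List[VM]) -> Optional[Tuple[str, str, str]]:
    """Return (primary VM, backup VM, backup mode), or None if rejected."""
    # Resource Controller step: keep only VMs that can meet the deadline.
    feasible = [vm for vm in vms if vm.finish_time(length) <= deadline]
    if len(feasible) < 2:
        return None  # two suitable VMs were not found: reject the task
    feasible.sort(key=lambda vm: vm.finish_time(length))
    primary, backup = feasible[0], feasible[1]

    # Assumption: a passive backup would start only once the primary
    # should have finished, so its anticipated end time is measured from there.
    start = max(primary.finish_time(length), backup.ready_time)
    anticipated_end = start + length / backup.speed
    mode = "passive" if anticipated_end <= deadline else "active"
    return primary.name, backup.name, mode


vms = [VM("vm1", 0.0, 2.0), VM("vm2", 1.0, 1.0), VM("vm3", 0.0, 0.5)]
print(schedule(4.0, 10.0, vms))  # loose deadline
print(schedule(4.0, 5.0, vms))   # tight deadline
print(schedule(4.0, 3.0, vms))   # only one VM is fast enough
```

Under these assumptions, a loose deadline leaves room for the backup to wait for the primary (passive), a tight deadline forces both copies to run concurrently (active), and a task is rejected when fewer than two VMs can meet the deadline.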

2.1.3 Resilient-based scheduling and fault tolerance frameworks

These techniques share some similarities with the proactive approach: faults are forecast, and their effects are prevented or mitigated by applying suitable methodologies. What distinguishes resilient techniques from proactive ones is that the forecasting relies on intelligent learning. Compared to conventional fault tolerance techniques, resilient fault tolerance provides increased durability and adaptability in the event of system breakdowns. Some of the advantages of resilient fault tolerance over traditional fault tolerance are:
  • Dynamic environment: Resilient systems can bounce back from errors without sacrificing functionality because they adjust dynamically to shifting circumstances. They are designed to respond quickly to changing threats and difficulties. Conventional fault tolerance techniques, in contrast, may find it difficult to adjust to sudden or rapid shifts in the environment, and they may not react as well to new kinds of errors since they frequently rely on predetermined rules.
  • Recovery: Resilient systems often include automated recovery mechanisms that promptly detect and fix errors without the need for human interaction. This reduces the effect on coordinated functions and decreases downtime. Traditional approaches, on the other hand, may need more manual intervention to recover from errors, which can result in longer recovery times and a higher chance of service interruption.
  • Real-time track reporting: Sophisticated analytics and tracking techniques, which offer practical observations on the system's performance, are frequently integrated into resilient systems. These techniques also make active defect identification and prevention possible. In contrast, conventional approaches may be less successful in locating and addressing errors because they depend on periodic checks or event-triggered reactions.
  • Optimization: Resilient systems maximize the use of the available resources during fault recovery, ensuring that resources are distributed effectively to sustain critical operations. Traditional techniques, by comparison, may use expensive strategies, which can make the overall system less efficient and less effective.
  • Flexibility and adaptability: Resilient designs frequently display improved adaptability and flexibility, enabling them to adapt to changing demands and scale resources up or down in response to consumption. Traditional approaches, however, may find it difficult to adjust dynamically or cope with shifting demands, which can result in inefficiencies during times of high consumption.
In summary, resilient fault tolerance lets systems recover swiftly and efficiently in dynamic contexts by providing a more systematic and flexible approach to addressing failures. Compared to conventional fault tolerance techniques, this strategy frequently results in better overall performance, fewer interruptions, and higher system efficiency.
In this context, EFTT (Efficient Fault Tolerance Technique) is a resilient approach. In [99], the author used Machine Learning (ML) to handle faults and generate solutions for FT; ML was, nevertheless, applied only as a sub-component of the overall FT solution. Other solutions have employed ML intensively to forecast faults from a set of specified variables, and many applications use ML to handle hardware faults. Here, artificial intelligence, or machine learning, is used to create a system that can operate autonomously, as a human would, without the need for human intervention. Machine learning procedures can thus be used to increase a system's reliability in the context of fault tolerance. Such fault tolerance techniques are known as Resilient Fault-Tolerant Techniques, as discussed in Section 5.3. Machine learning techniques are typically used in proactive approaches, predicting failures before they happen by using historical system data. Resilient techniques for fault tolerance are emerging because ML can access data and learn from it. One such learning method, reinforcement learning, was applied in [103] to study the fitness of VMs in cloud environments; with this type of learning, every VM participates in the learning process independently. As recommended in [104], fault tolerance in a distributed or parallel learning system is achieved by constantly tracking the input parameters in the server. Here, the entire system returns to the most recent checkpoint following an error. Such systems do not take a checkpoint at every stage, even in the presence of a high number of calls and activity in the network.
Fault forecasting is well established in fault identification and handling, as stated in [105]. Quick error detection can prevent more serious system failures. This operation comprises numerous processes, and recent research includes quantitative model-based, qualitative model-based, and history-based approaches. Apart from reinforcement learning, unsupervised learning is an additional technique that recognizes patterns in data without a predefined output [106]. Because unsupervised learning lacks an output target, such techniques do not allow the outcome to be estimated; instead, the algorithms rely on their own ability to extract as much detail as they can from the data. Deep learning techniques were proposed in [107] as a rapid way to identify multicriteria errors in complex industrial analysis. Fault tolerance can benefit from the application of such AI-related techniques.
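The checkpoint behaviour attributed to [104], where the system returns to the most recent checkpoint after an error and checkpoints are not taken at every stage, can be illustrated with a small sketch. The function name, the integer "state", and the rule that an injected fault strikes a step only once are assumptions for illustration; they are not taken from [104].

```python
import copy
from typing import Dict, Set


def run_with_checkpoints(total_steps: int, interval: int,
                         fail_at: Set[int]) -> Dict[str, int]:
    """Run `total_steps` steps, checkpointing every `interval` steps.

    `fail_at` lists the steps at which a fault is injected (once each).
    Returns the final accumulated value and the number of step attempts.
    """
    state = {"step": 0, "value": 0}
    checkpoint = copy.deepcopy(state)   # initial checkpoint
    pending_failures = set(fail_at)
    executed = 0

    while state["step"] < total_steps:
        next_step = state["step"] + 1
        executed += 1
        if next_step in pending_failures:
            # Fault: discard progress and return to the latest checkpoint.
            pending_failures.discard(next_step)
            state = copy.deepcopy(checkpoint)
            continue
        state["step"] = next_step
        state["value"] += next_step     # stand-in for real work
        if next_step % interval == 0:
            checkpoint = copy.deepcopy(state)

    return {"value": state["value"], "executed": executed}


sparse = run_with_checkpoints(total_steps=10, interval=4, fail_at={7})
dense = run_with_checkpoints(total_steps=10, interval=1, fail_at={7})
print(sparse, dense)
```

Both runs reach the same final value, but the run with sparse checkpoints repeats the steps completed since its last checkpoint, which illustrates the trade-off between checkpoint overhead and recomputation after a fault.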
Fault Induction: In this resilient technique, failures are managed by making assumptions based on the reaction of the system. Building on this technique, [108] proposed applying a multi-source power management technique in a practical hybrid energy system; the analysis shows how to improve fault tolerance, scalability, efficiency, and dependability. The concepts proposed in [99] are used by some of the most well-known firms in the world, including Google and Amazon, to increase their fault tolerance. Here, the authors employed software named GameDay, which is intended to highlight significant shortcomings in the methods for finding flaws and dependencies between different components of the system. In a GameDay scenario, team members from every level of the business must collaborate to find a solution. If everything goes as planned in the repeatable tests, the GameDay activity is considered successful. Similarly, the authors in [109] employed game theory and stated that a smart grid operator of this kind can swiftly supply electricity through a distributed system. Additionally, several classifiers were compared on metrics such as accuracy and fault prediction in [110].

2.2 Load balancing with fault tolerance

Load balancing with fault tolerance is a significant challenge in cloud computing. Efficient load balancing techniques must also include fault tolerance capability. This enables the system to distribute the load uniformly across all available nodes while simultaneously detecting and removing faults, maximizing the performance and efficiency of the cloud environment. Various algorithms are surveyed and presented in Table 9.
The authors of [111] introduced Honeybee Inspired-Load Balancing (HBI-LB), a reliable, nature-inspired fault-tolerant load-balancing approach. The method was evaluated with 100 to 500 tasks of length 2000 to 10000, using 10 fog centers and 15 fog nodes. Tasks that have already been assigned pass information about the status and load of the resources to the tasks still being scheduled, in the same way as honeybees inform one another about their position. In [112], the Proactive and Reactive Fault Tolerance Framework (PFTF) was proposed with the Elastic Cloud Balancer (ECB). It avoids situations in which some cloud nodes are idle or lightly loaded while others are overloaded. The proposed ECB enhances scheduling quality in combination with Job Shop Scheduling by considering and optimizing QoS parameters. The model was evaluated with 9 to 13 tasks and task sizes of 1000 MB to 8000 MB. Additionally, due to the dynamic nature of cloud infrastructure, real-time features such as availability and reliability need to be achieved. In this vein, Proactive Load Balancing Fault Tolerance (PLBFT) was proposed in [113] as an efficient fault-tolerant load-balancing model and evaluated on a private cloud platform. This model migrates the faulty VM directly to another destination host. If the destination is overloaded, its load is managed before the defective VM is migrated there. This approach shows higher reliability than other similar techniques.
Load balancing and fault tolerance techniques are designed to provide highly reliable and available services. To further increase the availability of cloud services, a combination of load-balancing and fault-tolerant techniques was proposed in [114]. The proposed model is highly reliable in case of task failure; it was evaluated with 13 to 18 tasks, task execution times of 1 to 9, and task priorities of 1 to 3 on four VMs. Moreover, in [115], Deadline-Based Pre-emptive Scheduling (DBPS) was proposed based on cloud partitioning, where fault tolerance is achieved by Throttled Load Balancing for Cloud (TLBC). The model was tested on a workload of 10 to 300, although the number of VMs was not specified. In contrast, a machine learning-based approach, Fault-tolerance Load Balancing (FTLB), was proposed in [99]; it embeds fault tolerance in load balancing while optimizing other QoS parameters. The evaluation was performed using 100 computing cycles on three VMs. Furthermore, an Integrated Virtualized Failover Strategy (IVFS) similar to AFTRC was proposed in [116]. It employs replication and checkpoint-restart: a Cloud Load Balancer (CLB) was added to AFTRC, and checkpointing was carried out by implementing the Reward Renewal Process (RRP) [117]. Once the load was received, the Cloud Controller (CC) transferred it to the CLB. The main job of the CLB was to replicate the load on a suitable VM, based on load information, in case of failure.
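The migration step described for PLBFT, in which the load at the destination is managed before the faulty VM is moved there, can be sketched roughly as below. This is not the PLBFT algorithm from [113]: the uniform host capacity, the least-loaded destination choice, and the rule for shedding excess load to other hosts are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def migrate_faulty_vm(loads: Dict[str, float], source: str, vm_load: float,
                      capacity: float) -> Tuple[Dict[str, float], List[str]]:
    """Move `vm_load` off `source`, relieving the destination first if needed."""
    loads = dict(loads)  # work on a copy so the caller's data is untouched
    log: List[str] = []
    candidates = [h for h in loads if h != source]
    if not candidates:
        raise ValueError("no destination host available")

    # Assumption: the least-loaded host is chosen as the destination.
    dest = min(candidates, key=lambda h: loads[h])

    excess = loads[dest] + vm_load - capacity
    if excess > 0:
        # The destination would be overloaded: shed the excess elsewhere first.
        others = sorted((h for h in candidates if h != dest),
                        key=lambda h: loads[h])
        for other in others:
            moved = min(capacity - loads[other], excess)
            if moved > 0:
                loads[other] += moved
                loads[dest] -= moved
                excess -= moved
                log.append(f"rebalance {moved:g} from {dest} to {other}")
            if excess <= 0:
                break
        if excess > 0:
            raise RuntimeError("insufficient capacity for migration")

    loads[source] -= vm_load
    loads[dest] += vm_load
    log.append(f"migrate {vm_load:g} from {source} to {dest}")
    return loads, log


new_loads, actions = migrate_faulty_vm(
    {"h1": 90.0, "h2": 50.0, "h3": 60.0}, source="h1", vm_load=60.0,
    capacity=100.0)
print(new_loads)
print(actions)
```

In this example the destination first hands part of its load to a third host, and only then receives the faulty VM, so no host exceeds the assumed capacity at any point.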
Table 9
Comparative analysis of different proposed fault tolerance and load balancing algorithms
| Model/Technique | Year | Parameters | Compared with | Outcomes | Advantages | Limitations | Platform/Environment |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Honeybee Inspired-load Balancing (HBI-LB) [111] | 2022 | Average response time | Round Robin (RR), Throttled (TH), and Equally Spread Current Execution Load (ESCEL) | Average response time was optimized than compared approaches | Maintains load equilibrium | The model was not evaluated on a large task scale. No fault-tolerant parameter was considered | CloudSim 3.0.3-based Cloud Analyst tool |
| Proactive Load Balance Fault Tolerance (PLBFT) [113] | 2021 | Execution time, reliability | Adaptive Fault Tolerance in Real-Time Cloud (AFTRC) | PLBFT achieved the highest reliability calculations than AFTRC | Better Fault prediction and tolerance | An increased number of migrations was observed which maximized execution time | Cloud Simulator |
| Load balancing with fault tolerance algorithm using Replication technique [115] | 2021 | Availability, Resource Utilization | Fault Tolerance Workflow Scheduling the FTWS [17] | Efficient task scheduling along with fault-tolerance | Optimized Availability and System Performance | Poor resource utilization | Amazon EC2 |
| Proactive Fault Tolerance Framework (PFTF) [112] | 2017 | Execution time, Network congestion, cost | High-Performance Linpack (HPL), Honeybee Foraging Algorithm | Improved execution time and time delay | Network congestion delay was reduced by 47%, Reducing the cost | No consideration of Resilient Fault Tolerance | CloudSim 3.0 tool |
| CLBC (Load Balancer), Deadline Based Pre-Emptive Scheduling (DBPS) [115] | 2014 | Throughput, Completion time, execution time, and computational costs | Traditional related algorithms | The computational cost was minimized | Effective Load Balancing | Not suitable for deadline-based task accomplishment | Cloudsim |
| FTLB [99] | 2017 | Throughput, Availability, Reliability, Response time | Ant Colony, Osmosis LB, Honeybee Foraging, Artificial Bee Colony | High Response time, High Availability, and Reliability | Adaptive nature | Slow in function | Not specified |
| Integrated Virtualized Failover strategy (IVFS) [116] | 2016 | Pass Rate, Task Finish Time | Virtualization and Fault Tolerance Approach (VFT) [118] | High Node Pass Rate and Less Service Task Finish Time | High fault-tolerant, both forward and backward recovery | Not suitable for the large-scale environment | |
The comparative analysis of different fault tolerance-based load-balanced algorithms is presented in Table 10. These algorithms were proposed to distribute the workload regardless of faults across the nodes, i.e., having the capacity to handle the faults.
Table 10
Comparative analysis of fault-tolerant-based load-balancing algorithms
| Algorithm | Year | Parameters | Outcomes | Limitations | Platform/Environment |
| --- | --- | --- | --- | --- | --- |
| Hybrid Load Balancing [119] | 2017 | Response Time | Minimizes response time and overloading situations | Lacks migration in case of failure | Cloudsim |
| Throttled algorithm and Equally Spread Current Execution algorithms (TA & ESCE) [120] | 2017 | Waiting Time, Turnaround Time, Resource Utilization | Turnaround time and wait time were reduced and resource utilization was enhanced | Lacks migration technique for performance optimization | Cloud Simulator |
| Starvation Threshold-based Load Balancing (STLB) [121] | 2019 | Response Time | Increases in resource utilization rate, Minimizing migration cost and response time | Not suitable for dependent tasks | CloudSim |
| Enhanced LB (TA & ESCE) [122] | 2017 | Response Time | Evades overloading, reduced cost, and response time | Not optimizing other QoS parameter | CloudSim |
| Genetic Algorithm and the gravitational emulation local search GA-GEL [123] | 2015 | Makespan | Reduced Makespan | Uneven Load Distribution | CloudAnalyst |
| LBHM [124] | 2018 | Response Time, Processing Time | Processing and response time were reduced | Increases Execution Time of the VM | CloudSim 3.0.3 |
| LB strategy based on AC [125] | 2014 | Response Time | Reduces response time | Not optimizing other QoS parameter | CloudAnalyst |
| VM-Assign Load Balancing [126] | 2014 | Resource Utilization | Enhances Resource utilization | No Dynamism was considered | CloudAnalyst |
| Modified optimize response time [127] | 2016 | Response Time | Response time was enhanced | Insufficient load distribution | Not Specified |
| Weighted active monitoring load balancing (WAMLB) [128] | 2018 | Resource Utilization | Effective resource utilization | Not optimizing other QoS parameter | CloudAnalyst |
| Priority-based modified throttled algorithm (PMTA) [129] | 2016 | Response Time | Balanced load distribution and minimized response time | Starvation for low-priority tasks | CloudSim3.0 and CloudSim-based tool |
| Enhanced LB (TA & ESCE) [122] | 2017 | Response Time & Machine Cost | Uniform load distribution with less cost | No considered weakness found | CloudAnalyst |
| LBHM [124] | 2018 | Response time, processing time & Algorithm cost | Cost-effective | The current workload of VM is not studied | CloudAnalyst |
| MET [130] | 2017 | Response Time, Cost | Cost-effective and minimum response time | May sometimes increase processing time | CloudAnalyst |
| Hybrid Approach (TA & ESCE) [131] | 2019 | Response Time, Processing Time, Cost | Cost-effective and minimum response time | Does not include any fault-tolerant strategy | Cloud sim |
| Improved WRR (weighted Round Robin) [132] | 2018 | Processing time, and cost | Avoid starvation and cost-efficient | The current workload of VM is not studied and lacks fault handling | Eclipse framework |
| STLB [121] | 2019 | Resource Utilization and overall cost | Increased utilization rate and Dropped overall migration cost | In-appropriate for dependent workload | CloudSim |
| LB Strategy [133] | 2014 | Availability | Uniform workload distribution and high availability | Increased response time because of FCFS allocation | CloudSim |
| Token-bucket rate-limiting technique [134] | 2023 | Availability and Scalability | Good quality of services to customers | May cause load imbalance | Zuul gateway |
| Cuckoo optimization-based energy-reliability aware resource scheduling technique (CRUZE) [134] | 2020 | Cloud service availability, energy consumption | Reducing energy consumption and increasing availability | May cause load imbalance | CloudSim toolkit |
| Single intervention at random interval (SIRI) strategy [135] | 2023 | Service Availability, penalty rate | Prevents SLA violations and offers high service availability | May cause load imbalance | Amazon EC2 |
| Backpropagation (BP)-based OnlIne hardware fault Diagnosis System has been built, named BOIDS [136] | 2020 | Hardware faults (transient, intermittent, and permanent faults) | More than 97% accuracy in diagnosing hardware faults | Only hardware fault models are considered | SpecInt2000 and MiBench set to 1c1t (1 core 1 thread) |
| PSO, Round Robin, (ESCE) Equally Spread Current Execution, Throttled Load balancing [137] | 2023 | Response time, Processing time of data center | Identified the valuable relation between VMs and tasks | Lacks the dynamism of circumstances | cloud analyst platform |

3 Discussions and observations

The presented survey summarizes the focus of researchers on distinct hybrid fault tolerance-related frameworks. The main emerging and developing methods of fault tolerance in a cloud environment fall into three categories: Reactive Methods, Proactive Methods, and Resilient Methods. The survey covered two main hybrid fault-tolerant categories, i.e., scheduling with fault tolerance and load balancing with fault tolerance. The observations gathered from the survey are listed below.

3.1 Statistics of hybrid survey of scheduling and fault tolerance algorithms

While dealing with hybrid frameworks of scheduling and fault tolerance, researchers have focused on all three fault tolerance approaches, i.e., Reactive, Proactive, and Resilient. However, more emphasis is observed on Proactive approaches and less on Resilient ones. The related statistics of these approaches are depicted in Fig. 9.
Fig. 9
Showing different fault-tolerance approaches targeted by researchers
Moreover, different techniques such as Replication, Migration, and Rejuvenation have also been employed while dealing with this hybrid framework. Replication techniques are mainly used for reactive approaches. On the other hand, Migration and Rejuvenation techniques are utilized for proactive approaches. It is also observed from the literature that replication and migration techniques were more frequently used to address the faults in the cloud. Moreover, self-healing and checkpoint restart techniques are used by the SHelp framework. The statistics of different approaches employed for Reactive, Proactive, and Resilient strategies in this hybrid framework are depicted in Fig. 10.
Fig. 10
Showing category-wise percentage of different techniques used in different fault-tolerant approaches
It is also noticed from the presented survey that different types of faults have been handled by using hybrid fault-tolerant scheduling and load-balancing frameworks. Moreover, it was observed that software faults, hardware faults, parametric faults, and crashes were resolved using a proactive approach. The reactive approach addressed configuration faults, parametric faults, byzantine faults, participant faults, and host failures. Likewise, resilient approaches are utilized to manage general faults. Additionally, the overall statistics of different faults handled by considered hybrid frameworks are depicted in Fig. 11.
Fig. 11
Showing the percentage of optimized parameters in surveyed scheduling and fault tolerance
The statistics of the fault models in the surveyed articles show that researchers focus mostly on software faults, whereas transient, intermittent, and permanent faults have received less attention. For several strong reasons, addressing these kinds of faults is essential in distributed systems and applications. First, transient faults, which are brief interruptions in system performance, are unpredictable, so proactive steps are required to guarantee system resilience. To reduce downtime and provide a consistent user experience, organizations must recognize and address transient issues. Second, intermittent faults, characterized by irregular disruptions that might happen at any time, are a major threat to system reliability. They must be managed effectively to avoid cascading failures and to guarantee the stability of the executions needed to preserve the system's overall integrity. Third, the seriousness of permanent faults cannot be overstated. These enduring problems may cause the system to deteriorate over time, affecting system operation and SLAs. Resolving permanent faults is therefore essential for maintaining the system's lifespan and functionality, while ignoring them might cause irrevocable harm and compromise the overall sustainability of the system. Finally, maintaining system continuity, robustness, and reliability is the primary reason for managing the hardware faults discussed here. In the end, proactive fault management techniques contribute to uninterrupted system and application performance in the face of unexpected obstacles by protecting the integrity of crucial operations and improving SLAs, and thereby user experience and satisfaction.

3.2 Statistics of hybrid survey of load balancing and fault tolerance algorithms

This survey also shows that researchers have focused on optimizing various parameters simultaneously with fault tolerance. Response time was considered and optimized more frequently than other QoS parameters, while task waiting time and computational cost received the least consideration. Based on this survey, the statistics of the optimized parameters are presented in Fig. 12. In addition, the considered frameworks cover both dynamic and static environments, and researchers favor dynamic algorithms over static ones. Figure 13 depicts the statistics of the surveyed models that support dynamism.
Fig. 12
Showing the percentage of optimized parameters in surveyed load balancing and fault tolerance frameworks
Fig. 13
Showing the percentage of dynamism in surveyed hybrid load balancing and fault tolerance frameworks
The analysis was carried out for the parameter optimization of the reliable cloud. Figure 14 presents the degree of optimization in metrics of scheduling with fault tolerance, scheduling with load balancing, fault tolerance, load balancing, and scheduling. Additionally, parameter optimization analysis of various fault-tolerant approaches from the literature was also conducted and presented in Fig. 15.
Fig. 14
Showing the analyses of parameter optimizations for different cloud reliability measures
Fig. 15
Showing the percentage of parameter optimizations for different fault-tolerant approaches
Finally, the observations regarding the platform or environment used for simulation in the presented surveys are statistically presented in Fig. 16.
Fig. 16
Showing the percentage of tools used for simulation by the researchers

4 Forthcoming research directions and open issues

The reviewed state of the art shows that important QoS parameters other than response time receive little attention. Parameters such as makespan, turnaround time, waiting time, flowtime, resource utilization, and accuracy also need to be considered. Furthermore, other faults, such as byzantine faults and system crashes, are not examined much in hybrid fault tolerance algorithms. It is therefore necessary to enhance the performance of these hybrid fault tolerance algorithms by addressing these limitations in forthcoming research. Moreover, researchers should focus on the aspects listed below to overcome the limitations of existing techniques [138, 139].
  • Focus more on resilient fault tolerance.
  • Focus on the computational cost along with fault tolerance.
  • Identify and predict the faults accurately.
  • Resolve faults with load balancing and scheduling.
  • Fault handling with optimization of other QoS parameters.

4.1 Future works

After careful consideration and assessment, it is concluded that several research directions could be pursued to raise the performance of cloud computing and improve the optimization of the QoS parameters of cloud systems. They are listed below:
1. Scheduling can be made more efficient to achieve better makespan and average resource utilization.
2. The assessed state of the art shows that, except for response time, certain crucial QoS criteria are not being prioritized. Additional factors, including turnaround time, waiting time, flow time, resource utilization, and accuracy, also need to be taken into account.
3. To improve task execution time and scheduling, a large body of research focuses on discovering resource and workload identification criteria. For workloads to be adaptive, scalable, and optimal, both underuse and overuse of resources should be avoided.
4. Task relocation requires a sender-initiated load balancing mechanism that helps distribute load uniformly among dispersed nodes.
5. Reservation can be used for fault tolerance, as suggested in [87], to ensure complete execution of tasks: resources are reserved well in advance and can be used in case of faults.
6. To attain QoS-optimization-based allocation, it is essential to concentrate on limiting penalties while taking system failures into account.
7. Only a few scheduling methods include the availability feature, which is highly dependent on VM failure and on changes in the impact rate of users. To decrease VM failure, this parameter should therefore be given further consideration in later algorithms.
8. Penalties arising from faults can be minimized by pairing the models with efficient load-balancing techniques.
9. Examining several methods makes it clear that a task scheduling algorithm by itself cannot address all the issues. Most algorithms base their scheduling on only a few factors. One method, for instance, considers only the response time and execution time parameters and overlooks other QoS criteria such as execution cost, dependability, and utilization. An improved scheduling algorithm that produces better results may therefore be developed by including more criteria.
10. Future studies should consider scalability, elasticity, and fault-related overheads, which are the properties that allow a system to adapt to a given situation.
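
As a rough illustration of items 1 and 4, the sketch below shows one possible sender-initiated policy, together with a way to measure makespan and average resource utilization before and after task relocation. The `Node` class, the `threshold` parameter, and the smallest-task-first relocation rule are assumptions made for this example and are not taken from any of the surveyed algorithms. Average utilization is assumed here to be the total busy time divided by the number of nodes times the makespan.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical worker node; speed is in work units per second."""
    name: str
    speed: float
    queue: list = field(default_factory=list)  # positive task lengths in work units

    def load(self) -> float:
        """Estimated time (seconds) needed to drain the local queue."""
        return sum(self.queue) / self.speed


def sender_initiated_balance(nodes, threshold):
    """Let overloaded nodes (senders) push tasks to the least-loaded node.

    A node is treated as overloaded when its load exceeds `threshold` seconds.
    Requires at least two nodes. Returns the number of relocated tasks.
    """
    moved = 0
    for sender in nodes:
        while sender.load() > threshold and sender.queue:
            # The sender, not the receiver, starts the transfer.
            receiver = min((n for n in nodes if n is not sender), key=Node.load)
            task = min(sender.queue)  # assumption: smallest task is cheapest to move
            new_sender = sender.load() - task / sender.speed
            new_receiver = receiver.load() + task / receiver.speed
            if max(new_sender, new_receiver) >= sender.load():
                break  # the move would not reduce the larger of the two loads
            sender.queue.remove(task)
            receiver.queue.append(task)
            moved += 1
    return moved


def makespan(nodes):
    """Completion time of the busiest node."""
    return max(n.load() for n in nodes)


def average_utilization(nodes):
    """Mean fraction of the makespan during which a node is busy."""
    span = makespan(nodes)
    if span == 0:
        return 0.0
    return sum(n.load() for n in nodes) / (len(nodes) * span)
```

For three nodes with speeds 1, 1, and 2 and queues `[4, 3, 2, 1]`, `[]`, and `[2]`, a threshold of 5 relocates three tasks and the makespan falls from 10 to 4. A receiver-initiated or fault-aware variant would need additional state, such as node health, which this sketch leaves out.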
 

4.2 Methodical roadmap for open challenges

Addressing the challenges of scheduling and load balancing with fault tolerance requires a structured strategy that prioritizes work by influence and feasibility. Such a roadmap is presented in Fig. 17.
Fig. 17
Showing the proposed structured roadmap to address the cloud challenges

5 Conclusion

In this study, diverse models for analyzing faults and rectifying them through fault tolerance integrated with scheduling and load-balancing strategies in cloud environments are comprehensively surveyed. The main emerging methods for fault tolerance in the cloud environment are categorized as proactive, reactive, and resilient. Resilient approaches draw on AI/ML technologies and are observed to be more efficient than proactive and reactive techniques. This is because reactive and proactive techniques normally employ traditional procedures such as checkpoint-restart, replication, and migration, which may struggle to adapt dynamically to shifting demands and can therefore be inefficient during periods of high consumption.
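
To make the checkpoint-restart procedure named above concrete, the following minimal sketch saves progress after every work item and resumes from the saved state after a crash. It is an illustrative assumption, not the method of any surveyed paper: the function name, the JSON state file, and the `fail_at` fault injector are all hypothetical.

```python
import json
import os


def run_with_checkpoints(items, state_path, fail_at=None):
    """Sum `items`, saving a checkpoint after each processed item.

    `fail_at` is a hypothetical fault injector: the run raises RuntimeError
    just before processing that index, simulating a crash.
    """
    # Restart step: resume from the last checkpoint if one exists.
    if os.path.exists(state_path):
        with open(state_path) as fh:
            state = json.load(fh)
    else:
        state = {"next_index": 0, "total": 0}

    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"injected fault at item {i}")
        state["total"] += items[i]
        state["next_index"] = i + 1
        # Checkpoint step: write to a temporary file, then rename it over the
        # old checkpoint so a crash mid-write cannot corrupt the saved state.
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump(state, fh)
        os.replace(tmp_path, state_path)
    return state["total"]
```

Calling the function again with the same `state_path` after a failure continues from the first unprocessed item. The checkpoint interval here is fixed at one item, which is the kind of static choice that does not adjust to changing load.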
The review of the literature supports the following conclusions:
  • Checkpointing, restarting, and replication were found to be frequently used methods for addressing faults in the cloud.
  • Researchers are more concerned with detecting crash faults than hardware faults such as transient, intermittent, or permanent faults.
  • For implementing and evaluating the presented algorithms, researchers mostly use the CloudSim tool.
  • Proactive approaches have been used more frequently than reactive and resilient ones.
  • Researchers focus more on response time and less on makespan, adaptability, accuracy, and crashes.
  • Because the resilient approach uses machine learning and artificial intelligence to predict and handle faults, it represents the future direction of fault tolerance in the cloud.

Declarations

Ethical approval

Not applicable.
All the authors agreed to publish the manuscript.

Competing interest

The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Written by:

Sheikh Umar Mushtaq

received the B.C.A. and M.C.A. degrees in computer applications from the University of Kashmir, Srinagar, India. He is currently a Research Scholar in computer applications at Lovely Professional University, Phagwara, Punjab. He has presented and published papers at various conferences and has received best presentation awards. His research interests include scheduling, load balancing, fault tolerance, and cloud computing. Sheikh Umar is dedicated to making significant contributions to these academic and technological communities.

Sophiya Sheikh

received the B.Sc. degree in information technology from Maharshi Dayanand Saraswati University, Ajmer, India, the master’s degree in computer application from Rajasthan Technical University, Kota, and the Ph.D. degree from the Department of Computer Science, Central University of Rajasthan, India. She is currently an Associate Professor with Lovely Professional University, Phagwara, Punjab, India. She regularly writes articles and research papers in reputed national and international magazines and journals. Her research interests include grid/distributed computing and cloud computing. She is an editor of various books. She has organized various international conferences and workshops. She is a Potential Reviewer in various reputed journals, such as IEEE SYSTEMS JOURNAL, Cluster Computing, Scientific Reports, Journal of Supercomputing, Journal of Cloud Computing, and Concurrency and Computation.

Sheikh Mohammad Idrees

holds a Ph.D. in Computer Science and is currently working as a researcher at the Decentralized Systems Engineering Laboratory, Department of Computer Science, Norwegian University of Science and Technology (NTNU), Norway. He is an active computer science researcher and educator with a strong commitment to advancing the field of computer science. Throughout his career, he has made significant contributions to the academic community. He is recognized for his expertise in blockchain, decentralized finance, distributed ledger technologies, fintech, data analytics, big data, cloud computing, and machine learning. His work has been widely published in prestigious journals and conferences. In addition, he frequently serves as a book editor for leading publishers, including Springer and Taylor & Francis. He is the recipient of the ERCIM Postdoctoral Research Fellowship by the European Research Consortium for Informatics and Mathematics at NTNU, Norway.

Parvaz Ahmad Malla

is an Assistant Professor in Computer Applications in the Higher Education Department, Jammu and Kashmir, and is currently posted at Govt. College of Engineering and Technology, Safapora, Kashmir. He has a decade of teaching experience. His academic journey began with a Bachelor’s and Master’s in Computer Applications, followed by an MPhil, and he is currently pursuing a Ph.D. at Lovely Professional University. Specializing in Cloud Computing, Green Computing, and the Internet of Things, Parvaz is dedicated to advancing these fields through both teaching and research, making significant contributions to the academic and technological communities.
Literature
1.
go back to reference Prasad R, Rohokale V (2020) Cyber security: the lifeline of information and communication technology. Springer International Publishing, Cham, Switzerland Prasad R, Rohokale V (2020) Cyber security: the lifeline of information and communication technology. Springer International Publishing, Cham, Switzerland
2.
go back to reference Lowe D, Galhotra B (2018) An overview of pricing models for using cloud services with an analysis of the pay-per-use model. Int J Eng Technol 7(3.12):248–254 Lowe D, Galhotra B (2018) An overview of pricing models for using cloud services with an analysis of the pay-per-use model. Int J Eng Technol 7(3.12):248–254
3.
go back to reference I Odun-Ayo, M Ananya, F Agono, R Goddy-Worlu (2018) Cloud computing architecture: A critical analysis. In 2018 18th international conference on computational science and applications (ICCSA) (pp. 1–7). IEEE I Odun-Ayo, M Ananya, F Agono, R Goddy-Worlu (2018) Cloud computing architecture: A critical analysis. In 2018 18th international conference on computational science and applications (ICCSA) (pp. 1–7). IEEE
4.
go back to reference Mukwevho MA, Celik T (2018) Toward a smart cloud: A review of fault-tolerance methods in cloud systems. IEEE Trans Serv Comput 14(2):589–605 Mukwevho MA, Celik T (2018) Toward a smart cloud: A review of fault-tolerance methods in cloud systems. IEEE Trans Serv Comput 14(2):589–605
5.
go back to reference Alzakholi O, Shukur H, Zebari R, Abas S, Sadeeq M (2020) Comparison among cloud technologies and cloud performance. J Appl Sci Technol Trends 1(2):40–47 Alzakholi O, Shukur H, Zebari R, Abas S, Sadeeq M (2020) Comparison among cloud technologies and cloud performance. J Appl Sci Technol Trends 1(2):40–47
6.
go back to reference S Smys, R Bestak, Á Rocha (Eds.) (2019) Inventive Computation Technologies (Vol. 98). Springer Nature S Smys, R Bestak, Á Rocha (Eds.) (2019) Inventive Computation Technologies (Vol. 98). Springer Nature
7.
go back to reference U Samal, A Kumar (2023) "A software reliability model incorporating fault removal efficiency and it’s release policy." Comput Stat 1–19 U Samal, A Kumar (2023) "A software reliability model incorporating fault removal efficiency and it’s release policy." Comput Stat 1–19
8.
go back to reference U Samal, A Kumar (2024) "Empowering software reliability: Leveraging efficient fault detection and removal efficiency." Qual Eng 1–12 U Samal, A Kumar (2024) "Empowering software reliability: Leveraging efficient fault detection and removal efficiency." Qual Eng 1–12
9.
go back to reference Samal U, Kumar A (2024) A neural network approach for software reliability prediction. Int J Reliability Qual Safety Eng 31:2450009 Samal U, Kumar A (2024) A neural network approach for software reliability prediction. Int J Reliability Qual Safety Eng 31:2450009
10.
go back to reference Kumar S, Kushwaha DAS (2019) Future of fault tolerance in cloud computing. Think India J 22(17):6 Kumar S, Kushwaha DAS (2019) Future of fault tolerance in cloud computing. Think India J 22(17):6
11.
go back to reference Gupta V, Kaur BP, Jangra S (2019) An efficient method for fault tolerance in cloud environment using encryption and classification. Soft Comput 23(24):13591–13602 Gupta V, Kaur BP, Jangra S (2019) An efficient method for fault tolerance in cloud environment using encryption and classification. Soft Comput 23(24):13591–13602
12.
go back to reference MA Shahid (2022) "A systematic survey of simulation tools for cloud and mobile cloud computing paradigm." J Independent Stud Res Comput 20.1 MA Shahid (2022) "A systematic survey of simulation tools for cloud and mobile cloud computing paradigm." J Independent Stud Res Comput 20.1
13.
go back to reference Shahid, Asim M, Alam MM, Su’ud MM (2023) A systematic parameter analysis of cloud simulation tools in cloud computing environments. Appl Sci 13.15:8785 Shahid, Asim M, Alam MM, Su’ud MM (2023) A systematic parameter analysis of cloud simulation tools in cloud computing environments. Appl Sci 13.15:8785
14.
go back to reference Shahid MA, Islam N, Alam MM, Mazliham MS, Musa S (2021) Towards Resilient Method: An exhaustive survey of fault tolerance methods in the cloud computing environment. Comput Sci Rev 40:100398 Shahid MA, Islam N, Alam MM, Mazliham MS, Musa S (2021) Towards Resilient Method: An exhaustive survey of fault tolerance methods in the cloud computing environment. Comput Sci Rev 40:100398
15.
go back to reference Houssein EH, Gad AG, Wazery YM, Suganthan PN (2021) Task scheduling in cloud computing based on meta-heuristics: Review, taxonomy, open challenges, and future trends. Swarm Evol Comput 62:100841 Houssein EH, Gad AG, Wazery YM, Suganthan PN (2021) Task scheduling in cloud computing based on meta-heuristics: Review, taxonomy, open challenges, and future trends. Swarm Evol Comput 62:100841
16.
go back to reference Shafiq DA, Jhanjhi NZ, Abdullah A (2021) Load balancing techniques in cloud computing environment: A review. J King Saud Univ Comput Inform Sci 34:3910–3933 Shafiq DA, Jhanjhi NZ, Abdullah A (2021) Load balancing techniques in cloud computing environment: A review. J King Saud Univ Comput Inform Sci 34:3910–3933
18.
go back to reference Kumari P, Kaur P (2021) A survey of fault tolerance in cloud computing. J King Saud Univ Comput Inform Sci 33(10):1159–1176 Kumari P, Kaur P (2021) A survey of fault tolerance in cloud computing. J King Saud Univ Comput Inform Sci 33(10):1159–1176
19.
go back to reference Wijayanti A (2020) Critical analysis on legal aid regulation for marginal community based on legal language. TEST: Eng Manag 8(2):2806–2814 Wijayanti A (2020) Critical analysis on legal aid regulation for marginal community based on legal language. TEST: Eng Manag 8(2):2806–2814
21.
go back to reference Samal U, Kumar A (2024) Enhancing software reliability forecasting through a hybrid ARIMA-ANN model. Arab J Sci Eng 49(5):7571–7584 Samal U, Kumar A (2024) Enhancing software reliability forecasting through a hybrid ARIMA-ANN model. Arab J Sci Eng 49(5):7571–7584
22.
go back to reference Kumar P, Kumar R (2019) Issues and challenges of load balancing techniques in cloud computing: A survey. ACM Comput Surv (CSUR) 51(6):1–35 Kumar P, Kumar R (2019) Issues and challenges of load balancing techniques in cloud computing: A survey. ACM Comput Surv (CSUR) 51(6):1–35
23.
go back to reference SM Ataallah, SM Nassar, EE Hemayed (2015) Fault tolerance in cloud computing-survey. In 2015 11th International computer engineering conference (ICENCO) (pp. 241–245). IEEE SM Ataallah, SM Nassar, EE Hemayed (2015) Fault tolerance in cloud computing-survey. In 2015 11th International computer engineering conference (ICENCO) (pp. 241–245). IEEE
24.
go back to reference Mahallat I (2015) Fault-tolerance techniques in cloud storage: a survey. Int J Database Theory Appl 8(4):183–190 Mahallat I (2015) Fault-tolerance techniques in cloud storage: a survey. Int J Database Theory Appl 8(4):183–190
25.
go back to reference S Prathiba, S Sowvarnica (2017) Survey of failures and fault tolerance in cloud. In 2017 2nd International Conference on Computing and Communications Technologies (ICCCT) (pp. 169–172). IEEE S Prathiba, S Sowvarnica (2017) Survey of failures and fault tolerance in cloud. In 2017 2nd International Conference on Computing and Communications Technologies (ICCCT) (pp. 169–172). IEEE
26.
go back to reference Amin Z, Singh H, Sethi N (2015) Review on fault tolerance techniques in cloud computing. Int J Comput Appl 116(18):11–17 Amin Z, Singh H, Sethi N (2015) Review on fault tolerance techniques in cloud computing. Int J Comput Appl 116(18):11–17
27.
go back to reference Ragmani A, Elomri A, Abghour N, Moussaid K, Rida M, Badidi E (2020) Adaptive fault-tolerant model for improving cloud computing performance using artificial neural network. Procedia Comput Sci 170:929–934 Ragmani A, Elomri A, Abghour N, Moussaid K, Rida M, Badidi E (2020) Adaptive fault-tolerant model for improving cloud computing performance using artificial neural network. Procedia Comput Sci 170:929–934
28.
go back to reference Yao G, Ren Q, Li X, Zhao S, Ruiz R (2020) A hybrid fault-tolerant scheduling for deadline-constrained tasks in Cloud systems. IEEE Trans Serv Comput 15:1371–1384 Yao G, Ren Q, Li X, Zhao S, Ruiz R (2020) A hybrid fault-tolerant scheduling for deadline-constrained tasks in Cloud systems. IEEE Trans Serv Comput 15:1371–1384
29.
go back to reference Cheraghlou MN, Khadem-Zadeh A, Haghparast M (2016) A survey of fault tolerance architecture in cloud computing. J Netw Comput Appl 61:81–92 Cheraghlou MN, Khadem-Zadeh A, Haghparast M (2016) A survey of fault tolerance architecture in cloud computing. J Netw Comput Appl 61:81–92
30.
go back to reference Singh G, Kinger S (2013) A survey on fault tolerance techniques and methods in cloud computing. Int J Eng Res Technol 2(6):3335–3346 Singh G, Kinger S (2013) A survey on fault tolerance techniques and methods in cloud computing. Int J Eng Res Technol 2(6):3335–3346
31.
go back to reference Shah R, Veeravalli B, Misra M (2007) On the design of adaptive and decentralized load balancing algorithms with load estimation for computational grid environments. IEEE Trans Parallel Distrib Syst 18(12):1675–1686 Shah R, Veeravalli B, Misra M (2007) On the design of adaptive and decentralized load balancing algorithms with load estimation for computational grid environments. IEEE Trans Parallel Distrib Syst 18(12):1675–1686
32.
go back to reference JM Shah, K Kotecha, S Pandya, DB Choksi, N Joshi (2017) Load balancing in cloud computing: Methodological survey on different types of algorithm. In 2017 International Conference on Trends in Electronics and Informatics (ICEI) (pp. 100–107). IEEE JM Shah, K Kotecha, S Pandya, DB Choksi, N Joshi (2017) Load balancing in cloud computing: Methodological survey on different types of algorithm. In 2017 International Conference on Trends in Electronics and Informatics (ICEI) (pp. 100–107). IEEE
33.
go back to reference Mishra SK, Sahoo B, Parida PP (2020) Load balancing in cloud computing: a big picture. J King Saud Univ-Comput Inform Sci 32(2):149–158 Mishra SK, Sahoo B, Parida PP (2020) Load balancing in cloud computing: a big picture. J King Saud Univ-Comput Inform Sci 32(2):149–158
34.
go back to reference Rathore NK, Rawat U, Kulhari SC (2020) Efficient hybrid load balancing algorithm. Natl Acad Sci Lett 43(2):177–185MathSciNet Rathore NK, Rawat U, Kulhari SC (2020) Efficient hybrid load balancing algorithm. Natl Acad Sci Lett 43(2):177–185MathSciNet
35.
go back to reference AK Patra, A Nanda, S Panigrahi, AK Mishra (2020) Design of artificial pancreas based on fuzzy logic control in type-I diabetes patient. In Innovation in Electrical Power Engineering, Communication, and Computing Technology (pp. 557–569). Springer, Singapore AK Patra, A Nanda, S Panigrahi, AK Mishra (2020) Design of artificial pancreas based on fuzzy logic control in type-I diabetes patient. In Innovation in Electrical Power Engineering, Communication, and Computing Technology (pp. 557–569). Springer, Singapore
36.
go back to reference SL Peng, G Suseendran, D Balaganesh (2020) Intelligent computing and innovation on data science. Springer Singapore SL Peng, G Suseendran, D Balaganesh (2020) Intelligent computing and innovation on data science. Springer Singapore
37.
go back to reference Chinnaiah MR, Niranjan N (2018) Fault tolerant software systems using software configurations for cloud computing. J Cloud Comput 7(1):1–17 Chinnaiah MR, Niranjan N (2018) Fault tolerant software systems using software configurations for cloud computing. J Cloud Comput 7(1):1–17
38.
go back to reference H Arabnejad, C Pahl, G Estrada, A Samir, F Fowley (2017) A fuzzy load balancer for adaptive fault tolerance management in cloud platforms. In European Conference on Service-Oriented and Cloud Computing (pp. 109–124). Springer, Cham H Arabnejad, C Pahl, G Estrada, A Samir, F Fowley (2017) A fuzzy load balancer for adaptive fault tolerance management in cloud platforms. In European Conference on Service-Oriented and Cloud Computing (pp. 109–124). Springer, Cham
39.
go back to reference Edemo MK (2019) Developing fault tolerance architecture for real-time systems of cloud computing. Addis Ababa Science and Technology University, Addis Ababa, p 94 Edemo MK (2019) Developing fault tolerance architecture for real-time systems of cloud computing. Addis Ababa Science and Technology University, Addis Ababa, p 94
40.
go back to reference R Sana, B Harika, S Kumar (2020) Modeling for fault tolerance and scalability in cloud environment. 15 R Sana, B Harika, S Kumar (2020) Modeling for fault tolerance and scalability in cloud environment. 15
41.
go back to reference Abdulhamid SIM, Abd Latiff MS, Madni SHH, Abdullahi M (2018) Fault tolerance aware scheduling technique for cloud computing environment using dynamic clustering algorithm. Neural Comput Appl 29(1):279–293 Abdulhamid SIM, Abd Latiff MS, Madni SHH, Abdullahi M (2018) Fault tolerance aware scheduling technique for cloud computing environment using dynamic clustering algorithm. Neural Comput Appl 29(1):279–293
42.
go back to reference Sengupta S, Negi A (2019) Comparative Analysis of Contrast Enhancement Techniques for MRI Images. In International conference on Computer Networks, Big data and IoT (pp. 290–296). Springer, Cham Sengupta S, Negi A (2019) Comparative Analysis of Contrast Enhancement Techniques for MRI Images. In International conference on Computer Networks, Big data and IoT (pp. 290–296). Springer, Cham
43.
go back to reference P Jain (2019) A dynamic process for fault tolerance techniques in cloud computing (DPFT). J Gujrat Res Soc 21:10 P Jain (2019) A dynamic process for fault tolerance techniques in cloud computing (DPFT). J Gujrat Res Soc 21:10
44.
go back to reference P Gupta, PK Gupta (2020) Trust & fault in multi layered cloud computing architecture (pp. 77–93). Springer P Gupta, PK Gupta (2020) Trust & fault in multi layered cloud computing architecture (pp. 77–93). Springer
45.
go back to reference Madani S, Jamali S (2018) A comparative study of fault tolerance techniques in cloud computing. Int J Res Comput Appl Robot 6(3):7–15 Madani S, Jamali S (2018) A comparative study of fault tolerance techniques in cloud computing. Int J Res Comput Appl Robot 6(3):7–15
46.
go back to reference TJ Charity, GC Hua (2016) Resource reliability using fault tolerance in cloud computing. In 2016 2nd International Conference on Next Generation Computing Technologies (NGCT) (pp. 65–71). IEEE TJ Charity, GC Hua (2016) Resource reliability using fault tolerance in cloud computing. In 2016 2nd International Conference on Next Generation Computing Technologies (NGCT) (pp. 65–71). IEEE
47.
go back to reference G Vallee, K Charoenpornwattana, C Engelmann, A Tikotekar, C Leangsuksun, T Naughton, SL Scott (2008) A framework for proactive fault tolerance. In 2008 Third International Conference on Availability, Reliability and Security (pp. 659–664). IEEE G Vallee, K Charoenpornwattana, C Engelmann, A Tikotekar, C Leangsuksun, T Naughton, SL Scott (2008) A framework for proactive fault tolerance. In 2008 Third International Conference on Availability, Reliability and Security (pp. 659–664). IEEE
48.
go back to reference Saikia LP, Devi YL (2014) Fault tolerance techniques and algorithms in cloud computing. Int J Comput Sci Commun Networks 4(1):01–08 Saikia LP, Devi YL (2014) Fault tolerance techniques and algorithms in cloud computing. Int J Comput Sci Commun Networks 4(1):01–08
49.
go back to reference Essa YM (2016) A survey of cloud computing fault tolerance: techniques and implementation. Int J Comput Appl 138(13):34–38 Essa YM (2016) A survey of cloud computing fault tolerance: techniques and implementation. Int J Comput Appl 138(13):34–38
50.
go back to reference Khaldi M, Rebbah M, Meftah B, Smail O (2020) Fault tolerance for a scientific workflow system in a cloud computing environment. Int J Comput Appl 42(7):705–714 Khaldi M, Rebbah M, Meftah B, Smail O (2020) Fault tolerance for a scientific workflow system in a cloud computing environment. Int J Comput Appl 42(7):705–714
51.
go back to reference Xia Z, Zhu Y, Sun X, Qin Z, Ren K (2015) Towards privacy-preserving content-based image retrieval in cloud computing. IEEE Trans Cloud Comput 6(1):276–286 Xia Z, Zhu Y, Sun X, Qin Z, Ren K (2015) Towards privacy-preserving content-based image retrieval in cloud computing. IEEE Trans Cloud Comput 6(1):276–286
52.
go back to reference GP Sarmila, N Gnanambigai, P Dinadayalan (2015) Survey on fault tolerant—Load balancing algorithmsin cloud computing. In 2015 2nd International Conference on Electronics and Communication Systems (ICECS) (pp. 1715–1720). IEEE GP Sarmila, N Gnanambigai, P Dinadayalan (2015) Survey on fault tolerant—Load balancing algorithmsin cloud computing. In 2015 2nd International Conference on Electronics and Communication Systems (ICECS) (pp. 1715–1720). IEEE
53.
go back to reference K Kotecha, V Piuri, HN Shah, R Patel (Eds.) (2020) Data Science and Intelligent Applications: Proceedings of ICDSIA 2020. Springer K Kotecha, V Piuri, HN Shah, R Patel (Eds.) (2020) Data Science and Intelligent Applications: Proceedings of ICDSIA 2020. Springer
54.
go back to reference Dhingra M, Gupta N (2019) Algorithms to enhance the reliability of virtual nodes using adaptive fault tolerance techniques. Int J Innov Technol Exploring Engineering 8(11):515–519 Dhingra M, Gupta N (2019) Algorithms to enhance the reliability of virtual nodes using adaptive fault tolerance techniques. Int J Innov Technol Exploring Engineering 8(11):515–519
55.
go back to reference Singh S, Chana I (2016) Resource provisioning and scheduling in clouds: QoS perspective. J Supercomput 72(3):926–960 Singh S, Chana I (2016) Resource provisioning and scheduling in clouds: QoS perspective. J Supercomput 72(3):926–960
56.
go back to reference SU Mushtaq, S Sheikh, SM Idrees (2024) "Next-gen cloud efficiency: fault-tolerant task scheduling with neighboring reservations for improved cloud resource utilization." IEEE Access SU Mushtaq, S Sheikh, SM Idrees (2024) "Next-gen cloud efficiency: fault-tolerant task scheduling with neighboring reservations for improved cloud resource utilization." IEEE Access
57.
go back to reference TD Braun, HJ Siegel, N Beck, LL Bölöni, M Maheswaran, AI Reuther, ... RF Freund (2001) A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems. J Parallel Distributed Comput 61(6):810–837 TD Braun, HJ Siegel, N Beck, LL Bölöni, M Maheswaran, AI Reuther, ... RF Freund (2001) A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems. J Parallel Distributed Comput 61(6):810–837
58.
go back to reference Reda NM, Tawfik A, Marzok MA, Khamis SM (2015) Sort-Mid tasks scheduling algorithm in grid computing. J Adv Res 6(6):987–993 Reda NM, Tawfik A, Marzok MA, Khamis SM (2015) Sort-Mid tasks scheduling algorithm in grid computing. J Adv Res 6(6):987–993
59.
go back to reference Y Feng, D Li, H Wu, Y Zhang (2000) A dynamic load balancing algorithm based on distributed database system. In Proceedings Fourth International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region. 2:949–952. IEEE Y Feng, D Li, H Wu, Y Zhang (2000) A dynamic load balancing algorithm based on distributed database system. In Proceedings Fourth International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region. 2:949–952. IEEE
60.
go back to reference JC Patni, MS Aswal, OP Pal, A Gupta (2011) Load balancing strategies for grid computing. In 2011 3rd International Conference on Electronics Computer Technology 3:239–243. IEEE JC Patni, MS Aswal, OP Pal, A Gupta (2011) Load balancing strategies for grid computing. In 2011 3rd International Conference on Electronics Computer Technology 3:239–243. IEEE
61.
go back to reference Cao J, Spooner DP, Jarvis SA, Nudd GR (2005) Grid load balancing using intelligent agents. Futur Gener Comput Syst 21(1):135–149 Cao J, Spooner DP, Jarvis SA, Nudd GR (2005) Grid load balancing using intelligent agents. Futur Gener Comput Syst 21(1):135–149
62.
go back to reference Balasangameshwara J, Raju N (2012) A hybrid policy for fault tolerant load balancing in grid computing environments. J Netw Comput Appl 35(1):412–422 Balasangameshwara J, Raju N (2012) A hybrid policy for fault tolerant load balancing in grid computing environments. J Netw Comput Appl 35(1):412–422
63.
go back to reference Abohamama AS, Mohammed (2018) Fault Tolerance For Real Time Cloud Computing. Mansoura University, Diss Abohamama AS, Mohammed (2018) Fault Tolerance For Real Time Cloud Computing. Mansoura University, Diss
64.
go back to reference MK Gokhroo, MC Govil, ES Pilli (2017) Detecting and mitigating faults in cloud computing environment. In 2017 3rd International Conference on Computational Intelligence & Communication Technology (CICT) (pp. 1–9). IEEE MK Gokhroo, MC Govil, ES Pilli (2017) Detecting and mitigating faults in cloud computing environment. In 2017 3rd International Conference on Computational Intelligence & Communication Technology (CICT) (pp. 1–9). IEEE
65.
go back to reference Noor ASM, Zian NFM, Rahim NHA, Mamat R, Wan Azman WNA (2019) Novelty circular neighboring technique using reactive fault tolerance method. Int J Electrical Comput Eng 9(6):5211 Noor ASM, Zian NFM, Rahim NHA, Mamat R, Wan Azman WNA (2019) Novelty circular neighboring technique using reactive fault tolerance method. Int J Electrical Comput Eng 9(6):5211
66.
go back to reference C Engelmann, GR Vallee, T Naughton, SL Scott (2009) Proactive fault tolerance using preemptive migration. In 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing (pp. 252–257). IEEE C Engelmann, GR Vallee, T Naughton, SL Scott (2009) Proactive fault tolerance using preemptive migration. In 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing (pp. 252–257). IEEE
67.
go back to reference Rezaei Kalantari K, Ebrahimnejad A, Motameni H (2020) Presenting a new fuzzy system for web service selection aimed at dynamic software rejuvenation. Complex Intell Syst 6(3):697–710 Rezaei Kalantari K, Ebrahimnejad A, Motameni H (2020) Presenting a new fuzzy system for web service selection aimed at dynamic software rejuvenation. Complex Intell Syst 6(3):697–710
68.
go back to reference Bala A, Chana I (2012) Fault tolerance-challenges, techniques and implementation in cloud computing. Int J Comput Sci Issues (IJCSI) 9(1):288 Bala A, Chana I (2012) Fault tolerance-challenges, techniques and implementation in cloud computing. Int J Comput Sci Issues (IJCSI) 9(1):288
69.
go back to reference MG Rakesh, A Baunthiyal, AK Jain (n.d.) Preemptive fault tolerance in DDS based distributed system using application migration MG Rakesh, A Baunthiyal, AK Jain (n.d.) Preemptive fault tolerance in DDS based distributed system using application migration
70.
go back to reference Mohammed B, Awan I, Ugail H, Younas M (2019) Failure prediction using machine learning in a virtualised HPC system and application. Clust Comput 22(2):471–485 Mohammed B, Awan I, Ugail H, Younas M (2019) Failure prediction using machine learning in a virtualised HPC system and application. Clust Comput 22(2):471–485
71.
Battula SK, Garg S, Montgomery J, Kang B (2019) An efficient resource monitoring service for fog computing environments. IEEE Trans Serv Comput 13(4):709–722
72.
Park J, Yoo G, Lee E (2005) Proactive self-healing system based on multi-agent technologies. In: Third ACIS Int'l conference on software engineering research, management and applications (SERA'05). IEEE, pp 256–263
73.
B Mohammed (2019) A framework for efficient management of fault tolerance in cloud data Centres and high-performance computing systems: An investigation and performance analysis of a cloud based virtual machine success and failure rate in a typical cloud computing environment and prediction methods (Doctoral dissertation, University of Bradford)
74.
Nazari Cheraghlou M, Khademzadeh A, Haghparast M (2019) New fuzzy-based fault tolerance evaluation framework for cloud computing. J Netw Syst Manage 27(4):930–948
75.
Liakath JA, Krishnadoss P, Natesan G (2023) DCCWOA: A multi-heuristic fault tolerant scheduling technique for cloud computing environment. Peer-to-Peer Networking Appl 16:1–18
76.
Heyang X et al (2022) Fault tolerance and quality of service aware virtual machine scheduling algorithm in cloud data centers. J Supercomput 79:1–23
77.
Indhumathi R, Amuthabala K, Kiruthiga G, Yuvaraj N, Pandey A (2023) Design of task scheduling and fault tolerance mechanism based on GWO algorithm for attaining better QoS in cloud system. Wirel Pers Commun 128(4):2811–2829
78.
Zhu L et al (2021) A self-adapting task scheduling algorithm for container cloud using learning automata. IEEE Access 9:81236–81252
79.
Sheikh S, Nagaraju A, Shahid M (2021) A fault-tolerant hybrid resource allocation model for dynamic computational grid. J Comput Sci 48:101268
80.
Momenzadeh Z, Safi-Esfahani F (2019) Workflow scheduling applying adaptable and dynamic fragmentation (WSADF) based on runtime conditions in cloud computing. Futur Gener Comput Syst 90:327–346
81.
Al-Turjman F, Hasan MZ, Al-Rizzo H (2019) Task scheduling in cloud-based survivability applications using swarm optimization in IoT. Trans Emerg Telecommun Technol 30(8):e3539
82.
Meng S et al (2019) A fault-tolerant dynamic scheduling method on hierarchical mobile edge cloud computing. Comput Intell 35(3):577–598
83.
S Goutam, AK Yadav (2015) Preemptable priority based dynamic resource allocation in cloud computing with fault tolerance. In 2015 International Conference on Communication Networks (ICCN) (pp. 278–285). IEEE
84.
Abd Latiff MS (2017) A checkpointed league championship algorithm-based cloud scheduling scheme with secure fault tolerance responsiveness. Appl Soft Comput 61:670–680
85.
Liu J, Wei M, Hu W, Xu X, Ouyang A (2018) Task scheduling with fault-tolerance in real-time heterogeneous systems. J Syst Architect 90:23–33
86.
Thaman J, Singh M (2017) Cost-effective task scheduling using hybrid approach in cloud. Int J Grid Util Comput 8(3):241–253
87.
SU Mushtaq, S Sophiya (2023) A fault-tolerant resource reservation model in cloud computing. Recent Adv Comput Sci CRC Press 295–301
88.
A Simy et al (2012) Task scheduling algorithm with fault tolerance for cloud. 2012 Int Conf Comput Sci IEEE
89.
SU Mushtaq, S Sheikh, A Nain (2024) The response rank based fault-tolerant task scheduling for cloud system. 2023 1st International Conference on Advanced Informatics and Intelligent Information Systems (ICAI3S 2023). Atlantis Press
90.
Shahid MA, Alam MM, Su’ud MM (2023) Achieving reliability in cloud computing by a novel hybrid approach. Sensors 23(4):1965
91.
G Chen, H Jin, D Zou, BB Zhou, W Qiang, G Hu (2010) Shelp: Automatic self-healing for multiple application instances in a virtual machine environment. In 2010 IEEE International Conference on Cluster Computing (pp. 97–106). IEEE
92.
IP Egwutuoha, S Chen, D Levy, B Selic, R Calvo (2012) A proactive fault tolerance approach to High Performance Computing (HPC) in the cloud. In 2012 Second International Conference on Cloud and Green Computing (pp. 268–273). IEEE
93.
Bruneo D, Distefano S, Longo F, Puliafito A, Scarpa M (2013) Workload-based software rejuvenation in cloud systems. IEEE Trans Comput 62(6):1072–1085
94.
J Liu, J Zhou, R Buyya (2015) Software rejuvenation based fault tolerance scheme for cloud applications. In 2015 IEEE 8th International Conference on Cloud Computing (pp. 1115–1118). IEEE
95.
Sun D, Zhang G, Wu C, Li K, Zheng W (2017) Building a fault tolerant framework with deadline guarantee in big data stream computing environments. J Comput Syst Sci 89:4–23
96.
S Malik, F Huet (2011) Adaptive fault tolerance in real time cloud computing. In 2011 IEEE World Congress on services (pp. 280–287). IEEE
97.
LJL Zhang, J Zhang, J Fiaidhi, I Bojanova (2011) Cloud computing
98.
Wang J, Bao W, Zhu X, Yang LT, Xiang Y (2014) FESTAL: fault-tolerant elastic scheduling algorithm for real-time tasks in virtualized clouds. IEEE Trans Comput 64(9):2545–2558
99.
Shahid MA, Islam N, Alam MM, Su’ud MM, Musa S (2020) A comprehensive study of load balancing approaches in the cloud computing environment and a novel fault tolerance approach. IEEE Access 8:130500–130526
100.
Sidiroglou S, Laadan O, Perez C, Viennot N, Nieh J, Keromytis AD (2009) Assure: automatic software self-healing using rescue points. ACM SIGARCH Comput Archit News 37(1):37–48
102.
K Toshniwal, JM Conrad (2010) A web-based sensor monitoring system on a Linux-based single board computer platform. In Proceedings of the IEEE SoutheastCon 2010 (SoutheastCon) (pp. 371–374). IEEE
104.
Safara F, Souri A, Baker T, Al Ridhawi I, Aloqaily M (2020) PriNergy: A priority-based energy-efficient routing method for IoT systems. J Supercomput 76:8609–8626
105.
Asadi AN, Azgomi MA, Entezari-Maleki R (2020) Analytical evaluation of resource allocation algorithms and process migration methods in virtualized systems. Sustain Comput Inform Syst 25:100370
106.
Yuan H, Liu H, Bi J, Zhou M (2020) Revenue and energy cost-optimized biobjective task scheduling for green cloud data centers. IEEE Trans Autom Sci Eng 18:817–830
107.
Welsh T, Benkhelifa E (2020) On resilience in cloud computing: A survey of techniques across the Cloud Domain. ACM Comput Surv (CSUR) 53:1–36
110.
Shahid MA, Alam MM, Su’ud MM (2023) Improved accuracy and less fault prediction errors via modified sequential minimal optimization algorithm. PLoS One 18(4):e0284209
111.
Verma R, Chandra S (2022) HBI-LB: A dependable fault-tolerant load balancing approach for fog based internet-of-things environment. J Supercomput 79:1–19
112.
Tamilvizhi T, Parvathavarthini B (2019) A novel method for adaptive fault tolerance during load balancing in cloud computing. Clust Comput 22(5):10425–10438
113.
Attallah SM, Fayek MB, Nassar SM, Hemayed EE (2021) Proactive load balancing fault tolerance algorithm in cloud computing. Concurrency Comput: Practice and Experience 33(10):e6172
114.
T Mohmmed, N Abdalrahman (n.d.) A load balancing with fault tolerance algorithm for cloud computing. In 2020 International Conference on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE) (pp. 1–6). IEEE
115.
MR Sumalatha, C Selvakumar, T Priya, RT Azariah, PM Manohar (2014) CLBC-Cost effective load balanced resource allocation for partitioned cloud system. In 2014 International Conference on Recent Trends in Information Technology (pp. 1–5). IEEE
116.
Mohammed B, Kiran M, Awan IU, Maiyama KM (2016) An integrated virtualized strategy for fault tolerance in cloud computing environment. In 2016 Intl IEEE conferences on ubiquitous intelligence & computing, advanced and trusted computing, scalable computing and communications, cloud and big data computing, internet of people, and smart world congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld). IEEE. pp 542–549
117.
Naksinehaboon N, Mihaela P, Nassar R, Leangsuksun CB, Scott S (2009) High performance computing systems with various checkpointing schemes. Int J Comput Commun Control 4(4):386–400
118.
Das P, Khilar PM (2013) VFT: a virtualization and fault tolerance approach for cloud computing. In: 2013 IEEE conference on information & communication technologies. IEEE, pp 473–478
119.
Sachdeva R, Kakkar S (2017) A novel approach in cloud computing for load balancing using composite algorithms. Int J 7(2):51–56
120.
Babu KRR, Joy AA, Samuel (2017) Load balancing of tasks using hybrid technique with analytical method of esce & throttled algorithm. Int J Nov Res Dev 2(6):61–66
121.
122.
Subalakshmi S, Malarvizhi N (2017) Enhanced hybrid approach for load balancing algorithms in cloud computing. Int J Sci Res Comput Sci Eng Inform Technol 2(2):136–142
123.
S Dam, G Mandal, K Dasgupta, P Dutta (2015) Genetic algorithm and gravitational emulation based hybrid load balancing strategy in cloud computing. In Proceedings of the 2015 third international conference on computer, communication, control and information technology (C3IT) (pp. 1–7). IEEE
124.
Rathore J, Keswani B, Rathore VS (2018) An efficient load balancing algorithm for cloud environment. J Invent Comput Sci Commun Technol 4(1):37–41
125.
S Dam, G Mandal, K Dasgupta, P Dutta (2014) An ant colony based load balancing strategy in cloud computing. In Advanced Computing, Networking and Informatics-Volume 2 (pp. 403–413). Springer, Cham
126.
SG Domanal, GRM Reddy (2014) Optimal load balancing in cloud computing by efficient utilization of virtual machines. In 2014 sixth international conference on communication systems and networks (COMSNETS) (pp. 1–4). IEEE
127.
V Tailong, V Dimri (2016) Load balancing in cloud computing using modified optimize response time. Int J Adv Res Comput Sci Software Eng 6(5)
128.
AN Singh, S Prakash (2018) WAMLB: weighted active monitoring load balancing in cloud computing. In Big data analytics (pp. 677–685). Springer, Singapore
129.
S Ghosh, C Banerjee (2016) Priority based modified throttled algorithm in cloud computing. In 2016 international conference on inventive computation technologies (ICICT) (Vol. 3, pp. 1–6). IEEE
130.
Alamin MA, Elbashir MK, Osman AA (2017) A load balancing algorithm to enhance the response time in cloud computing. J Basic Appl Sci 2(2):473–490
134.
Latha, Padma VL, Sudhakar Reddy N, Suresh Babu A (2023) Optimizing scalability and availability of cloud based software services using modified scale rate limiting algorithm. Theor Comput Sci 943:230–240
135.
Yuan S et al (2023) Availability-aware virtual resource provisioning for infrastructure service agreements in the cloud. Inform Syst Front 25(4):1495–1512
136.
Wang C, Fu Z, Cui G (2019) A neural-network-based approach for diagnosing hardware faults in cloud systems. Adv Mech Eng 11(2):1687814018819236
137.
Shahid MA, Alam MM, Su’ud MM (2023) Performance evaluation of load-balancing algorithms with different service broker policies for cloud computing. Appl Sci 13(3):1586
138.
Mishra S, Scholar MT (2016) An Iwrr method based on efficient load balancing in cloud computing. Int J Recent Trends Eng Res 3(01):46–54
139.
Talwani S, Chana I (2017) Fault tolerance techniques for scientific applications in cloud. In 2017 2nd International Conference on Telecommunication and Networks (TEL-NET). IEEE, pp 1–5
Metadata
Title
In-depth analysis of fault tolerant approaches integrated with load balancing and task scheduling
Authors
Sheikh Umar Mushtaq
Sophiya Sheikh
Sheikh Mohammad Idrees
Parvaz Ahmad Malla
Publication date
17-10-2024
Publisher
Springer US
Published in
Peer-to-Peer Networking and Applications / Issue 6/2024
Print ISSN: 1936-6442
Electronic ISSN: 1936-6450
DOI
https://doi.org/10.1007/s12083-024-01798-5
