Introduction
With the rapid development of Internet of Things (IoT) technology, the data collected by various IoT devices are constantly generated and accumulated, giving rise to data-intensive applications. However, IoT devices have difficulty processing such applications locally due to their limited resources, and it is difficult to provide all the required resources and scheduling services for IoT users by relying on a single cloud.
The IoT architecture composed of a center cloud and edge clouds can be seen as multi-clouds [1]. For the data generated by IoT edge devices, corresponding storage devices in the edge cloud usually provide the rapid response required by edge computing. However, because the computing resources of the edge cloud are limited, the center cloud is also needed to meet the processing requirements of massive data. The multi-clouds, composed of the center cloud and the edge clouds, jointly complete multiple data-intensive workflow scheduling tasks. Unlike traditional workflows, a data-intensive workflow in the IoT environment is characterized by scattered data sources, large data scale, and distributed execution across the multi-clouds.
When executing this kind of workflow in a multi-clouds environment, many factors, such as business constraints on data privacy and long-distance data transmission, must be considered, which poses many challenges for data flow control management and data transmission scheduling.
Existing methods for data-intensive workflow scheduling mainly include cloud computing scheduling strategies based on heuristics, on workflow segmentation, and on reinforcement learning [2]. The heterogeneous distributed resource environment of cloud computing and the parallel task structure of data-intensive workflows together form a large state space. Reinforcement learning has powerful decision-making ability when dealing with such complex spaces, and it has therefore often been used in recent years as a powerful means of solving scheduling problems [3].
However, applying reinforcement learning to the scheduling of data-intensive scientific workflows in the cloud computing environment faces the following difficulties:
On the one hand, the time and cost of a data-intensive workflow mainly come from link transmission. Reducing link loss requires reducing the data dependence among data centers, which usually calls for segmenting the workflow structure. On the other hand, the state set of workflow scheduling is complex and suffers from the curse of dimensionality, so it is difficult to store all reward values in a table. This is solved by generalizing state vectors with a neural network: the state values are extracted by deep reinforcement learning (DRL) technology so as to achieve dimensionality reduction.
Therefore, this paper proposes a data-intensive workflow scheduling method based on deep reinforcement learning in multi-clouds. The contributions of this paper are summarized as follows:
(1)
To deal with the scheduling problem of data-intensive workflows, this paper proposes a workflow segmentation method that reduces data transmission between the partitions deployed in the cloud center and the edge clouds. By dividing the original workflow into several blocks of similar scale and low data dependence, the method provides an environmental state model foundation for the deep reinforcement learning scheduling algorithm in the subsequent sections.
(2)
This paper introduces a deep neural network into reinforcement learning and uses it to train the reinforcement learning process. Based on the DQN algorithm, the idea of bias correction is introduced: the variance of the current state reward is calculated to address the overestimation of the Q value. In addition, the reward function is improved so that the workflow scheduling results converge to a stable correlated equilibrium policy.
(3)
Finally, the open-source WorkflowSim simulation environment is used to evaluate the proposed method. Compared with traditional workflow scheduling methods, the experimental results show that the proposed method can effectively reduce the workflow execution time and improve load balancing.
The rest of this paper is organized as follows. The second section introduces related work. The third section gives the definitions involved in the scheduling method and the segmentation method for data-intensive workflows. The fourth section designs the workflow scheduling strategy in detail. The fifth section verifies the effectiveness of the method through experiments in the WorkflowSim simulation environment. Finally, the sixth section summarizes the paper.
Related work
Algorithms based on heuristic ideas can efficiently find approximately optimal solutions to the workflow scheduling problem in cloud computing, and they have been the mainstream type of cloud scheduling method in recent years. Their basic prototypes are ant colony optimization (ACO), particle swarm optimization (PSO), genetic algorithms (GA), etc. [4]. Literature [5] puts forward the heterogeneous earliest finish time (HEFT) algorithm, a constructive heuristic. The algorithm first sets the priority of each task in the workflow based on the average execution cost and the average transmission cost, and then allocates resources according to the task priority and the earliest finish time of each task on the virtual machines. Literature [6] puts forward the big cuckoo algorithm, which imitates the cuckoo's sojourning behavior and aims to minimize the turnaround time and maximize resource utilization; however, it fails to take the interaction of big data into account and is not suitable for data-intensive workflows. Literature [7] puts forward a multi-objective artificial bee colony algorithm, a swarm intelligence method that can reduce energy consumption, execution time, and cost and improve resource utilization, but it does not discuss mutually exclusive performance indicators, such as execution time versus overall cost, or provide a compromise solution between them. Literature [8] puts forward a hybrid PSO-HEFT algorithm focused on the high energy consumption of workflow scheduling in cloud computing systems; it can obtain a schedule that balances scheduling quality and energy consumption, but it is not suitable for scientific workflow scheduling oriented to data flows. Heuristic algorithms can find feasible solutions under the given constraints, but because they cannot estimate the deviation between a feasible solution and the optimal solution, they converge slowly and often fall into local optima, making it difficult to meet low-latency task requirements.
When a workflow imposes a heavy burden on data links, researchers usually use graph segmentation to minimize the data traffic between blocks, reducing the data coupling between sub-workflows and thus the link load between data centers [9]. There are two important principles for segmenting a workflow graph on a cloud computing platform: first, the data dependency between the segmented subgraphs should be as small as possible, which gives full play to the advantages of the distributed parallel computing of cloud platforms; second, the scale of each block should be as balanced as possible, which avoids a short-board effect in the workflow and improves system performance. Literatures [10-12] offload computing-intensive tasks to edge servers or the cloud for processing, maximizing the quality of user experience under resource constraints. Literature [13] uses cuckoo search (CS) to segment the workflow and a decision tree to allocate resources; although this method accelerates iterative convergence and shortens the execution time, the selected fitness function cannot describe the segmentation result well.
In recent years, reinforcement learning has often been used as a powerful means of solving scheduling problems. Its excellent decision-making ability in complex edge environments allows convergence to be accelerated by constantly correcting the deviation between feasible and better solutions. Literature [14] uses the Q-learning algorithm to match resources in online dynamic scheduling; the method is oriented to independent tasks and can noticeably shorten the average response time, but it is not suitable for workflows with priorities or dependencies, and it is difficult to predict and classify upcoming tasks. Literature [15] uses a reinforcement learning method to optimize the scheduling of the memory controller, improving the application running state and bandwidth utilization, and finally uses a cerebellar neural network to reduce the dimension of the state space, but it is not suitable for data-intensive workflow cloud scheduling. Literature [16], in order to improve task processing efficiency for the Internet of Vehicles (IoV), designs a CORA algorithm and uses a Markov decision process model to formulate the dynamic optimization problem. Literature [17] develops a scheduling algorithm based on a pointer network and reinforcement learning; the state set of the algorithm defines parameters such as execution time, virtual machine failure probability, communication cost, and associated tasks, and a state neural network based on these parameters is analyzed and constructed. Literature [18] uses a game-based method to offload computing-intensive tasks, aiming to minimize user costs and maximize server profits.
Due to its own limitations, reinforcement learning cannot deal with high-dimensional and continuous problems [19]. Deep learning focuses on the expression of perception and input and is good at discovering the characteristics of data. Because deep learning makes up for this shortcoming, deep reinforcement learning (DRL) combines the ability of deep neural networks to capture environmental characteristics with the decision-making ability of RL to solve complex system control problems, and it can use edge nodes as intelligent agents to learn scheduling strategies without global information about the environment. Aiming at the data transmission overhead of data-intensive workflows, as well as workflow scheduling optimization objectives such as execution time and load balance, this paper studies a workflow scheduling algorithm based on deep reinforcement learning.
Data-intensive workflow scheduling strategy
This paper presents a workflow scheduling method based on the DQN algorithm. On the basis of DQN, the model's reward function is redesigned and improved according to the characteristics of the research problem.
Multi-objective optimal scheduling
In order to develop the multi-objective optimal scheduling method based on deep reinforcement learning, this paper makes the following assumptions [23]:
-
① Each task can only be performed by one cloud host;
-
② The running time of a task is the time interval between its start and end;
-
③ The delay time of resource supply or cancellation is not considered;
-
④ The delay of data transmission between tasks is not considered.
This paper considers two QoS indexes, the makespan of the workflow and the load balance, as the goals of cloud workflow scheduling, which makes it a bi-objective optimization problem. The goal of the scheduling optimization algorithm can be expressed as follows:
$$\left\{ \begin{aligned} & f(x) = \min \left( Tw_{total} ,\ Sl_{total} \right) \\ & S(p_{i} ) = r_{m} ,\quad p_{i} \in P,\ r_{m} \in dc_{p} \end{aligned} \right.$$
Among them, \(Tw_{total}\) is the execution time of the IoT data workflow and \(Sl_{total}\) is the average load of the system. \(S(p_{i} )\) is the constraint condition of the algorithm, where \(p_{i}\) is a sub-workflow with business constraints, \(dc_{p}\) is the service node it is assigned to, and \(r_{m}\) is the resources required by sub-workflow \(p_{i}\) on service node \(dc_{p}\).
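To make the bi-objective formulation concrete, the following minimal Python sketch evaluates \(Tw_{total}\) and \(Sl_{total}\) for one candidate task-to-host assignment. The `Task` structure, the serial per-host timing model, and the function names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: evaluating the two scheduling objectives for one candidate
# assignment. Names (Task, evaluate_objectives) and the serial per-host load
# model are illustrative assumptions.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Task:
    id: int
    runtime: float        # execution time on its assigned host

def evaluate_objectives(assignment: Dict[int, int], tasks: List[Task],
                        num_hosts: int):
    """Return (Tw_total, Sl_total): workflow makespan and average host load."""
    host_busy = [0.0] * num_hosts
    for t in tasks:
        host_busy[assignment[t.id]] += t.runtime   # tasks run serially per host
    tw_total = max(host_busy)                      # makespan of the workflow
    sl_total = sum(host_busy) / num_hosts          # average system load
    return tw_total, sl_total

# Example: three tasks on two hosts
tasks = [Task(0, 4.0), Task(1, 2.0), Task(2, 3.0)]
print(evaluate_objectives({0: 0, 1: 1, 2: 1}, tasks, num_hosts=2))  # (5.0, 4.5)
```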
These two optimization objectives are abstracted into two agents. Each agent is based on the DQN algorithm and carries out an adaptive learning and self-optimization process through interaction with the environment and with the other agent.
Workflow scheduling method based on improved DQN
DQN
The DQN algorithm is a popular method in the field of deep reinforcement learning. Its main modules include an environment module, a loss function, an experience replay module, and two neural networks with the same structure but different parameters: the estimated value network and the target value network [24].
The DQN algorithm learns the action value function Q* corresponding to the optimal strategy by minimizing the loss,
$$\begin{aligned} l\left( \theta \right) &= \mathbb{E}_{s,a,r,s^{\prime}} \left[ \left( Q^{*} \left( s,a|\theta \right) - y \right)^{2} \right] \\ y &= r + \gamma \mathop {\max }\limits_{a^{\prime}} Q^{*} \left( s^{\prime},a^{\prime}|\theta^{-} \right) \end{aligned}$$
Among them, \(y\) is the target Q function; its parameters \(\theta^{-}\) are updated periodically from the latest online parameters \(\theta\), which helps stabilize learning.
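As an illustration of the loss above, here is a minimal PyTorch sketch of the estimated (online) network, the target network, and \(l(\theta)\). The network architecture and hyperparameters are assumptions for demonstration, not the paper's configuration.

```python
# Minimal PyTorch sketch of the DQN loss: the online network Q(s,a|theta) is
# regressed toward a target y built from a periodically synchronized target
# network Q(s,a|theta^-). Sizes are illustrative.
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 8, 4, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                           nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())    # theta^- <- theta

def dqn_loss(s, a, r, s_next, done):
    """s: [B, state_dim]; a: [B] long; r, done: [B] float."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q*(s,a|theta)
    with torch.no_grad():                                  # no grad via target
        # (1 - done) is the standard handling of terminal states
        y = r + gamma * (1 - done) * target_net(s_next).max(1).values
    return nn.functional.mse_loss(q_sa, y)                 # l(theta)
```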
Whereas traditional Q-learning uses a Q-table to store the Q value of each state-action pair, DQN uses a neural network to extract complex features and generate the Q values [25]. The estimated Q-value network is used to predict the estimated Q value; its input comes from the latest parameters of the current environment, and its parameters are updated at every iteration. \(\theta\) is the weight, and \(Q^{*} (s,a|\theta )\) represents the output of the current estimated value network. The parameters of the target Q-value network are updated only at intervals, and \(\max_{a^{\prime}} Q^{*} (s^{\prime},a^{\prime}|\theta^{-})\) denotes its output. The training goal of the neural network is to optimize the loss function constructed from these two Q values, and the parameters of the estimated Q-value network are then updated by stochastic gradient descent through back propagation. After a certain number of iterations, the parameters of the estimated Q-value network are copied to the target Q-value network. This reduces the correlation between the estimated Q value and the target Q value to some extent, making divergence or oscillation less likely and thus improving the stability of the algorithm [26, 27].
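Continuing the previous sketch, the following hypothetical fragment shows the experience replay module and the periodic parameter copy just described. The buffer capacity and synchronization interval are illustrative choices.

```python
# Sketch of the two mechanisms described above: a replay buffer that breaks
# sample correlation, and a periodic copy of the online parameters into the
# target network. Capacity and sync interval are illustrative.
import random
from collections import deque

replay_buffer = deque(maxlen=10_000)   # experience replay module
SYNC_EVERY = 500                       # copy theta -> theta^- every 500 steps

def store(transition):
    """transition = (s, a, r, s_next, done)"""
    replay_buffer.append(transition)

def sample_batch(batch_size=32):
    return random.sample(replay_buffer, batch_size)

def maybe_sync_target(step, q_net, target_net):
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())
```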
The neural network module in DQN overcomes the curse of dimensionality faced by single-agent reinforcement learning, and it balances the contradiction between exploration and exploitation to some extent through the target value network, the experience replay pool, and an exploration mechanism based on the \(\varepsilon\)-greedy method.
DQN-RL
In this paper, we propose a DQN method based on bias correction. Q values are obtained from the outputs of the current online value network and of several saved historical online value network models. The variance of these multiple estimates of the current state reward is calculated, and a bias correction term derived from this variance is applied to the target Q-value formula, which alleviates the overestimation of the Q value to some extent.
The formula for calculating the target Q value in DQN is shown in the following equation:
$$y_{t} = r_{t} + \gamma \mathop {\max }\limits_{a} Q(s_{t + 1} ,a;\theta^{ - } )$$
The improved Q-value calculation formula:
$$y_{DQN\text{-}RL} = r_{t} + \gamma \mathop {\max }\limits_{a} Q(s_{t + 1} ,a;\theta^{ - } ) - B(s,a,r)$$
In the above equation, \(r_{t}\) denotes the estimate of the t-th immediate reward, \(\theta\) denotes the parameters of the saved historical online value network models, \(a\) denotes the saved actions in the experience data, and \(B(s,a,r)\) is the bias correction term.
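Since the exact functional form of \(B(s,a,r)\) is not given here, the following sketch illustrates one plausible reading of the description above: Q estimates of the same states are collected from several saved historical online networks, and a penalty proportional to their spread is subtracted from the target. The coefficient `KAPPA` and the standard-deviation form of the penalty are assumptions.

```python
# Hedged sketch of the bias-correction term B(s,a,r): derived from the spread
# of Q values across saved historical online networks. The std-based penalty
# and KAPPA are illustrative assumptions, not the paper's exact formula.
import copy
import torch

history = []                 # periodically saved snapshots of the online net
KAPPA = 0.5                  # correction strength (assumed hyperparameter)

def snapshot(q_net):
    history.append(copy.deepcopy(q_net).eval())

def corrected_target(r, s_next, done, target_net, gamma=0.99):
    """Assumes at least two snapshots have been saved via snapshot()."""
    with torch.no_grad():
        q_max = target_net(s_next).max(1).values
        # Q estimates of the same next states from each historical snapshot
        ests = torch.stack([h(s_next).max(1).values for h in history])
        b = KAPPA * ests.std(dim=0)          # B(s,a,r): spread-based penalty
        return r + gamma * (1 - done) * q_max - b
```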
Multi-agent learning systems usually face the challenges of hard-to-determine learning objectives, unstable learning, and coordination. In this paper, the reward function is improved so that the workflow scheduling results converge to a stable correlated equilibrium policy.
The state space in this paper is represented by a vector \(Vector = [s_{1} ,s_{2} ,...,s_{i} ,...,s_{n} ]\), where n is the number of tasks in the workflow and the index in the vector represents the \(ID_{{t_{id} }}\) of each task. \(s_{i}\) is an integer representing the state of the i-th task: -1 means the task has been executed; -2 means the task can be executed; -3 means a predecessor of the task has not been executed, i.e., the execution conditions are not met; and a value from 0 to m means the task is being executed by the virtual machine with that id [28].
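The following sketch builds this state vector from per-task status flags using the codes just described. The task representation (`done`, `running_vm`, `preds_done`) is a hypothetical encoding for illustration.

```python
# Sketch of the state vector described above. Codes follow the text:
# -1 executed, -2 ready, -3 predecessors unfinished, 0..m = executing VM id.
def build_state_vector(tasks):
    """tasks: list of dicts with keys 'done', 'running_vm', 'preds_done'."""
    vec = []
    for t in tasks:
        if t["done"]:
            vec.append(-1)                  # already executed
        elif t["running_vm"] is not None:
            vec.append(t["running_vm"])     # being executed: VM id in 0..m
        elif t["preds_done"]:
            vec.append(-2)                  # ready to execute
        else:
            vec.append(-3)                  # predecessors not yet finished
    return vec

# Example: task 0 finished, task 1 running on VM 2, task 2 waiting on task 1
print(build_state_vector([
    {"done": True,  "running_vm": None, "preds_done": True},
    {"done": False, "running_vm": 2,    "preds_done": True},
    {"done": False, "running_vm": None, "preds_done": False},
]))  # [-1, 2, -3]
```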
A suitable reward function design can ensure the stability and convergence of the algorithm in multi-agent learning scenarios. For the agent responsible for the maximum completion time (makespan), the reward function designed in this paper is as follows,
$$w_{1} = \left[ \frac{ET_{k,i,j} (a) - (makespan' - makespan)}{ET_{k,i,j} (a)} \right]^{3}$$
The reward function of load balancing is as follows,
$$w_{2} = \left[ \frac{ET_{k,i,j} (a) - (r' - r)}{ET_{k,i,j} (a)} \right]$$
The value ranges of the execution time reward function \({\text{w}}_{1}\) and the load balancing reward function \({\text{w}}_{2}\) both fall within [0,1]. For \(w_{1}\), the smaller the increase in execution time caused by a scheduling update, the closer the reward value is to 1; otherwise, it approaches 0. Similarly, for \(w_{2}\), the smaller the added load imbalance, the more desirable the strategy and the closer its reward value is to 1; otherwise, there is no reward and the value is 0.
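A minimal sketch of the two reward functions, assuming \(ET_{k,i,j}(a)\) and the increases in makespan and load are available as plain numbers. The explicit clamp to [0,1] reflects the stated value range and is an assumption where the text leaves it implicit.

```python
# Sketch of the rewards w1 and w2 above. et is ET_{k,i,j}(a) for the chosen
# action; the deltas are (makespan' - makespan) and (r' - r). Clamping to
# [0, 1] is an assumption matching the stated value range.
def reward_makespan(et: float, delta_makespan: float) -> float:
    w1 = ((et - delta_makespan) / et) ** 3      # cubed ratio from the paper
    return min(max(w1, 0.0), 1.0)

def reward_load(et: float, delta_load: float) -> float:
    w2 = (et - delta_load) / et                 # linear ratio from the paper
    return min(max(w2, 0.0), 1.0)

# Small added makespan -> reward near 1; large added load -> reward near 0
print(reward_makespan(10.0, 1.0))   # ((10-1)/10)^3 = 0.729
print(reward_load(10.0, 8.0))       # (10-8)/10   = 0.2
```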
In reinforcement learning, the action with the largest Q value is selected every time, i.e., a greedy strategy performs the action selection [29, 30]. However, in the initial stage of learning, the agent has not yet mastered the Q values, so it needs to explore unknown actions randomly; after a period of learning, it accumulates reasonably accurate Q values. At this point, whether to continue exploring unknown actions or to exploit the action with the currently largest Q value is the exploration-exploitation balance problem faced by reinforcement learning. To solve this problem, this paper uses a variable \(\varepsilon\)-greedy strategy: at the beginning, \(\varepsilon\) is set to a larger value, such as 0.9, to give the model more opportunities to explore; as the number of training rounds increases and the learning ability of the model becomes stronger, the updated state-action values improve, the value of \(\varepsilon\) is gradually reduced, and the learned Q values are increasingly used to choose the best action.
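A short sketch of this variable \(\varepsilon\)-greedy schedule; the decay rate and floor value are illustrative assumptions.

```python
# Sketch of decaying epsilon-greedy: start with epsilon = 0.9 (exploration),
# then decay it so learned Q values increasingly drive action selection.
# Decay schedule and floor are illustrative.
import random

EPS_START, EPS_END, EPS_DECAY = 0.9, 0.05, 0.995

def select_action(q_values, epsilon):
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit

epsilon = EPS_START
for episode in range(1000):
    # ... run one scheduling episode using select_action(q, epsilon) ...
    epsilon = max(EPS_END, epsilon * EPS_DECAY)   # gradually reduce epsilon
```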
Conclusion
In the cloud-edge collaborative environment, IoT data workflows have a large amount of data and scattered data sources, so the data dependence among their tasks is complex and data transmission during scheduling is inevitable. This paper adopts a method based on deep reinforcement learning to optimize the multi-objective scheduling of data-intensive workflows: it first segments the data-intensive workflows and then uses the improved DQN algorithm to schedule the resulting sub-workflows. The experimental evaluation shows that this method can effectively reduce the execution time of the data workflows, improve the quality of service, and make the average load of each node more balanced, so that the system works more stably.