To briefly introduce our preliminary work on ICSM: the input of the ICSM framework is the user’s intent about the cloud service, i.e., the functional and non-functional service-layer requirements. The ICSM framework translates the user’s intent into a Resource Descriptor, which is the output of the framework. The Resource Descriptor includes two types of resource information: (1) the cloud resource composition that meets the functional requirements, i.e., the service type, security, and reliability requirements, etc., and (2) the resource amount that meets the performance requirement. The Resource Descriptor can be generated in a resource orchestrator-compatible format, enabling seamless deployment from intent to resource provisioning.
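As an illustration, a Resource Descriptor might be structured as in the following Python sketch; the field names and values are hypothetical and are not prescribed by ICSM:

```python
import json

# Hypothetical sketch of a Resource Descriptor; the field names are
# illustrative only and not part of the ICSM specification.
resource_descriptor = {
    # (1) Cloud resource composition meeting the functional requirements
    "composition": {
        "service_function_chain": ["firewall", "load_balancer", "web_server"],
        "security": {"transport_encryption": "TLS 1.3"},
        "reliability": {"redundant_instances": 2},
    },
    # (2) Resource amounts meeting the performance requirement
    "amounts": {
        "web_server": {"vcpus": 4, "memory_gb": 8, "storage_gb": 100},
    },
}

# Serializing to JSON stands in for an orchestrator-compatible format.
print(json.dumps(resource_descriptor, indent=2))
```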
There are three main functional blocks in the ICSM framework, namely the Requirement Parser, the Resource Composer, and the Resource Designer Function (RDF). The Requirement Parser is responsible for parsing the user’s intent into atomic cloud requirements related to functionality, security, reliability, and performance. The Resource Composer decides the composition of the Service Function Chain (SFC) in accordance with the functionality, security, and reliability requirements. The RDF is responsible for deciding the cloud computation resource amount in accordance with the performance requirement. The functional blocks can also be used separately; in particular, RDF can be used as a stand-alone function as long as the performance-related intent is available.
In the remainder of this section, we illustrate the detailed realization of RDF, which translates the user’s intent about performance into the necessary resource amount. We first analyze the factors that need to be considered for resource design (section IV-1), then propose the architecture of RDF (section IV-2), and finally introduce the intent breach prevention mechanism of RDF (section IV-3).
Factors affecting decisions on resource amount
In this section, we analyze the factors that need to be taken into consideration when deciding the cloud resource amounts, i.e., deciding the amounts of vCPUs, virtual memory, storage, etc. to be allocated to each instance to meet the performance intent.
(1)
Workload and application performance requirements
The cloud user processes their application workload in the cloud environment and requires that the performance requirements be met. Workload here refers to the type, features, and amount of processing. The performance requirements can be divided into two main categories: the processing time restriction and the processing percentage restriction. Table 2 gives two examples of workloads and the corresponding performance requirements. For a given workload, the computation resources allocated to the VM instances directly affect the processing time and percentage achieved. Thus, the workload and the performance requirements must be considered when deciding the resource amount allocated to VMs.
Table 2
Examples of workloads and performance requirements

Workload type | Workload features | Workload amount | Processing time restriction | Processing percentage restriction
Web server | Web page size, etc. | Requests per second | Keep average process time of requests under 1 s | Successful request rate per second
Neural network | Layers, neurons, activation functions, etc. | Number of pixels, number of images, etc. | Keep training time under 10 s | ...
(2)
Environmental conditions
Environmental conditions in this work are the conditions of the physical host to which the VM is allocated. Static environmental conditions include the host’s CPU clock speed, memory architecture, etc.; dynamic conditions include the host’s resource usage. Because the static environmental conditions vary little and are not subject to change over a relatively long time span for a given cloud provider, we focus on dynamic environmental conditions in this work. Changes in dynamic environmental conditions, such as the host’s resource usage, affect the performance of VMs. On the basis of this analysis, environmental conditions clearly need to be considered when the resource amount is decided in order to satisfy the performance requirements.
(3)
Virtual machine performance requirements
Virtual machine performance requirements serve to enhance other qualities of the cloud-based service/application, e.g., its stability. The cloud user can optionally set virtual machine performance requirements. In this work we focus on a typical one: the VM resource usage restriction, which requires the resource usage inside VMs to stay within a desired range. For instance, restricting the resource usage inside the VM to, e.g., 50%–80% prevents resources from being overused or underused, and thus improves resource efficiency, avoids service interruption, and prevents bottlenecks from occurring. To meet the VM resource usage restrictions, suitable amounts of resources must be allocated to VMs.
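As a minimal illustration of checking such a restriction against monitored usage samples (the helper name and the 50%–80% bounds follow the example above and are our own):

```python
def usage_within_restriction(usage_samples, low=0.5, high=0.8):
    """Return True if every monitored VM resource usage sample
    (as a fraction of the allocated resources) lies within [low, high]."""
    return all(low <= u <= high for u in usage_samples)

# Example: CPU usage samples collected inside a VM.
print(usage_within_restriction([0.55, 0.72, 0.68]))  # True
print(usage_within_restriction([0.45, 0.91]))        # False: under- and overuse
```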
Based on the analysis, and to design the resources in accordance with the performance-related intent, we propose an RDF whose inputs are the workload, the performance requirements, and the environmental conditions, and whose output is the resource configuration that meets the performance intent.
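The interface of RDF can thus be sketched as follows; the function and parameter names are hypothetical and serve only to make the input/output relation concrete:

```python
from typing import Any, Dict

def design_resources(
    workload: Dict[str, Any],                  # type, features, and amount of processing
    performance_requirements: Dict[str, Any],  # e.g., {"max_process_time_s": 10}
    environmental_conditions: Dict[str, Any],  # e.g., host resource usage
) -> Dict[str, Any]:
    """Hypothetical RDF entry point: map the performance-related intent to a
    resource configuration (e.g., vCPUs, memory, storage per instance)."""
    raise NotImplementedError  # realized via the trained regression models
```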
Intent breach prevention mechanism of RDF
In commercial cloud service delivery, intent is an important aspect of an SLA (Service Level Agreement). An intent breach happens when the user’s intent is not met by the provided cloud service and resources. If the intent is breached, the cloud provider needs to refund an intent breach penalty cost to the user. For instance, suppose the cloud user and the cloud provider agree that the cloud resources must provide the cloud service with performance no worse than the performance intent int. If the performance becomes worse than int, the cloud provider needs to refund a breach penalty p to the cloud user. Thus, a mechanism that prevents intent breaches and thereby increases the user’s satisfaction is necessary, especially in commercial cloud service delivery. In this section, we first introduce the typical intent breach penalty patterns. We then introduce the N-mode of RDF, which adopts an existing unbiased loss function to train the RDF model, and discuss its drawbacks in preventing intent breaches, which motivate the design of the intent breach prevention mechanism. After that, we introduce the proposed intent breach prevention mechanism, P-mode, and illustrate the biased loss functions we propose for P-mode, which simultaneously retain the precision of the RDF model and lower the risk of intent breach. The validation results for N-mode and P-mode are presented in section VI.
On the basis of interviews with the cloud service operator, we find that the intent breach penalty can be categorized into the following three types, according to how the breach penalty p is calculated: with respect to the number of times an intent breach occurs, the breach extent, or the breach duration.
-
Breach times penalty (BTP)
-
Breach extent penalty (BEP)
-
Breach duration penalty (BDP)
For BTP, each time the intent is breached, i.e., the real performance of the cloud service is worse than the intended performance (intent), a fixed penalty α is imposed on the cloud provider; otherwise no penalty is imposed. Thus the penalty for a cloud service order is formulated as follows:
$$ p=\begin{cases}0, & \text{if } perf \text{ is no worse than } int\\ \alpha, & \text{if } perf \text{ is worse than } int\end{cases} $$
(1)
where perf is the real cloud performance, int is the performance intent, and α is a constant that specifies the penalty when an intent breach happens.
For BEP, each time the intent is breached, i.e., the real performance of the cloud service is worse than the intended performance (intent), a penalty is imposed on the cloud provider according to the percentage difference between the real performance and the performance intent, i.e., the intent breach extent. Otherwise no penalty is imposed on the cloud provider. Thus for a cloud service order, the penalty is formulated as:
$$ p=\begin{cases}0, & \text{if } perf \text{ is no worse than } int\\ \beta\left|perf-int\right|/perf, & \text{if } perf \text{ is worse than } int\end{cases} $$
(2)
where β is a constant that specifies how much penalty is imposed according to the breach extent when an intent breach happens.
For BDP, each time the intent is breached, a penalty is imposed on the cloud provider according to the duration of the intent breach. Thus for a cloud service order, the penalty is formulated as:
$$ p=\begin{cases}0, & \text{if } perf \text{ is no worse than } int\\ \gamma \ast dur, & \text{if } perf \text{ is worse than } int\end{cases} $$
(3)
where dur is the intent breach duration, and γ is a constant that specifies how much penalty is imposed according to the breach duration when an intent breach happens.
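The three penalty patterns in (1)–(3) can be made concrete with the following sketch; it assumes a “smaller is better” performance metric such as process time (so the intent is breached when perf > int), and the constants α, β, γ are example values:

```python
# Breach penalty patterns (1)-(3); a "smaller is better" metric is
# assumed, e.g., process time, so a breach means perf > intent.

def btp(perf, intent, alpha=1.0):
    """Breach times penalty (1): fixed penalty alpha per breach."""
    return alpha if perf > intent else 0.0

def bep(perf, intent, beta=1.0):
    """Breach extent penalty (2): proportional to the relative breach extent."""
    return beta * abs(perf - intent) / perf if perf > intent else 0.0

def bdp(perf, intent, dur, gamma=1.0):
    """Breach duration penalty (3): proportional to the breach duration."""
    return gamma * dur if perf > intent else 0.0

print(btp(10.1, 10.0), bep(10.1, 10.0), bdp(10.1, 10.0, dur=10.1))
```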
To prevent intent breaches, it is necessary to allocate sufficient resources to ensure that the real cloud performance is better than the intent. Meanwhile, from the resource efficiency perspective, allocating excessive resources increases the real performance but also increases the cost. Thus it is necessary for RDF to ensure that the real performance is better than the intent while remaining close to it.
As mentioned in section IV-2, in the knowledge abstraction phase, RDF trains a set of regression models on the basis of the collected log data. The feature vector includes the workload information, the environmental condition information, and the resource configuration information, and the objective vector includes the application performance and the virtual machine performance.
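As a sketch of this knowledge abstraction step (the toy feature layout and the choice of scikit-learn’s GradientBoostingRegressor are our assumptions, not prescribed by RDF):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical log data: each row holds workload, environmental condition,
# and resource configuration features; here three toy columns:
# workload size (GB), host CPU usage (fraction), allocated vCPUs.
X = np.array([[100, 0.3, 4], [100, 0.7, 4], [50, 0.3, 2], [200, 0.5, 8]])
# Objective: observed application performance (process time in seconds).
y = np.array([10.1, 12.4, 9.8, 11.0])

model = GradientBoostingRegressor().fit(X, y)
print(model.predict([[100, 0.4, 4]]))  # inferred performance h_i
```

Note that such off-the-shelf regressors minimize an unbiased loss by default, which corresponds to the N-mode described next.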
If we apply existing “unbiased” loss functions such as MAE, MSE, and MAPE, shown in (4), (5), and (6) respectively, to train the model, RDF is set in N-mode [22-24]. Note that “unbiased” here means that, for the same real performance $perf_i$, whether $perf_i$ is worse than the inferred performance $h_i$ or better than it, the model takes the loss to be the same as long as the absolute difference $|perf_i - h_i|$ between the real and inferred performance is identical.
$$ L_{MAE}=\frac{1}{m}\sum_{i=0}^{m-1}\left|perf_i-h_i\right| $$
(4)
$$ L_{MSE}=\frac{1}{m}\sum_{i=0}^{m-1}\left|perf_i-h_i\right|^2 $$
(5)
$$ L_{MAPE}=\frac{1}{m}\sum_{i=0}^{m-1}\left|perf_i-h_i\right|/perf_i $$
(6)
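For concreteness, the unbiased losses (4)–(6) can be computed with plain NumPy as follows:

```python
import numpy as np

def mae(perf, h):
    return np.mean(np.abs(perf - h))         # (4)

def mse(perf, h):
    return np.mean((perf - h) ** 2)          # (5)

def mape(perf, h):
    return np.mean(np.abs(perf - h) / perf)  # (6)

perf = np.array([10.1, 9.8])  # real performance perf_i
h = np.array([9.9, 10.0])     # inferred performance h_i
print(mae(perf, h), mse(perf, h), mape(perf, h))
```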
However, in the case of service delivery, inferences may lead to different risk levels of intent breach even if the absolute difference $|perf_i - h_i|$ between the real performance and the inferred performance is the same, depending on whether the real performance $perf_i$ is better or worse than the inferred performance $h_i$. For example, assume that the cloud user has an intent to encode a 100 GB video in 10 s using a given encoder that runs on a VM, and would like to know how much computation resource needs to be allocated to the VM to meet the intent. Let us consider two situations where the absolute difference between the real performance and the inferred performance is the same:
(1)
The inferred performance is better than the real performance. Assume that the RDF model infers that, for the given workload, i.e., encoding a 100 GB video, allocating 4 vCPUs to the VM will result in the task taking 9.9 s (inferred performance $h_i$), while the real performance is 10.1 s. In this case, when RDF used the model to determine resources, it decided that 4 vCPUs were sufficient to meet the intent and instructed the resource orchestrator to implement the service with 4 vCPUs. However, in implementation, allocating 4 vCPUs to the VM to process the workload took 10.1 s (real performance), so an intent breach happened since the real performance was worse than the intended performance of 10 s.
(2)
The real performance is better than the inferred performance. Assume that the RDF model infers that, for the given workload, i.e., encoding a 100 GB video, allocating 4 vCPUs to the VM will result in the task taking 10.3 s (inferred performance $h_i$), while the real performance is 10.1 s. In this case, when RDF used the model to determine resources, it decided that 4 vCPUs were unable to meet the intent and searched for other available resource design solutions, thus preventing the intent breach.
In both cases described above, the absolute difference between the real performance and the inferred performance is the same, 0.2 s. However, in case (1) the inferred performance is better than the real performance, which results in a higher intent breach risk, whereas in case (2) the inferred performance is worse than the real performance, which results in a lower intent breach risk.
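The asymmetry can be seen numerically: an unbiased term such as the absolute error treats both cases identically, whereas a biased term like the one used in $L_{BTP}$ of (7) below penalizes only the risky case (ε is an example value; a larger process time counts as worse performance):

```python
eps = 0.5  # adjustable constant, example value

def unbiased_term(perf, h):
    return abs(perf - h)

def biased_term(perf, h):
    # Adds eps when the real performance is worse than the inferred one.
    return abs(perf - h) + (eps if perf > h else 0.0)

print(unbiased_term(10.1, 9.9), biased_term(10.1, 9.9))    # case (1): 0.2 vs 0.7
print(unbiased_term(10.1, 10.3), biased_term(10.1, 10.3))  # case (2): 0.2 vs 0.2
```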
This observation reveals a drawback of the N-mode loss functions: they cannot distinguish inferences that lead to different risk levels of intent breach. On the basis of this observation, we propose P-mode. P-mode lowers the intent breach risk by adopting our proposed loss functions $L_{BTP}$, $L_{BEP}$, and $L_{BDP}$, which impose a penalty on performance inferences that lead to a high risk of intent breach.
On the basis of the breach penalty patterns, we propose three models for P-mode: $P_{BTP}$, $P_{BEP}$, and $P_{BDP}$. The loss functions of the P-mode models, (7), (9), and (10) below, are composed of two parts. The first part is an unbiased loss function, for which we use MAE in this work, consistent with the absolute error terms in (7), (9), and (10); when the unbiased part is MSE or MAPE instead, the P-mode loss functions can be formulated in a similar way. The second part is a penalty function that is added to the total loss when the inferred performance $h_i$ is better than the real performance $perf_i$.
For $P_{BTP}$, the cloud provider sets the RDF mode in the knowledge abstraction phase to $P_{BTP}$ mode, so that during the learning process the loss function $L_{BTP}$ is applied:
$$ L_{BTP}=\frac{1}{m}\sum_{i=0}^{m-1}\left(\left|perf_i-h_i\right|+\varepsilon \ast b_i\right) $$
(7)
and,
$$ b_i=\begin{cases}0, & perf_i \text{ is no worse than } h_i\\ 1, & perf_i \text{ is worse than } h_i\end{cases} $$
(8)
where m is the number of training data records, $perf_i$ is the real performance value for the $i$th training data record, $h_i$ is the inferred performance value for the $i$th training data record, and ε is an adjustable constant. As we can see from the formula of $L_{BTP}$, a fixed penalty is added to the absolute error between the real performance and the inferred performance when the real performance is worse than the inferred performance.
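A minimal NumPy sketch of $L_{BTP}$ in (7) and (8), assuming a “smaller is better” performance metric such as process time (so $perf_i$ is worse than $h_i$ when $perf_i > h_i$):

```python
import numpy as np

def l_btp(perf, h, eps=0.5):
    """Biased loss (7): absolute error plus a fixed penalty eps whenever
    the real performance is worse than the inferred performance."""
    b = (perf > h).astype(float)  # indicator b_i from (8), time-based metric
    return np.mean(np.abs(perf - h) + eps * b)

perf = np.array([10.1, 10.1])
h = np.array([9.9, 10.3])
print(l_btp(perf, h))  # only the first (risky) inference is penalized
```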
For $P_{BEP}$, the cloud provider sets the RDF mode in the knowledge abstraction phase to $P_{BEP}$ mode, so that during the learning process the loss function $L_{BEP}$ is applied:
$$ L_{BEP}=\frac{1}{m}\sum_{i=0}^{m-1}\left(\left|perf_i-h_i\right|+\varepsilon \ast b_i \ast \left|perf_i-h_i\right|\right) $$
(9)
As we can see from the formula of $L_{BEP}$, the weighted difference between the real performance and the inferred performance is added to the absolute error when the real performance is worse than the inferred performance.
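Analogously, a sketch of $L_{BEP}$ in (9), under the same “smaller is better” assumption:

```python
import numpy as np

def l_bep(perf, h, eps=0.5):
    """Biased loss (9): absolute error plus a penalty proportional to the
    breach extent whenever the real performance is worse than inferred."""
    b = (perf > h).astype(float)  # indicator b_i from (8)
    err = np.abs(perf - h)
    return np.mean(err + eps * b * err)
```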
For $P_{BDP}$, the cloud provider sets the RDF mode in the knowledge abstraction phase to $P_{BDP}$ mode, so that during the learning process the loss function $L_{BDP}$ is applied:
$$ L_{BDP}=\frac{1}{m}\sum_{i=0}^{m-1}\left(\left|perf_i-h_i\right|+\varepsilon \ast b_i \ast dur_i\right) $$
(10)
where $dur_i$ is the intent breach duration for the $i$th training data record. As we can see from the formula of $L_{BDP}$, the weighted intent breach duration is added to the absolute error between the real performance and the inferred performance when the real performance is worse than the inferred performance.
In the case where the intent is a process time restriction, when the required process time restriction is not met, the intent breach duration $dur_i$ equals the real process time, i.e., $perf_i$. Thus, in this case, the loss function $L_{BDP}$ is rewritten as:
$$ L_{BDP}=\frac{1}{m}\sum_{i=0}^{m-1}\left(\left|perf_i-h_i\right|+\varepsilon \ast b_i \ast perf_i\right) $$
(11)
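A sketch of $L_{BDP}$ in (10), with the process time special case (11) in which the breach duration equals the real process time:

```python
import numpy as np

def l_bdp(perf, h, dur, eps=0.5):
    """Biased loss (10): absolute error plus a penalty proportional to the
    intent breach duration whenever the real performance is worse."""
    b = (perf > h).astype(float)  # indicator b_i from (8)
    return np.mean(np.abs(perf - h) + eps * b * dur)

# Special case (11): for a process time intent, dur_i equals perf_i.
perf = np.array([10.1, 10.1])
h = np.array([9.9, 10.3])
print(l_bdp(perf, h, dur=perf))
```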
To conclude, RDF has two modes: Normal mode (N-mode) and intent breach Prevention mode (P-mode). N-mode is the baseline mode, in which no intent breach prevention mechanism is applied. The objective of P-mode is to enhance the service quality and the user’s satisfaction by suitably adding bias to the performance inference to decrease the intent breach risk while ensuring high inference accuracy.