Published in: Distributed and Parallel Databases 1/2024

Open Access 22-06-2023

SimCost: cost-effective resource provision prediction and recommendation for spark workloads

Authors: Yuxing Chen, Mohammad A. Hoque, Pengfei Xu, Jiaheng Lu, Sasu Tarkoma

Abstract

Spark is one of the most popular big data analytical platforms. To save time, achieve high resource utilization, and remain cost-effective for Spark jobs, it is challenging but imperative for data scientists to configure suitable resource portions. In this paper, we investigate the proper parameter values that meet workloads' performance requirements with minimized resource cost and resource utilization time. We propose SimCost, a simulation-based cost model, to predict the performance of jobs accurately. We achieve low-cost training by taking advantage of a simulation framework, namely Monte Carlo simulation, which uses a small amount of data and resources to make reliable predictions for larger datasets and clusters. The salient feature of our method is that it requires only a low training cost to obtain an accurate prediction. Through empirical experiments with 12 benchmark workloads, we show that the cost model yields less than 5% prediction error on average, and the recommendation achieves up to 6x resource cost saving.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

Numerous user-centric services and platforms on the modern world wide web thrive on large-scale data analysis enabled by big data platforms. These platforms come with off-the-shelf packages and algorithms to manage, process, and analyze big data. Organizations and individuals deploy them in local or cloud computing environments with an emphasis on cost-effective and efficient utilization of resources (e.g., [8, 22]). Spark is a memory-based big data platform that allows faster computation. With native support from various cloud computing services, Spark-based big data analytics has been thriving in academia and industry. Nevertheless, managing the resources with proper configuration for Spark jobs remains challenging [21, 37].
There exist excellent previous works [21, 37] addressing parameter tuning challenges by leveraging machine learning techniques, including Decision Tree (C5.0) regression [49], Logistic regression [31], Gradient Boosting regression [19], and Reinforcement Learning [56], or by search-based algorithms, including bound-and-search [57], Random Forest [4], and Genetic Algorithms [55]. Each approach has its advantages and has succeeded in certain application scenarios. Interestingly, the success of these methods relies heavily on the quality and quantity of training samples. Such samples are often lacking, and collecting the required tens [38] to thousands [53] of them means training workloads for hours [56] or even days [33]. The one-time training cost is still competitive with manual configuration, as these methods generally assume that workloads in production are similar or run repeatedly.
However, these techniques are not cost-effective for large-scale dynamic workloads. First, the accumulated training overhead of running large-scale data is prohibitive. Second, by the time training finishes, the training samples may no longer be valid due to workload variation [35]. Third, they mostly assume fixed input data and resources, whereas these should sometimes be individualized per workload [15, 46].
The existing machine learning models heavily rely on various historical logs from the framework. For example, Hernández et al. [19] developed regression models to predict the completion time of Spark jobs from a set of system and application metrics. Wang et al. [49] used a set of classification algorithms and investigated the parameters and their effects on performance. Nevertheless, these approaches require a great deal of time for training models. There are various cost models (e.g., [48, 50]) as well. Wang et al. [50] and Ernest [48] executed jobs with smaller input data sizes and predicted the performance for larger datasets using cost functions. These machine learning and cost models do not account for the limits imposed by network performance and disk I/O, nor do they consider recommending resource parameter values.
This paper addresses the challenges of cost-effective resource allocation for Spark jobs. We particularly choose a white-box cost model, as black-box machine learning methods incur costly training expenses. The previous cost models (e.g., [48, 50]) do not capture the I/O bottleneck well as data scales, nor can they adapt when I/O settings change. Our contribution is a simulation framework assisted by a novel cost model for Spark jobs that considers the runtime I/O bottleneck. The framework adaptively simulates the jobs with various data sizes and resource configurations, collects the performance information, and makes a reliable prediction.
The contributions of this paper are summarized as follows:
Contribution 1: Low-cost simulation-based prediction model We propose SimCost, a simulation-based analytical cost model, to predict performance both cost-effectively and accurately. A simulation framework assists in cost-effective performance prediction for large input datasets and clusters by utilizing a small amount of data and resources. In order to capture the representative characteristics of jobs from a small data sample, we adopt Monte Carlo (MC) simulation [5], which mitigates the skewness [36] of the data distribution as well as the deviation in runtime performance (details in Sect. 4.1). Along with the job characteristics, the cost model considers both configurable (e.g., memory and vcore) and non-configurable (e.g., disk and network) parameters. The model covers both scalable and non-scalable factors, making the prediction more accurate. Through simulation, the cost model further predicts intermediate input, shuffle, and output sizes, etc., enabling a user to build job profiles with far fewer resources and much less time.
Contribution 2: Cost-effective parameter recommendation SimCost also combines the cost model with performance metrics in recommendation models to determine the proper resource amount for executing a job. Most previous works (e.g., [17, 23, 43, 48]) tune job performance with a fixed amount of resources. However, the best resource allocation often depends on the characteristics of a job, the input data size, and I/O bottlenecks [22]. Allocating all the remaining resources to a job does not always lead to the best performance with respect to resource utilization [51] or execution time [18].
As the example in Fig. 1 shows, when assigning one vcore to the Bayesian Classification job, the CPU (i.e., vcore) is the principal bottleneck, as adding one more vcore reduces the execution time drastically. However, when we keep adding vcores (more than 3), the performance improves negligibly, as the bottleneck shifts to disk or network I/O. Assigning more vcores while gaining only a slight improvement is not cost-effective.
Therefore, running time alone (e.g., [12, 28]) is not an adequate metric for multi-goal recommendations, especially for cost-effectiveness. To address this, we introduce the resource time metric, and our recommendation models use this metric to find near-optimal configurations with a custom range search algorithm (details in Sect. 4.3).
Contribution 3: Comprehensive experiments We run three suites of HiBench benchmarks [27], namely Micro, ML, and SQL, and evaluate SimCost extensively. With SimCost, a production environment only needs to simulate jobs with a small amount of data for prediction. As fewer resources and less data are required, the training cost ratio is considerably low (7.92% and 13.64% training ratios (defined in Eq. 7) on average for 80% and 90% confidence intervals, respectively). Experiments conducted on 12 benchmark workloads show that SimCost yields less than 5% error on average prediction accuracy and achieves cost-effective recommendations on performance (up to 6x gain on resource time).
This paper is an extended version of a 4-page conference paper titled Cost-effective Resource Provisioning for Spark Workloads in CIKM'19 [11]. Besides a general elaboration of the contents, this extended version provides more details of the theoretical analysis of our approach. It also expands the empirical evaluation with more workloads and data.
The remainder of the paper is organized as follows. Section 2 provides the preliminaries and Sect. 3 describes the problem statement. Section 4 details the approaches in memory assurance, performance prediction, and parameter recommendation. Section 5 evaluates our approach with benchmark workloads. Section 6 investigates the related work. Section 7 discusses the possible extensions, concludes the current work, and directs the future work.

2 Preliminaries

In this section, we provide the necessary preliminaries and define our problem.
Spark is an open-source data processing engine that is designed for the fast and efficient processing of large volumes of data. Apache Spark has two primary abstractions: Directed Acyclic Graph (DAG), which manages the scheduling of Spark jobs, and Resilient Distributed Datasets (RDD), which stores the data used for computation. This paper explores the cost of running a Spark job and investigates strategies for optimizing resource allocation and scheduling for improved efficiency.
Spark Job A Spark job is a unit of work that the Apache Spark engine executes in a distributed computing cluster. It consists of a series of tasks performed on data stored in the cluster, and may involve complex transformations, filtering, aggregations, and machine learning algorithms. A Spark job is a user application that is sliced into stages. A stage is a set of independent tasks all computing the same function, where all the tasks have the same shuffle dependencies. Each stage is either a shuffle map stage, in which case its tasks' results are input for another stage, or a result stage, in which case its tasks directly compute the action that initiated a real computation (e.g., count(), save()).
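For instance, a minimal PySpark toy job (hypothetical data, not from the benchmarks): the `reduceByKey` transformation introduces a shuffle dependency, so the `collect()` action triggers a job with a shuffle map stage followed by a result stage.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stage-example").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["a", "b", "a", "c", "b", "a"])   # toy input data
pairs = words.map(lambda w: (w, 1))                      # narrow transformation
counts = pairs.reduceByKey(lambda x, y: x + y)           # shuffle map stage
print(counts.collect())                                  # action -> result stage
```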
Executor An executor in Spark is a fraction of the NodeManager capacity, allocated by the resource manager Yarn. An executor involves two types of resources: CPU and memory. In Yarn, the virtual core abstraction (vcore) represents the CPU resource. The executor size reveals how many resources are available in each unit, and the number of executors reveals the total allocated resources. The key parameters discussed in this paper are: (i) executor-memory: the amount of memory per executor process; (ii) executor-vcore: the number of vcores per executor process; (iii) num-executor: the number of executors. These configurable parameters involve only vcore and memory resources. We denote by V and M the total assigned vcores and memory; V, for example, equals the product of executor-vcore and num-executor.
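As an illustration, a minimal PySpark sketch of setting these three parameters (the values shown are hypothetical, not recommendations from this paper):

```python
from pyspark.sql import SparkSession

# Hypothetical executor configuration: 10 executors, 3 vcores and 4 GB each.
spark = (SparkSession.builder
         .appName("simcost-example")
         .config("spark.executor.instances", "10")  # num-executor
         .config("spark.executor.cores", "3")       # executor-vcore
         .config("spark.executor.memory", "4g")     # executor-memory
         .getOrCreate())

# Total assigned resources in the paper's notation.
V = 10 * 3   # total vcores
M = 10 * 4   # total memory in GB
```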
The memory in Yarn is natively isolated: the memory usage cannot exceed the allocated amount, otherwise Yarn kills the executor. The vcore isolation is enforced by Control Groups (CGroups [14]), which limit executors to their allocated share of resources. The isolation of memory and vcore is a fundamental requirement of our performance prediction models for Spark applications.
Benchmark HiBench [27] is a big data benchmark suite that helps to evaluate different big data frameworks concerning speed and system resource utilization. We evaluate three types of benchmarks (Micro, ML, and SQL) with twelve workloads, namely Wordcount (WC), Sort, Terasort (TS), k-means Clustering (Kmeans), Bayesian Classification (Bayes), Support Vector Machine (SVM), Linear Regression (LR), Latent Dirichlet Allocation (LDA), Principal Components Analysis (PCA), Join, Aggregate, and Scan. We characterize these workloads from HiBench in Table 1 according to their resource consumption and other properties. For example, machine learning workloads consume memory and CPU intensively. Micro and SQL workloads can often be computed in a single pass, and sort operations involve huge data movement among the nodes.
Table 1
Characteristics of experimental workloads

Workload    | Memory intensive | CPU intensive  | One pass       | Heavy shuffle
Wordcount   |                  | \(\checkmark\) | \(\checkmark\) |
Sort        |                  |                | \(\checkmark\) | \(\checkmark\)
TeraSort    |                  |                | \(\checkmark\) | \(\checkmark\)
K-means     | \(\checkmark\)   | \(\checkmark\) |                |
Bayesian    | \(\checkmark\)   | \(\checkmark\) |                |
SVM         | \(\checkmark\)   | \(\checkmark\) |                |
LR          |                  | \(\checkmark\) |                |
LDA         | \(\checkmark\)   | \(\checkmark\) |                |
PCA         | \(\checkmark\)   | \(\checkmark\) |                |
Join        |                  |                | \(\checkmark\) |
Aggregation |                  |                | \(\checkmark\) |
Scan        |                  |                | \(\checkmark\) |

3 Problem statement

We first introduce the performance metrics for evaluation. The execution time is a primitive yet essential performance metric, and most metrics are composed of multiple factors. Our prediction model can be extended with other time-related metrics, but in this paper we introduce two of the most important ones, namely resource time (e.g., used in [17]) and throughput (e.g., used in [2]).
Resource time, \(T_R\). Job running time can represent the speed of the computation. However, it is inadequate to represent resource consumption [17]. Nowadays, pay-as-you-go clouds are widely used, e.g., Amazon's on-demand pricing. Given the job execution time T and the total allocated vcores V and memory M, we define memory time \(T_M\) (\(=M\times T\)), vcore time \(T_V\) (\(=V\times T\)), and resource time \(T_R\) as follows:
$$\begin{aligned} \begin{aligned} T_R=T_V + \theta T_M \\ \end{aligned} \end{aligned}$$
(1)
where \(\theta\) can reveal the ratio of per GB memory price and per vcore price in practice.
We choose \(T_R\) as our performance metric for two reasons. Firstly, \(T_R\) reveals resource utilization: it quantifies how many units of the actual resource have been used, relating to resource cost in the cloud and power consumption [51] in the cluster. Intuitively, for a job, a lower resource cost leads to a lower \(T_R\) and hence higher resource utilization. Secondly, it captures the fairness [16] of sharing resources among applications.
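As a minimal illustration of Eq. (1), resource time can be computed as follows (the price ratio \(\theta\) here is an assumed example value, not one from the paper):

```python
def resource_time(T, V, M, theta=0.5):
    """Resource time T_R = T_V + theta * T_M (Eq. 1).

    T: job execution time (s), V: total vcores, M: total memory (GB),
    theta: assumed ratio of per-GB-memory price to per-vcore price.
    """
    T_V = V * T          # vcore time
    T_M = M * T          # memory time
    return T_V + theta * T_M

# Example: a 200 s job using 30 vcores and 18 GB memory.
print(resource_time(T=200, V=30, M=18))   # 7800.0
```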
We explore the vcore and memory parameter values to achieve a minimum \(T_R\). Given a set of vcore and memory parameter values (V, M) and the corresponding running time T, the recommended \(v_{rec}\) and \(m_{rec}\) achieve the minimum resource time, meaning that:
$$\begin{aligned} (v_{rec}, m_{rec}) = argmin _{v \in V,m \in M} T_R(v,m) \end{aligned}$$
(2)
We now introduce the formal problem statement, as well as our goal for the problem. The problem can be described as follows:
  • Input (i) A set of workloads/jobs \(\mathbb {J}\) (the individual job \(\mathcal {J}\) with the data size D); (ii) available cluster resource, i.e., available nodes, vcore, and memory; (iii) requirements defined by users (e.g., deadline T).
  • Output For each \(\mathcal {J}\), we recommend (i) number of vcores V; (ii) amount of memory M
  • Objective Cost-effective parameter values V and M.
We show an example of the primary inputs and outputs in Table 2. The executor size (vcore (V), memory (G)) represents the vcore and memory configuration of each executor, while the executor number describes the total number of executors. The essence of the goal is to provision resources [8, 45] towards optimal performance, i.e., to minimize resource time in Eq. (2). The goal is often subject to satisfying user-defined requirements, e.g., deadlines. To achieve the goal, we build a predictive cost model that estimates runtime by running the jobs with a tiny fraction of the data. We describe the input and output for the prediction model as follows:
  • Prediction input (i) Logs of sample run (input size, output size, intermediate input size, shuffle size.); (ii) resources profiles (cluster resource information, i.e., available vcore and memory, number of nodes).
  • Prediction output Performance, i.e., execution time
We elaborate on the resource profile and its collection methods in Sect. 4.2.2. The performance prediction model enables us to recommend parameters based on different performance metrics, e.g., execution time and resource time.
Table 2
Example of main inputs and outputs

       | Job type        | kmeans   | Sort    | Join
Input  | Input data size | 9 GB     | 10 GB   | 5 GB
       | Deadline        | 200 s    | 100 s   | 100 s
Output | Executor number | 2        | 1       | 1
       | Executor size   | 10V 12 G | 10V 5 G | 5V 5 G

4 SimCost

Figure 2 shows the overall components of SimCost and the workflow between them. First (Preparation), based on the data size and available resources, we sample and split the input data; we later execute the job with a small portion of such sampled data. Second (Memory assurance), we repeat the simulation to assure the memory amount, thus predicting the memory needed for the original data size. An assured memory amount in our context means sufficient memory to avoid out-of-memory errors for non-caching jobs, or memory into which the data fits for caching jobs. Third (Prediction), we build the analytical cost model, which analyzes the sampled runs and records the profiles of jobs, to predict the performance with a larger data size and cluster. We repeatedly execute the job with different samples to mitigate the skewness of the data distribution. Last (Recommendation), we aim to optimize a performance metric, e.g., runtime, resource time, or throughput. We construct the configuration space from the available resources and introduce an algorithm that iteratively searches the resource configurations until we achieve near-optimal performance. We describe these procedures in detail in the following parts.

4.1 Simulation

Here, we simulate job executions with a tiny portion of the input data to predict the performance with the original input data. We leverage a sampling technique to select the data fractions in a distributed manner. We employ Independent Bernoulli Sampling [41], which assumes independent and identical distribution, to sample the input data. Given the original data X, a sampling probability p, and the resulting sample \(X'\), the estimated result \(\hat{X}\) can be derived as follows:
$$\begin{aligned} \begin{aligned} E[X']=X\times p \\ \hat{X}=\frac{1}{p}\times X' \\ \end{aligned} \end{aligned}$$
(3)
Using the sampled data, the job is executed to estimate the runtime by a cost model, i.e., \(\hat{T}=Cost(\hat{X},V,N)\), where \(\hat{X}\), V, and N are the data size, the number of vcores, and the number of nodes, respectively.
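A minimal sketch of this sampling-and-scaling step, assuming the input is simply a list of records:

```python
import random

def bernoulli_sample(records, p, seed=0):
    """Independent Bernoulli sampling: keep each record with probability p."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < p]

def scale_estimate(sample_value, p):
    """Scale a quantity measured on the sample back to the full data (Eq. 3)."""
    return sample_value / p
```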
It is tricky to determine the number of samples intelligently, because too few samples may be insufficient to capture the profile of the job, and two further challenges exist:
1. Data skewness: the performance of a job can vary between two different data samples of the same size.
2. Runtime deviation: the performance of a job can vary even with exactly the same data and resource configuration.
To address these two challenges, we adopt Monte Carlo (MC) simulation (e.g., [5]) with replacement sampling. That is, each draw selects a sample from the whole dataset with equal probability, and previously drawn samples remain equally likely to be drawn again. With independent and identically distributed estimates, the Central Limit Theorem (CLT) holds, i.e., the mean \(\hat{\mu }_{\hat{X}}\) converges to a normal distribution as n approaches infinity:
$$\begin{aligned} \hat{\mu }_{\hat{X}} \sim \mathcal {N} (E[\hat{X}], Var[\hat{X}]/n) \end{aligned}$$
(4)
The cost model can be extrapolated into a linear form [48], i.e., \(T=Cost(D,V,N)=\alpha D+\beta\) when V and N are given. So we can predict the runtime \(\hat{T}=\alpha \hat{X}+\beta\) given sample data \(\hat{X}\). The mean runtime \(\hat{\mu }_{\hat{T}}\) converges to a normal distribution as the number of simulations n approaches infinity:
$$\begin{aligned} \hat{\mu }_{\hat{T}} \sim \mathcal {N} (\alpha E[\hat{X}]+\beta , \alpha ^2 Var[\hat{X}]/n)\ \ i.e., \ \mathcal {N} (E[\hat{T}], Var[\hat{T}]/n) \end{aligned}$$
(5)
Equation (5) provides the confidence interval (CI) of the cost model's estimates as follows: \(CI(C_m) = [L_{C_m},U_{C_m}] = \left[ \hat{\mu }_{C_m}-t_* \dfrac{\hat{\sigma }_{C_m}}{\sqrt{n}},\hat{\mu }_{C_m}+t_* \dfrac{\hat{\sigma }_{C_m}}{\sqrt{n}}\right]\). Given the estimated results \(\{\hat{T_1}, \hat{T_2},\ldots , \hat{T_n}\}\) and the mean runtime \(\bar{T}\) from n sample simulations, the variance is \(\hat{\sigma }_{C_m}^2=\frac{1}{n-1}\sum _{i=1}^{n}(\hat{T_i}-\bar{T})^2\). The simulation terminates when the given confidence level holds:
$$\begin{aligned} t_*\ \frac{\hat{\sigma }_{C_m}}{\sqrt{n}} < \epsilon \end{aligned}$$
(6)
We use the t-distribution [30] to indicate the confidence interval. For instance, we obtain an 80% confidence interval with \(t_*=1.325\), \(n=20\), and \(\epsilon =0.1\). Once Eq. (6) holds, the probability of the estimated value falling into \(\Big [\bar{T}-t_*\ \frac{\hat{\sigma }_{C_m}}{\sqrt{n}}, \bar{T}+t_*\ \frac{\hat{\sigma }_{C_m}}{\sqrt{n}}\Big ]\) is 80%.
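The simulation loop can be sketched as follows (a hedged outline; `run_sample_simulation` is a hypothetical callback that draws one sample, runs the job on it, and returns the scaled runtime estimate):

```python
import statistics

def monte_carlo_simulate(run_sample_simulation, t_star=1.325, eps=0.1,
                         min_runs=2, max_runs=50):
    """Repeat sample simulations until the stopping criterion of Eq. (6),
    t_* * sigma / sqrt(n) < eps, is met."""
    estimates = []
    while len(estimates) < max_runs:
        estimates.append(run_sample_simulation())   # one sample simulation
        n = len(estimates)
        if n < min_runs:
            continue
        sigma = statistics.stdev(estimates)         # sample standard deviation
        if t_star * sigma / n ** 0.5 < eps:
            break
    return statistics.mean(estimates), estimates
```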
Training cost The training cost accumulates the resource time of the sample runs. We describe the training cost of a job as the ratio of the resource time of the simulations to that of the actual run. It can be expressed as follows:
$$\begin{aligned} C_{training} = \frac{\sum _{i=1}^{n}{T_{r_i}}}{T_R} \end{aligned}$$
(7)
where \(T_{r_i}\) is the resource time of the \(i\)th simulation, n is the number of simulations, and \(T_R\) is the resource time of the actual run, which is fixed for a job with one configuration. Note that there is a trade-off between the training cost and the confidence of the prediction. Intuitively, higher confidence implies higher prediction accuracy and requires more simulations. In the rest of the paper, we use simulation and training interchangeably. Also worth mentioning is that the sampling cost, which may require re-partitioning the data once, is very low since the data size of the workloads is only up to tens of gigabytes.
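The corresponding training cost ratio of Eq. (7) is then simply:

```python
def training_cost_ratio(simulation_resource_times, actual_resource_time):
    """Eq. (7): accumulated resource time of all simulations divided by
    the resource time of one actual run."""
    return sum(simulation_resource_times) / actual_resource_time
```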

4.2 Prediction

Previous work such as Ernest [48] predicts the performance of a job at the node level. In contrast, we predict the performance at the level of specific resource parameters, specifying the amount of memory and the number of vcores and operating at task granularity. The core part of solving the problem is to estimate the execution time \(\hat{T}\) given a job and input data with vcore V and memory M parameter values. We first determine the memory requirement for the input data, then build the cost model to predict the job performance with varied vcores, which is the most performance-effective factor [39]. We assure memory first, as a lack of sufficient memory often leads to slow completion (e.g., disk I/O for iterations) or failure (e.g., out of memory) of a job, while extra memory barely helps (e.g., Bayesian classification in Fig. 3).

4.2.1 Memory assurance

Designing a proper sampling experiment can reveal the real memory need. We pick the resource time metric to determine memory configuration values, as it reveals cost-effectiveness: extra memory often does not help, and large memory may cause long GC times [3]. The actual run results of samples over memory resemble an elbow curve [32]. The elbow point, where the derivative of the curve has a large jump, is considered the near-optimal memory value for the best resource time. Note that this near-optimal memory size is only for resource time; a user may choose a different memory size according to her own metrics, such as job success, runtime, and throughput.
In SimCost, we initialize the memory parameter values with 0.5-2.5\(\times\) the sample data size, as the needed memory is often linear in the size of the input data. For ML jobs, we expect an elbow curve for the memory performance, as shown in Fig. 3. However, this curve can also be flat (e.g., in Micro and SQL jobs). In this case, the beginning region of the curve is more important, and we re-experiment with samples in this region to detect the actual memory requirement. Let P be a set of distinct sampling probabilities \(\{p_1,p_2,\ldots p_i,\ldots ,p_n\}\), where \(p_i\) is the i-th sampling probability. For each \(p_i\), we obtain the assured memory \(m_i\) from the elbow point, giving \(M = \{m_1,m_2,\ldots m_i,\ldots ,m_n\}\). Then, using a regression model, we can predict the assured memory \(\hat{M}\) for the input data given P and M. We employ a linear regression model, as it works well in practice for our benchmark workloads in Table 1.
Note that \(m_i\) is the value of assured memory for caching, called cache memory for convenience. We pay attention to the performance of the available cache memory out of the allocated executor memory. The cache memory \(M_{cache}\) is allocated from the executor memory \(M_{executor}\); it is calculated from the cache fraction \(\theta _{cache}\) and the safe reserved memory \(M_{safe}\), i.e., \(M_{cache} = (M_{executor} - M_{safe})\times \theta _{cache}\) [1]. By default \(\theta _{cache}=0.6\), and the reserved memory varies slightly as the executor memory changes. Therefore, in our recommendation, we approximate the executor memory as twice the cache memory for simplicity; such an approximation rounds up and thus still guarantees sufficient memory to cache the data.
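A hedged sketch of the two steps described above: picking the elbow point from sampled (memory, resource time) measurements, and linearly extrapolating the assured cache memory to the full data. The 5% improvement threshold is an assumed heuristic, not a value from the paper.

```python
def elbow_memory(points, rel_improvement=0.05):
    """Return the smallest cache memory after which resource time stops
    improving noticeably. `points` is a list of (memory_gb, resource_time)
    pairs sorted by increasing memory; the threshold is a heuristic."""
    assured = points[0][0]
    for (m0, t0), (m1, t1) in zip(points, points[1:]):
        if (t0 - t1) / t0 < rel_improvement:   # negligible gain: elbow reached
            break
        assured = m1
    return assured

def predict_cache_memory(probs, assured_mems):
    """Least-squares line of assured memory vs. sampling probability,
    evaluated at p = 1 (the original data)."""
    n = len(probs)
    mp, mm = sum(probs) / n, sum(assured_mems) / n
    slope = (sum(p * m for p, m in zip(probs, assured_mems)) - n * mp * mm) / (
        sum(p * p for p in probs) - n * mp ** 2)
    return slope * 1.0 + (mm - slope * mp)

# Example: if p = 1/30, 1/15, 1/10 give 0.6, 1.2, 1.8 GB of assured cache
# memory, the fit extrapolates to 18 GB at p = 1, and the recommended
# executor memory is approximately 2 * 18 = 36 GB.
```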
Example 1
We consider an example of Bayesian classification. The details of the workload are described in Table 4. Figure 3 shows the training results with different memory sizes and sample probabilities. The plots in the first row show the running time of the samples, while the plots in the second row show the resource time. By the elbow method, we select the memory sizes \(m=0.9\) and \(m=1.8\) for \(p=1/30\) and \(p=1/15\), respectively.

4.2.2 Performance prediction

Now we introduce the cost model for job performance prediction. The cost model considers not only the computation time but also the input, output, and shuffle time. The computation time is mainly restricted by the number of vcores, whereas others are mainly restricted by the network bandwidth and disk read/write performance. We formally define the cost model as follows.
Cost/prediction model Let \(\mathcal {J}\), D, V, and N be the specific job, data size, number of vcores, and number of nodes, respectively. Given D, V, and N for \(\mathcal {J}\), a cost function \(Cost_\mathcal {J}(\cdot )\) returns the running time of the job, \(T_{\mathcal {J}}\). A Spark job consists of many stages, and each stage consists of parallel tasks. For a stage, the longest execution time of a task is recorded, as it determines the stage completion time. We describe the model as follows:
$$\begin{aligned} \begin{aligned} T_{\mathcal {J}}&= Cost_\mathcal {J}(D,V,N) = O_{\mathcal {J}} + \sum T_{\mathcal {S}}(D,V,N) \end{aligned} \end{aligned}$$
(8)
where \(O_{\mathcal {J}}\) stands for the overhead of preparing and finishing for stages. We describe the cost function of a stage in the following:
$$\begin{aligned} T_{\mathcal {S}}(D,V,N) = O_{\mathcal {S}}(N) + \Big \lceil \frac{P}{V}\Big \rceil \sum T_{\mathcal {T}}\Big (\frac{D}{P}\Big ) \end{aligned}$$
(9)
where \(O_{\mathcal {S}}\) stands for the overhead of preparing and finishing the tasks in a stage; it increases as N increases because of the communication overhead. P is the number of partitions (tasks), and Spark assigns one vcore thread to each partition's task. The factor \(\lceil \frac{P}{V}\rceil\) captures that the computing time scales inversely with the number of vcores [48]. A task consists of overhead, computing, and shuffling steps. We describe the cost function of a task as follows:
$$\begin{aligned} T_{\mathcal {T}}(D) = O_{\mathcal {T}} + T_{\mathcal {C}}(D) + T_{\mathcal{S}\mathcal{H}}(D) \end{aligned}$$
(10)
where \(O_{\mathcal {T}}\) stands for the overhead of preparing and finishing a task, and \(T_{\mathcal{S}\mathcal{H}}\) is the time for shuffling. We describe the cost functions of the computing and shuffling parts as follows:
$$\begin{aligned} \begin{aligned} T_{\mathcal {C}}(D) = \frac{D}{C_{ri}} + D \cdot C_{compute} + \frac{D}{C_{wo}} \end{aligned} \end{aligned}$$
(11)
$$\begin{aligned} \begin{aligned} T_{\mathcal{S}\mathcal{H}}(D) = \frac{D \cdot r_{read}}{C_{rs}} + \frac{D \cdot r_{write} }{C_{ws}} \end{aligned} \end{aligned}$$
(12)
where \(C_{compute}\), \(C_{ri}\), \(C_{wo}\), \(C_{rs}\), and \(C_{ws}\) are the computation, reading input, writing output, shuffle reading, and shuffle writing speeds, respectively. \(r_{read}\) and \(r_{write}\) are the ratios of shuffle read and shuffle write concerning the data size. We explain how we learn these parameters of the prediction model as follows.
Learning the model \(C_{compute}\) may vary between jobs, as the core computation part varies from job to job. \(C_{ri}\), \(C_{wo}\), \(C_{rs}\), and \(C_{ws}\) are fixed for the model, as we assume fixed network and disk I/O. We collect the simulation logs from the history server using the Spark REST API [1]. From the logs, we record \(T'_{\mathcal {C}}\), \(D'\), and \(V'\), and compute \(C_{compute}\) by Eq. (11). We also record the input data and shuffle sizes for computing \(r_{read}\) and \(r_{write}\), the ratios of shuffle read and shuffle write data to the input data.
In Eqs. (11) and (12), \(C_{ri}\), \(C_{rs}\), \(C_{ws}\), and \(C_{wo}\) are computed from the local disk speed, HDFS disk speed, and network speed as follows: (i) \(C_{ri} = min\{C_{network}, C_{hr}\}\); (ii) \(C_{wo} = min\{C_{network}, C_{hw}\}\); (iii) \(C_{ws} = min\{C_{network}, C_{ldw}\}\); (iv) \(C_{rs} = min\{C_{network}, C_{ldr}\}\); where \(C_{network}\), \(C_{hr}\), \(C_{hw}\), \(C_{ldr}\), and \(C_{ldw}\) are the profiles of network throughput, HDFS read throughput, HDFS write throughput, local disk read throughput, and local disk write throughput, respectively. These values are first recorded by testing the throughput with the Linux dd command and then judiciously revised by measuring the performance of some stressing workloads (e.g., Sort for shuffling throughput, as Sort is shuffle-intensive). We compute these values from the stress test rather than from the samples, as the sample runs often do not reach the bottleneck.
After obtaining the above resource profiles, we perform a simulation to calculate the computing speed \(C_{compute}\) and the overheads of the job (\(O_{\mathcal {J}}\)), stage (\(O_{\mathcal {S}}\)), and task (\(O_{\mathcal {T}}\)) for their preparation and completion. We can then estimate the running time of the job by plugging the larger data size D (\(=D'/ p\)), the larger number of vcores V (\(=V'/ p\)), and the larger number of machines N into Eq. (9). Note that we assume the assured memory is allocated according to the data size, as clarified in Sect. 4.2.1.
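Putting Eqs. (8)-(12) together, the prediction can be sketched as a small analytical function (a hedged outline only; the profile constants and overheads are assumed to come from the sample simulations and stress tests described above):

```python
import math

def predict_stage_time(D, V, N, P, prof):
    """Analytical sketch of Eqs. (9)-(12) for one stage.

    D: stage input size, V: total vcores, N: nodes, P: partitions (tasks).
    `prof` holds learned profile values: per-unit compute cost C_compute,
    throughputs C_ri/C_wo/C_rs/C_ws, shuffle ratios r_read/r_write, and
    overheads O_task / O_stage (per node)."""
    d = D / P                                                         # data per task
    t_compute = d / prof["C_ri"] + d * prof["C_compute"] + d / prof["C_wo"]          # Eq. (11)
    t_shuffle = d * prof["r_read"] / prof["C_rs"] + d * prof["r_write"] / prof["C_ws"]  # Eq. (12)
    t_task = prof["O_task"] + t_compute + t_shuffle                   # Eq. (10)
    waves = math.ceil(P / V)                                          # ceil(P / V)
    return prof["O_stage"] * N + waves * t_task                       # Eq. (9)

def predict_job_time(stages, V, N, prof):
    """Eq. (8): job overhead plus the sum of its stage times;
    `stages` is a list of (stage input size, partition count) pairs."""
    return prof["O_job"] + sum(predict_stage_time(D, V, N, P, prof)
                               for (D, P) in stages)
```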
Profiler Profiling, as a program analysis tool, enables the user to understand workload behavior. It refers to the process of collecting and constructing performance metrics such as shuffle size, output size, computing time, etc. The profiler built on such collected metrics plays a vital role in performance monitoring and tuning. Traditional profiling collects historical logs to build profilers (e.g., [20]). However, profiling is costly when running jobs with large input data, as it is difficult to characterize resource costs and job behavior from a cold start. Our simulation makes profiling a cost-effective alternative. It further explains the cost model in terms of the performance metrics and therefore also predicts any of the performance metrics for a larger dataset. Table 3 shows the performance profiles of some example jobs constructed by the profiler.
Table 3
Example of workload profiles

Job    | Sample size | Input size (GB) | Intermediate input size' (GB) | Shuffle size | Output size | Running time' (s) | Throughput' (MB/s) | Resource time
Sort   | 632 MB      | 9.55            | 9.6                           | 4.2 GB       | 8.5 GB      | 260               | 37.6               | 9574.16
WC     | 3.06 GB     | 91.8            | 91.8                          | 2.7 MB       | 1 MB        | 224               | 419                | 8065.73
Kmeans | 382 MB      | 11.2            | 21.8                          | 1 MB         | 1 MB        | 106               | 108.1              | 4453.22
Join   | 591 MB      | 17.3            | 17.3                          | 1 GB         | 1 GB        | 159               | 111.9              | 5716.84
We run samples to predict (appended with “\('\)”) intermediate input size, shuffle size, and output size, thus predicting the running time, throughput, and resource time
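As a hedged sketch, such per-stage profile metrics can be pulled from the Spark history server's monitoring REST API (the `/api/v1` endpoints are part of Spark's documented REST interface; the host and application id below are placeholders):

```python
import json
import urllib.request

HISTORY_SERVER = "http://history-server:18080"   # placeholder address

def fetch_stage_profiles(app_id):
    """Collect per-stage input/output/shuffle sizes and run time from the
    Spark history server for building a workload profile."""
    url = f"{HISTORY_SERVER}/api/v1/applications/{app_id}/stages"
    with urllib.request.urlopen(url) as resp:
        stages = json.load(resp)
    return [{
        "stage_id": s.get("stageId"),
        "input_bytes": s.get("inputBytes", 0),
        "output_bytes": s.get("outputBytes", 0),
        "shuffle_read_bytes": s.get("shuffleReadBytes", 0),
        "shuffle_write_bytes": s.get("shuffleWriteBytes", 0),
        "executor_run_time_ms": s.get("executorRunTime", 0),
    } for s in stages]
```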

4.3 Parameter recommendation

The prediction model estimates runtime performance given a resource profile and job information. However, the given resource profile may not yield the best performance for the job. Therefore, we aim to recommend the resource profile based on a chosen performance metric: runtime, resource time, or throughput. We first theoretically formulate and analyze the recommendation model, then provide a practical algorithm to compute the recommendation.
Optimal runtime T We start with the runtime metric T. For clarity, we transform the cost model \(T_{\mathcal {J}} = Cost(D,V,N)\) into \(T_{\mathcal {J}}(V,N) = \gamma _{0}\frac{1}{V}+\gamma _{1}N+\gamma _{2}\) when D is given. First, if the number of machines N is independent of the number of vcores V, we obtain the lower bound:
$$\begin{aligned} \begin{aligned} L_{T_{\mathcal {J}}} = \lim _{V \rightarrow \infty } \gamma _{0}\frac{1}{V}+\gamma _{1}N+\gamma _{2} = \gamma _{1}N + \gamma _{2} \end{aligned} \end{aligned}$$
(13)
When N is fixed, \(T_{\mathcal {J}}\) converges to a constant value as V approaches infinity. This implies that increasing V reduces \(T_{\mathcal {J}}\) because of the reduction in computing time; however, after a certain point, increasing V barely helps, as the computing time no longer dominates. In practice, V depends on N, i.e., \(V=kN\), where k is the number of vcores in one node. Substituting \(N=\frac{V}{k}\), we obtain \(T_{\mathcal {J}}(V) = \gamma _{0}\frac{1 }{V}+\frac{\gamma _{1}}{k}V+\gamma _{2}\).
Lemma 1
Let V be the vcore value in a discrete set of values and \(V>0\), \(T_{\mathcal {J}}(V) = \gamma _{0}\frac{1 }{V}+\frac{\gamma _{1}}{k}V+\gamma _{2}\) is unimodal with a minimum value, where \(\gamma _{0}\),\(\gamma _{1}\),\(\gamma _{2}\),\(k>0\).
Proof
Its derivative is \(T'_{\mathcal {J}}(V) = -\gamma _{0}\frac{1}{V^2}+\frac{\gamma _{1}}{k}\). Given \(T'_{\mathcal {J}}(V)=0\) and \(V>0\), we obtain a unique root \(V=\sqrt{\frac{\gamma _{0}k}{\gamma _{1}}}\). When \(V<\sqrt{\frac{\gamma _{0}k}{\gamma _{1}}}\) and \(V>\sqrt{\frac{\gamma _{0}k}{\gamma _{1}}}\), \(T'_{\mathcal {J}}(V)<0\) and \(T'_{\mathcal {J}}(V)>0\), respectively. Therefore, \(T_{\mathcal {J}}(V)\) is unimodal with a minimum value at \(V=\sqrt{\frac{\gamma _{0}k}{\gamma _{1}}}\). \(\square\)
By \(V=\sqrt{\frac{\gamma _{0}k}{\gamma _{1}}}\), we achieve the lower bound of \(T_{\mathcal {J}}\):
$$\begin{aligned} \begin{aligned} L_{T_{\mathcal {J}}}&= \gamma _{0}\frac{1}{\sqrt{\frac{\gamma _{0}k}{\gamma _{1}}}}+\frac{\gamma _{1}}{k}\sqrt{\frac{\gamma _{0}k}{\gamma _{1}}}+\gamma _{2}= 2\sqrt{\frac{\gamma _{1}\gamma _{0}}{k}}+\gamma _{2} \end{aligned} \end{aligned}$$
(14)
We achieve the lowest execution time when \(V=\sqrt{\frac{\gamma _{0}k}{\gamma _{1}}}\). Beyond this point, the job unexpectedly performs worse with respect to execution time, as the higher communication cost and other overheads of additional nodes outweigh the reduction in computing time.
Nevertheless, the lower bound holds only under some assumptions as sufficient conditions: (i) Fixed disk throughput, which guarantees the same speed of handling data in disk I/O operations. (ii) Homogeneous cluster resources (including CPU and network), which guarantees the same speed of computing and transferring data. (iii) Identical task processing, i.e., the same task time within a stage, which guarantees precise prediction under vcore scaling. As assumption (iii) often does not hold, we design same-wave-scale prediction (detailed later under practical recommendation) in Algorithm 1 to best approximate linear scaling.
Optimal resource time \(T_R\). Recall the resource time \(T_R = T_{\mathcal {J}}(V) \times (V + \theta M)\) from Sect. 3. When the data size D is given for a workload, M is fixed, as assured memory avoids job failures or performance deterioration. Substituting \(\gamma _{3}\) for \(\theta M\), we obtain \(T_R (V) = \big (\gamma _{0}\frac{1}{V}+\frac{\gamma _{1}}{k}V+\gamma _{2}\big )\times \big (V+\gamma _{3}\big ) = \frac{\gamma _{1}}{k}V^2 + \big (\frac{\gamma _{1}\gamma _{3}}{k}+\gamma _{2}\big )V+\frac{\gamma _{0}\gamma _{3}}{V} + \gamma _{2}\gamma _{3} + \gamma _{0}\).
Lemma 2
Let V be the vcore value in a discrete set of values and \(V>0\), \(T_{R}(V) = \frac{\gamma _{1}}{k}V^2 + \big (\frac{\gamma _{1}\gamma _{3}}{k}+\gamma _{2}\big )V+\frac{\gamma _{0}\gamma _{3}}{V} + \gamma _{2}\gamma _{3} + \gamma _{0}\) is unimodal, where \(\gamma _{0},\gamma _{1},\gamma _{2},\gamma _{3},k>0\).
Proof
Its derivative is \(T_{R}'(V)=2\frac{\gamma _{1}}{k}V + \big (\frac{\gamma _{1}\gamma _{3}}{k}+\gamma _{2}\big )-\frac{\gamma _{0}\gamma _{3}}{V^2}\). It is not trivial to obtain a closed-form solution of \(T_{R}'(V)=0\); instead, we prove that \(T_{R}'(V)=0\) has a unique solution when \(V>0\). The second derivative of \(T_{R}(V)\) is \(T_{R}''(V)=2\frac{\gamma _{1}}{k} + 2\frac{\gamma _{0}\gamma _{3}}{V^3}\). When \(V>0\), \(T_{R}''(V)>0\), meaning that \(T_{R}'(V)\) is monotonic on \((0,\infty )\). Since \(\lim _{V \rightarrow 0^{+}}T_{R}'(V)=-\infty <0\) and \(\lim _{V \rightarrow \infty }T_{R}'(V)=\infty >0\), there exists \(x_0\) such that \(T_{R}'(x_0)=0\), with \(T_{R}'(V)<0\) when \(V<x_0\) and \(T_{R}'(V)>0\) when \(V>x_0\). Hence, \(T_{R}(V)\) is unimodal with a minimum value at \(V=x_0\). \(\square\)
With \(V=x_{0}\), which is the root of \(T_{R}'(V)=0\), we provide the lower bound of \(T_{R}\) as follows:
$$\begin{aligned} \begin{aligned} L_{T_R} = \frac{\gamma _{1}}{k}x_{0}^2 + \Bigg (\frac{\gamma _{1}\gamma _{3}}{k}+\gamma _{2}\Bigg )x_{0}+\frac{\gamma _{0}\gamma _{3}}{x_{0}} + \gamma _{2}\gamma _{3} + \gamma _{0} \end{aligned} \end{aligned}$$
(15)
The lowest resource time is obtained when assigning \(V=x_{0}\), meaning that increasing the vcores beyond \(V\ge x_{0}\) conversely deteriorates the job performance with respect to \(T_{R}\). The reasons are twofold: (i) after passing the turning point, vcores are no longer the bottleneck, so little waiting time remains to be removed from the job computation; (ii) more V requires increasing N, which may lead to higher overhead, for example in a shuffle stage. In production, \(V=x_{0}\) is often insufficient to meet the user's deadline \(T_D\). Ideally, we obtain \(V=V_D\) such that \(T_{\mathcal {J}} \le T_D\), i.e., we obtain the lowest resource time \(T_R(V_D)\) subject to the deadline \(T_D\).
Optimal throughput C We have defined throughput \(C = \frac{D}{T_{\mathcal {J}}}\) in Sect. 3. With \(T_{\mathcal {J}}=\gamma _{0}\frac{1 }{V}+\frac{\gamma _{1}}{k}V+\gamma _{2}\), we achieve \(C(V)=\frac{\gamma _{4}}{\gamma _{0}\frac{1}{V}+\frac{\gamma _{1}}{k}V+\gamma _{2}}\), where \(\gamma _{4}=D\).
Lemma 3
Let V be the vcore value in a discrete set of values and \(V>0\), \(C(V)=\frac{\gamma _{4}}{\gamma _{0}\frac{1}{V}+\frac{\gamma _{1}}{k}V+\gamma _{2}}\) is unimodal, where \(\gamma _{0},\gamma _{1},\gamma _{2},\gamma _{4},k>0\).
Proof
Its derivative is \(C'(V)=\frac{\gamma _{4}k(\gamma _{0}k-\gamma _{1}V^2)}{(\gamma _{0}k+\gamma _{1}V^2+\gamma _{2}kV)^2}\). Given \(C'(V)=0\) and \(V>0\), we obtain a unique root \(V=\sqrt{\frac{\gamma _{0}k}{\gamma _{1}}}\). When \(V<\sqrt{\frac{\gamma _{0}k}{\gamma _{1}}}\) and \(V>\sqrt{\frac{\gamma _{0}k}{\gamma _{1}}}\), \(C'(V)>0\) and \(C'(V)<0\), respectively. Therefore, C(V) is unimodal with a maximum value at \(V=\sqrt{\frac{\gamma _{0}k}{\gamma _{1}}}\). \(\square\)
With \(V=\sqrt{\frac{\gamma _{0}k}{\gamma _{1}}}\), we achieve the upper bound of throughput as follows:
$$\begin{aligned} \begin{aligned} U_{C} = \frac{\gamma _{4}}{2\sqrt{\frac{\gamma _{0}\gamma _{1}}{k}} + \gamma _{2}} \end{aligned} \end{aligned}$$
(16)
We achieve the highest throughput when we assign \(V=\sqrt{\frac{\gamma _{0}k}{\gamma _{1}}}\). In other words, if we assign more vcores than that, the runtime of the job increases, which negatively affects the throughput, as throughput is inversely correlated with runtime. Figure 4 graphically summarizes the performance bottlenecks as the number of vcores changes.
Practical recommendation Theoretically, iterative algorithms (e.g., Golden section search [40]) could be directly adopted to compute the optimal value of a unimodal function. However, in our context, the number of vcores is not continuous, and such iterations require reshaping the sampling probability, which may lead to (i) extra work on reshaping the input data, (ii) different runtime performance because of the variety of partitions, and (iii) the third assumption not holding in practice, as it requires the tasks to be identical. Furthermore, the parallel tasks in a stage can be scheduled in more than one wave, and the performance of the parallel tasks in the first wave differs distinctly from subsequent waves due to task scheduling preparation overheads [44, 50]. Therefore, we design experiments with a fixed partition size per vcore, ensuring the same number of waves. This enables us to predict the performance of jobs with larger datasets and more vcores relatively accurately, because the prediction is based on simulations with the same number of waves. Consequently, we propose our own search algorithm (Algorithm 1) to find the near-optimal vcore configuration. Before introducing the algorithm, we provide a supporting lemma on unimodal functions that later guarantees our search algorithm stops properly.
Lemma 4
Given a unimodal function \(f(\cdot )\) with a minimum, a vcore domain \(\mathcal {D}_V\), and \((d+1)\) parameter values \(V=\{v_0,v_1,\ldots ,v_d\}\) (for \(v_i\), \(v_j\) \(\in\) V, if \(i < j\) then \(v_i < v_j\)) where \(V \subseteq \mathcal {D}_V\), if \(v_k = arg\,\min _{v \in V} f(v)\), a range search function \(RS(\cdot )\) returns a subrange \(\mathcal {D}'_V\) such that:
$$\begin{aligned} v_{opt} \in \mathcal {D}'_V {\left\{ \begin{array}{ll} [v_{k}, v_{k+1}) &{} \quad k = 0 \\ (v_{k-1}, v_{k}] &{} \quad k = d \\ (v_{k-1}, v_{k+1}) &{} otherwise \end{array}\right. } \end{aligned}$$
(17)
where \(v_{opt}\) is the optimum of \(f(\cdot )\) and \(\mathcal {D}'_V\) is the smallest known range containing \(v_{opt}\).
Proof
\(f(\cdot )\) is unimodal with a minimum, meaning that \(f'(v)<0\) for \(v<v_{opt}\) and \(f'(v)>0\) for \(v>v_{opt}\). For \(k\ne 0,d\), \(v_k = arg\,\min _{v \in V} f(v)\) means \(f(v_k)<f(v_{k-1})\) and \(f(v_k)<f(v_{k+1})\), indicating \(f'(v_{k-1})<0\) and \(f'(v_{k+1})>0\). Hence, \(f(v)>f(v_{k-1})\) for \(v<v_{k-1}\) and \(f(v)>f(v_{k+1})\) for \(v>v_{k+1}\), meaning that \(v_{opt} \in (v_{k-1}, v_{k+1})\). The cases of \(k=0\) and \(k=d\) follow similarly. \(\square\)
Algorithm 1 is an iterative range search method that narrows down the range to a (near-)optimal value. For simplicity, we assume the problem is to find a local minimizer; the maximization of C(V) can be transformed into a minimization problem without loss of generality. In general, if we have not reached the stopping step size (line 3), we split the vcore domain into parameter values by the grading degree d (line 4) and split the data into sample partitions (line 5). We then predict the cost for each vcore parameter (line 8) from the time cost of the sample run (line 7). The vcore with the minimum predicted cost is then selected (lines 9-11). We repeat these steps until the stopping criterion is reached (lines 12-13).
We apply iterativeRS in Algorithm 1 to \(T_{\mathcal {J}}(v)\), \(T_{R}(v)\), and \(\frac{1}{C(v)}\). By iteratively executing \(RS(\cdot )\), we obtain \(v_{opt}\) within at most \(log_d(2|\mathcal {D}|)\) rounds, where \(|\mathcal {D}|\) is the size of the configuration space. Given the parameter range d for a single search and the tolerated error step size \(\epsilon\), the stopping criterion is \(\big \lceil \frac{|\mathcal {D}|}{d}\big \rceil \le \epsilon\).
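A hedged sketch of this iterative range search follows; `predicted_cost` stands for the metric predicted by the cost model from a sample run at each candidate vcore value (the sampling details of lines 5-7 in Algorithm 1 are abstracted away here):

```python
def iterative_range_search(predicted_cost, v_max, d=5, eps=5):
    """Iteratively narrow the vcore range around the minimum of a unimodal
    metric (Lemma 4), stopping once ceil(range / d) <= eps."""
    lo, hi = 0, v_max
    while True:
        step = -(-(hi - lo) // d)                    # ceil((hi - lo) / d)
        candidates = [min(lo + i * step, hi) for i in range(d + 1)]
        costs = [predicted_cost(v) if v > 0 else float("inf") for v in candidates]
        k = costs.index(min(costs))                  # best candidate this round
        best = candidates[k]
        if step <= eps:
            return best
        lo = candidates[k - 1] if k > 0 else candidates[0]
        hi = candidates[k + 1] if k < d else candidates[d]
```

For the Kmeans example discussed in Sect. 5.3 (v_max = 150, d = 5, \(\epsilon\) = 5), this sketch narrows [0, 150] to [0, 60] to [0, 24] and returns v = 15.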
Theorem 1
Iterative \(RS(\cdot )\) finds an \(\epsilon\)-optimal solution within \(\Big \lceil \frac{\ln { \left( D+\frac{\ln {D}-\ln {\epsilon }}{\ln {d}-\ln {2}}\right) }-\ln {\epsilon }}{\ln {d}-\ln {2}} \Big \rceil\) rounds.
Proof
Let n denote the number of rounds needed to run \(RS(\cdot )\). In the worst case, we need to round up and round down n times for the upper and lower bounds, as v is required to be an integer, so the range is considered to be \((D+2n)\). For each search over d subranges (line 4 of Algorithm 1), we keep 2 out of d ranges in the worst case, meaning the shrinking rate is \(\frac{2}{d}\). Given a tolerated error step size \(\epsilon\), we have \((D+2n)/\big (\frac{d}{2}\big )^n \le \epsilon\). By loosening the inequality, we have:
$$\begin{aligned} \begin{aligned} \Big (\frac{d}{2}\Big )^n&\ge \frac{D+2n}{\epsilon } \ge \frac{D}{\epsilon }\\ n \ln {\frac{d}{2}}&\ge \ln {\frac{D}{\epsilon }}\\ n&\ge \frac{\ln {D}-\ln {\epsilon }}{\ln {d}-\ln {2}}\\ \end{aligned} \end{aligned}$$
(18)
We next find the upper bound of n:
$$\begin{aligned} \begin{aligned} \Big (\frac{2}{d}\Big )^n&\le \frac{\epsilon }{D+2n} \le \frac{\epsilon }{D+\frac{\ln {D}-\ln {\epsilon }}{\ln {d}-\ln {2}}}\\ n \ln {\frac{2}{d}}&\le \ln {\frac{\epsilon }{D+\frac{\ln {D}-\ln {\epsilon }}{\ln {d}-\ln {2}}}}\\ n&\le \frac{\ln {\epsilon }-\ln { \big (D+\frac{\ln {D}-\ln {\epsilon }}{\ln {d}-\ln {2}}\big ) }}{\ln {2}-\ln {d}}\\ \end{aligned} \end{aligned}$$
(19)
Therefore, \(RS(\cdot )\) finds an \(\epsilon\)-optimal solution within \(\Big \lceil \frac{\ln { (D+\frac{\ln {D}-\ln {\epsilon }}{\ln {d}-\ln {2}}) }-\ln {\epsilon }}{\ln {d}-\ln {2}}\Big \rceil\) rounds. \(\square\)
Example 2
We show an example of predicting the performance of Kmeans clustering in Fig. 5. The details of the workload are described in Table 4. In this case, we run Kmeans with 5 data partitions and vcore parameters \(V=\{1,2,3,4,5\}\) to predict the Kmeans performance with 150 partitions and vcore parameters \(V=\{30,60,90,120,150\}\). \(V=60\) satisfies the given deadline \(T=100s\) with the minimum resource time. The predicted \(V=60\) matches the best configuration among the parameters of the actual runs. If there were no deadline requirement, both the prediction and the actual run would select \(V=30\) as the best configuration regarding minimum resource time. If the error gap \(\epsilon =30v\) is too big and we are interested in a smaller resource time, we repeat the same process in a smaller configuration range, i.e., [1v, 60v].
Table 4
Baseline resource and deadline data for workloads

Benchmark | Workload    | Data size | Deadline (s) | Baseline vcore and memory
Micro     | Wordcount   | 91.8 GB   | 200          | 150 vcores 150 GB
          | Sort        | 9.55 GB   | 300          | 150 vcores 150 GB
          | TeraSort    | 9.55 GB   | 400          | 150 vcores 150 GB
ML        | K-means     | 11.2 GB   | 100          | 150 vcores 150 GB
          | Bayesian    | 14.0 GB   | 200          | 150 vcores 150 GB
          | SVM         | 26.8 GB   | 150          | 150 vcores 150 GB
          | LR          | 15.2 GB   | 200          | 150 vcores 150 GB
          | LDA         | 246.2 MB  | 100          | 150 vcores 150 GB
          | PCA         | 30.7 MB   | 350          | 150 vcores 150 GB
SQL       | Join        | 17.3 GB   | 200          | 150 vcores 150 GB
          | Aggregation | 17.3 GB   | 200          | 150 vcores 150 GB
          | Scan        | 17.3 GB   | 200          | 150 vcores 150 GB

5 Evaluation

In this section, we first evaluate SimCost’s prediction model, and then evaluate SimCost’s parameter recommendation model based on performance metrics.

5.1 Experimental setup

We evaluate the performance of HiBench workloads in a Spark cluster managed by Yarn. The cluster has 10 resource nodes, each with 32GB of RAM and 2 Intel Xeon E5540 2.53GHz CPUs. Each CPU has four physical cores, so one node has a total of 8 cores. From the view of the Yarn resource manager, we have 10 slave nodes, each with 16 vcores (by hyperthreading) and 32GB memory. We also set up five data nodes for HDFS storage. All the nodes have two bonded 10Gbit/s network interfaces. The versions of Hadoop, Spark, and HiBench [27] are 2.7.3, 2.1.1, and 7.1, respectively. All experiments are run with CGroups [14] enforcement on Yarn, so the performance of vcore and memory allocations is expected to be isolated. We maximize the executor size in our evaluation, as fewer executors on one node often yield better performance [47].
In Table 4, we describe the workloads with their deadlines, total resource baseline, and executor size baseline. Note that HDFS replicates three copies by default and stores the data in a fixed number of files. In our case, the number of files is 30, where each file is split into 5 partitions, giving us the chance to sample \(p=1/30\) or \(p=1/15\) of the data independently.

5.2 Prediction

We next evaluate SimCost’s cost prediction model for a number of benchmark workloads. We first interpret memory assurance. We then predict the vcores using simulation results. We detail the evaluation of memory and vcore prediction procedures in the following.
Memory assurance We predict the memory parameter value of memory-intensive workloads, e.g., Kmeans, Bayes, and SVM. Their memory requirements vary based on the job characteristics and input data size. It would be less interesting to discuss Micro and SQL workloads in this part as they are one-pass and do not need to cache the data.
In Table 5, we set up three sampling probabilities, \(p=1/30\), \(p=1/15\), and \(p=1/10\). For Kmeans, for example, we obtain 0.6GB, 1.2GB, and 1.8GB, respectively, as the proper memory values assured by finding the elbow points. We then regress to the probability \(p=1\) and predict that 18GB of cache memory is needed for the job with the original data. Figure 6 shows the actual runtime and resource time of the jobs with their actual input data sizes. We pick the memory for the actual input size of the jobs as directed by the elbow points of the corresponding plots. The predicted values match these selected values, which yield near-optimal resource times. We can now compute the assigned executor memory.
Table 5
ML: memory prediction by different sample sizes and picked best parameter values

Workload | Sample p | Best \(T_R\) cache (GB) | Predicted cache (GB) | Recommended (GB)
Kmeans   | 1/10     | 1.8                     | 18                   | 36
         | 1/15     | 1.2                     |                      |
         | 1/30     | 0.6                     |                      |
Bayesian | 1/10     | 2.7                     | 27                   | 54
         | 1/15     | 1.8                     |                      |
         | 1/30     | 0.9                     |                      |
SVM      | 1/10     | 1.8                     | 18                   | 36
         | 1/15     | 1.2                     |                      |
         | 1/30     | 0.6                     |                      |
Performance prediction Table 4 describes the 12 benchmark workloads in 3 categories with their data sizes. Figure 7 shows the prediction results for these workloads. The dashed curves present the predicted performance and the solid curves represent the actual runs. We show the prediction interval (min and max) as a shaded area.
For most of the workloads, the predictions are quite satisfactory, as our cost model captures most of the jobs' characteristics from the sample simulations. For example, the Bayesian (Bayes) workload gains more improvement on small samples, and the actual runs perform similarly. Unfortunately, a few of the actual runs behaved contrary to the prediction. Sort and Terasort, for example, perform steadily even with more vcores. This is because these two workloads are mainly bounded by disk or network I/O in every stage of the job, and adding more vcores does not reduce the I/O waiting time; moreover, more threads competing for limited I/O resources actually slow the jobs down. Similarly, Bayes and Join performed worse than predicted across all vcore settings, as shown in the figure. Our observation is that the overhead has been underestimated, as I/O context switching or waiting may affect the I/O itself or even the computation.
Besides, some prediction results lie within a large shaded interval, e.g., LR and PCA. This implies that the prediction range is larger, mainly due to the skewness of the data distribution or performance deviation. We show later that LR and PCA need more simulation cost to achieve stable results.
Prediction precision In this part, we use the ratio of the predicted runtime to the actual runtime as the precision metric. Figure 8 presents the prediction precision for the 12 workloads. The prediction error is under 20% for every individual workload, and the average prediction error is within 5%. In production, this error does not affect the decision-making on choosing the proper vcore configuration.
SimCost is further compared to Ernest [48]. In general, for ML workloads, we predict vcore performance as well as Ernest does; both ensure sufficient memory and predict vcore performance for CPU-intensive workloads. However, Ernest does not perform well on Micro and SQL workloads, because it does not consider the network and disk bottlenecks as data and resources scale.
Training cost A confidence interval (CI) is specified to stop the simulation or training. We can balance the trade-off between the training cost and the confidence of the prediction via the stopping criterion (defined in Eq. 6). Figure 9, with the Sort workload as an example, depicts the training cost ratio (introduced in Eq. 7) for each performance prediction and CI. For instance, a CI of 80% is obtained with at least four rounds of training, which incurs less than 5% of the training cost of the actual run. Similarly, less than 10% of the training cost is required for a 90% CI. Worth mentioning, the storage cost is negligible, as the REST API collects INFO-level logs, which are far smaller than the data size.
Table 6 shows the training cost ratio for the 12 workloads. Given an 80% CI, the accumulated training costs are less than 10% for all the workloads except LR. The large deviation of the LR runtime in our experiment may stem from the skewness of the data distribution and subtle system changes. Likewise, with a 90% CI, the accumulated training costs are less than 20%, except for LR and PCA. To sum up, the average accumulated training costs are 7.92% and 13.64% of one round of actual runs for CI = 80% and CI = 90%, respectively.
Table 6
Evaluation of benchmark workloads in terms of resource time and training cost ratio for total parameter recommendation based on different confidence intervals (CI)

Type  | Workload    | Recommendation | \(T_R\) speedup (%) | CI = 80% (%) | CI = 90% (%)
Micro | Wordcount   | 60V18G         | 217                 | 4.11         | 4.47
      | Sort        | 30V18G         | 492                 | 1.99         | 2.37
      | TeraSort    | 30V18G         | 536                 | 1.86         | 2.37
ML    | Kmeans      | 60V36G         | 229                 | 5.89         | 6.51
      | Bayesian    | 90V54G         | 161                 | 7.91         | 16.89
      | SVM         | 30V36G         | 440                 | 5.30         | 5.30
      | LR          | 30V18G         | 644                 | 34.83        | 54.77
      | LDA         | 30V18G         | 657                 | 7.6          | 11.05
      | PCA         | 30V18G         | 499                 | 9.36         | 41.80
SQL   | Join        | 30V18G         | 360                 | 4.99         | 5.22
      | Aggregation | 30V18G         | 335                 | 5.33         | 6.06
      | Scan        | 30V18G         | 423                 | 5.74         | 6.83

5.3 Parameter recommendation

Table 6 shows the performance speedup achieved by our parameter recommendations. We calculate the speedup in \(T_R\) by dividing the resource time of the baseline configuration by that of the recommended one. We obtain up to 657% \(T_R\) speedup compared to the baseline configurations of the benchmark workloads listed in Table 4. The recommendation not only satisfies the deadline requirements defined by users but also obtains significant savings in resources and resource time.
To further illustrate the behavior of Algorithm 1, i.e., iterativeRS, we show the detailed process for the Kmeans workload. Given the step size \(\epsilon =5\) of the stopping criterion, the vcore space \(D=150\), and the grading degree \(d=5\), Theorem 1 tells us that, in the worst case, it requires 4 range searches (\(=\lceil 3.742\rceil\)). Figure 10a shows the three search steps of finding the proper vcore size for the Kmeans workload. Initializing with \(\mathcal {D}_V=[0,150]\), \(\epsilon =5\), and \(d=5\), we cut \(\mathcal {D}_V\) into \(V=\{0,30,60,90,120,150\}\) and initialize \(T_R\) to infinity at \(v=0\). In the first round of the search, \(v=30\) is the parameter value with the minimum \(T_R\) by Eq. (2). Since \(\big \lceil \frac{D=150}{d=5}\big \rceil =30>\epsilon =5\), we continue the search with \(\mathcal {D}_V=[0,60]\). Likewise, \(v=12\) attains the minimum \(T_R\) among all the options. Again, since \(\frac{60}{5}>\epsilon\), we continue the search with \(\mathcal {D}_V=[0,24]\) and get \(V=\{0,5,10,15,20,25\}\) by rounding up. This time, \(v=15\) is the parameter value with the minimum \(T_R\). Now, since \(\big \lceil \frac{24}{5}\big \rceil \le \epsilon\), the stopping criterion is met and we return \(v=15\). In Table 6, we have shown that the assured memory for the Kmeans workload is 36 G. Hence, the recommended configuration (v = 15, m = 36) is \(\epsilon\)-optimal, as shown by the actual runs in Fig. 10b.

5.4 Evaluation summary

We highlight our main findings in the experiments as follows:
  • The proposed cost model predicts the performance of all benchmark workloads accurately (error rate within 20%, mostly around 5%).
  • The proposed simulation requires a low training cost compared to the actual run (average training cost ratios of 7.92% and 13.64% for 80% and 90% confidence intervals, respectively).
  • iterativeRS finds a near-optimal configuration for the target metric within a few search rounds.

6 Related work

A profiler captures job execution information at fine granularity and supports job monitoring and tuning. Wu and Gokhale [52] profile optimal configuration parameters for Hadoop workloads and recommend parameters for a new job based on a similar workload's configuration. Several studies profiled Hadoop's individual phases, e.g., Map, Reduce, and Shuffle, to optimize job performance [20, 42]. Chiba and Onodera [13] profiled JVM performance to optimize TPC-H benchmarks on Spark.
There exist several performance prediction techniques [9, 21, 24, 37] that rely on machine learning models (e.g., [2, 10, 18, 48, 54]) or cost models (e.g., [6, 20, 26, 34, 43, 44, 50]). The machine learning approaches are black-box methods that require correctly identifying the features and algorithms that affect job performance and characteristics. In contrast, cost-based analytical techniques are white-box methods that require deep insight into the jobs, data, and frameworks. Irrespective of the technique, most of them require historical logs from previous executions or assume that jobs are executed repeatedly. Moreover, the training cost of machine learning approaches can be very expensive for new jobs. In this paper, we propose a simulation-based cost prediction model that is more detailed and requires a lower training cost to achieve reliable predictions than other cost models (e.g., [48, 50]).
Job performance metrics enable the evaluation of recommendation models. Execution time is a commonly used metric and can reasonably be used to evaluate job performance in a cluster with a fixed amount of resources; it is indeed a meaningful metric for realistic scheduling [28, 29]. However, assigning a huge amount of resources to gain a negligible improvement is not cost-effective, and it has been shown that simply applying excessive resources yields a poor benefit or even degrades performance due to overheads [18, 47]. Therefore, we introduce resource time (in Sect. 3) as one of our metrics, representing both resource utilization and resource cost [25]. It is a helpful indicator when renting pay-as-you-go services and can assist in making cost-effective decisions on vcore and memory configuration.
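As a rough illustration only (the precise definition is given in Sect. 3, and the formula here is an assumption for exposition), resource time can be thought of as allocated capacity integrated over the runtime, e.g., in vcore-seconds:

```latex
\[
T_R \;\approx\; v \cdot t \quad \text{(vcore-seconds)}, \qquad
60~\text{vcores} \times 100~\text{s} = 6000
\;>\; 30~\text{vcores} \times 150~\text{s} = 4500 ,
\]
```

that is, doubling the allocated vcores is cost-effective only if it more than halves the runtime.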
Job performance tuning also plays an essential role in optimizing the performance of big data analytical platforms. For Hadoop, such tuning parameters include the number of mappers and reducers, the number of splits, etc. [23, 43]. Gounaris et al. [17] evaluated the effect of the number of partitions in Spark. Ahsan et al. [3] reported that JVM garbage collection (GC) overhead affects job performance for large input data sizes. In this paper, we focus on the executor parameters for vcores and memory in order to minimize the resource cost or optimize the resource allocation [7]. Such cost-related parameters are valuable for users who run big data analytics either on a cloud with a pay-as-you-go pricing model, such as Amazon, or on a cluster in a multi-tenant manner [51].

7 Discussion, conclusion, and future work

We informally label our workloads' characteristics based on our observations and experiments in Sect. 2. If readers cannot find the same workload, they may consider the defined characteristics and select the most similar representative workload. Such characteristics can help classify unknown workloads and can be described formally. For example, we can define a workload as a CPU-intensive job if computation occupies more than 50% of the execution time, and as a massive-shuffle job if it spends more than 50% of the execution time on shuffling. The ratios of computation time, read/write time, and shuffle time are reliable indicators for users to set up or configure cluster resources beforehand, and they may also assist a scheduler in assigning bottleneck resources accordingly; a sketch of such a rule-based labelling follows.
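The sketch below applies the informal 50% thresholds mentioned above; the function name and the thresholds are illustrative assumptions, not the paper's formal classifier.

```python
def classify_workload(compute_s: float, shuffle_s: float, total_s: float) -> str:
    """Label a workload from its execution-time breakdown, using the
    informal 50% thresholds discussed in the text (illustrative only)."""
    if compute_s / total_s > 0.5:
        return "CPU-intensive"
    if shuffle_s / total_s > 0.5:
        return "massive shuffle"
    return "mixed / other"

# Example: a job spending 70 of its 100 s on computation.
print(classify_workload(compute_s=70, shuffle_s=20, total_s=100))  # CPU-intensive
```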
This work predicts and recommends the total amount of vcore and memory resources. In our evaluation configuration, we maximize the executor size, i.e., the number of vcores and the amount of memory per executor, because fewer executors per node often yield better performance by reducing data exchange among executors [47]. Such a configuration is therefore suitable for computation- and memory-intensive jobs. However, it cannot always be guaranteed to perform better in a production environment: during the evaluation, we found that a larger number of smaller executors boosts the performance of I/O-intensive jobs, because the scheduler by default distributes the executors over more machines (disks). The sketch below contrasts the two layouts.
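As a concrete, hedged illustration, the PySpark snippet below maps an example total allocation of 30 vcores and 36 GB (the SVM recommendation in Table 6) onto either a few large executors or many small ones; the instance counts and per-executor sizes are illustrative choices, not the exact layout used in our evaluation.

```python
from pyspark.sql import SparkSession

# Option A: few large executors (favours compute/memory-intensive jobs,
# since data exchange among executors is reduced).
spark = (SparkSession.builder
         .appName("svm-large-executors")
         .config("spark.executor.instances", "2")
         .config("spark.executor.cores", "15")       # 2 x 15 = 30 vcores
         .config("spark.executor.memory", "18g")     # 2 x 18 GB = 36 GB
         .getOrCreate())

# Option B: many small executors (spreads I/O over more machines/disks),
# shown commented out because one process holds a single SparkSession.
# spark = (SparkSession.builder
#          .appName("svm-small-executors")
#          .config("spark.executor.instances", "10")
#          .config("spark.executor.cores", "3")      # 10 x 3 = 30 vcores
#          .config("spark.executor.memory", "3600m") # 10 x 3.6 GB = 36 GB
#          .getOrCreate())
```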
In summary, our proposed SimCost utilizes simulation techniques to accurately predict job performance while keeping training costs low. Specifically, we employ Monte Carlo simulation to leverage a small amount of data and resources for reliable predictions on larger datasets and clusters. Notably, our method achieves high accuracy despite the low training costs. Empirical experiments with 12 benchmark workloads demonstrate that our cost model yields less than 5% error on average prediction accuracy and can result in up to 6x resource cost savings through its recommendations.
Our framework is effective in general cases, but the prediction accuracy varies slightly across workloads and scenarios. In future work, we would like to explore more workloads, e.g., deep learning algorithms, and to model the contention for non-configurable resources (e.g., disk and network) when multiple jobs run in parallel.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Footnotes
1. Heuristic data based on our experiments.
Literature
2. Aken, D.V., Pavlo, A., Gordon, G.J., Zhang, B.: Automatic database management system tuning through large-scale machine learning. In: SIGMOD Conference, pp. 1009–1024. ACM (2017)
4. Bao, L., Liu, X., Chen, W.: Learning-based automatic parameter tuning for big data analytics frameworks. In: BigData, pp. 181–190. IEEE (2018)
5. Binder, K.: Monte Carlo Simulations in Statistical Physics. Encyclopedia of Complexity and Systems Science, pp. 5667–5677. Springer, New York (2009)
6. Bruno, N., Jain, S., Zhou, J.: Continuous cloud-scale query optimization and processing. Proc. VLDB Endow. 6(11), 961–972 (2013)
7. Chaisiri, S., Lee, B., Niyato, D.: Optimization of resource provisioning cost in cloud computing. IEEE Trans. Serv. Comput. 5(2), 164–177 (2012)
8. Chen, K., Powers, J., Guo, S., Tian, F.: CRESP: towards optimal resource provisioning for mapreduce computing in public clouds. IEEE Trans. Parallel Distrib. Syst. 25(6), 1403–1412 (2014)
9. Chen, Y.: Performance tuning and query optimization for big data management (2021)
10. Chen, Y., Goetsch, P., Hoque, M.A., Lu, J., Tarkoma, S.: d-simplexed: adaptive delaunay triangulation for performance modeling and prediction on big data analytics. IEEE Trans. Big Data (2019)
11. Chen, Y., Lu, J., Chen, C., Hoque, M., Tarkoma, S.: Cost-effective resource provisioning for spark workloads. In: CIKM, pp. 2477–2480. ACM (2019)
12. Cheng, D., Zhou, X., Xu, Y., Liu, L., Jiang, C.: Deadline-aware mapreduce job scheduling with dynamic resource availability. IEEE Trans. Parallel Distrib. Syst. 30(4), 814–826 (2019)
13. Chiba, T., Onodera, T.: Workload characterization and optimization of TPC-H queries on apache spark. In: ISPASS, pp. 112–121. IEEE Computer Society (2016)
14. Genkin, M., Dehne, F., Pospelova, M., Chen, Y., Navarro, P.: Automatic, on-line tuning of YARN container memory and CPU parameters. In: HPCC/SmartCity/DSS, pp. 317–324. IEEE Computer Society (2016)
16. Ghodsi, A., Zaharia, M., Hindman, B., Konwinski, A., Shenker, S., Stoica, I.: Dominant resource fairness: fair allocation of multiple resource types. In: NSDI. USENIX Association (2011)
17. Gounaris, A., Kougka, G., Tous, R., Montes, C.T., Torres, J.: Dynamic configuration of partitioning in spark applications. IEEE Trans. Parallel Distrib. Syst. 28(7), 1891–1904 (2017)
18. Gunther, N.J., Puglia, P., Tomasette, K.: Hadoop superlinear scalability. ACM Queue 13(5), 20 (2015)
19. Hernández, Á.B., Perez, M.S., Gupta, S., Muntés-Mulero, V.: Using machine learning to optimize parallelism in big data applications. Future Gener. Comput. Syst. 86, 1076–1092 (2018)
20. Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of mapreduce programs. PVLDB 4(11), 1111–1122 (2011)
21. Herodotou, H., Chen, Y., Lu, J.: A survey on automatic parameter tuning for big data processing systems. ACM Comput. Surv. 53(2), 43:1-43:37 (2020)
22. Herodotou, H., Dong, F., Babu, S.: No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics. In: SoCC, p. 18. ACM (2011)
23. Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F.B., Babu, S.: Starfish: a self-tuning system for big data analytics. In: CIDR, pp. 261–272. www.cidrdb.org (2011)
24. Herodotou, H., Odysseos, L., Chen, Y., Lu, J.: Automatic performance tuning for distributed data stream processing systems. In: ICDE, pp. 3194–3197. IEEE (2022)
25. Huang, B., Babu, S., Yang, J.: Cumulon: optimizing statistical data analysis in the cloud. In: SIGMOD Conference, pp. 1–12. ACM (2013)
26. Huang, B., Boehm, M., Tian, Y., Reinwald, B., Tatikonda, S., Reiss, F.R.: Resource elasticity for large-scale machine learning. In: SIGMOD Conference, pp. 137–152. ACM (2015)
27. Huang, S., Huang, J., Dai, J., Xie, T., Huang, B.: The hibench benchmark suite: characterization of the mapreduce-based data analysis. In: ICDE Workshops, pp. 41–51. IEEE Computer Society (2010)
28. Huang, Z., Balasubramanian, B., Wang, M., Lan, T., Chiang, M., Tsang, D.H.K.: Need for speed: CORA scheduler for optimizing completion-times in the cloud. In: INFOCOM, pp. 891–899. IEEE (2015)
29. Huang, Z., Weinberg, S.M., Zheng, L., Joe-Wong, C., Chiang, M.: Discovering valuations and enforcing truthfulness in a deadline-aware scheduler. In: INFOCOM, pp. 1–9. IEEE (2017)
30. Hurst, S.: The characteristic function of the student t distribution. Research report, Centre for Mathematics and its Applications, Canberra (1995)
31. Jia, Z., Xue, C., Chen, G., Zhan, J., Zhang, L., Lin, Y., Hofstee, P.: Auto-tuning spark big data workloads on POWER8: prediction-based dynamic SMT threading. In: PACT, pp. 387–400. ACM (2016)
32. Ketchen, D.J., Shook, C.L.: The application of cluster analysis in strategic management research: an analysis and critique. Strateg. Manag. J. 17(6), 441–458 (1996)
34. Li, B., Mazur, E., Diao, Y., McGregor, A., Shenoy, P.J.: A platform for scalable one-pass analytics using mapreduce. In: SIGMOD Conference, pp. 985–996. ACM (2011)
35. Li, M., Zeng, L., Meng, S., Tan, J., Zhang, L., Butt, A.R., Fuller, N.C.: MRONLINE: mapreduce online performance tuning. In: HPDC, pp. 165–176. ACM (2014)
36. Li, Y.L., Dong, J.: Study and improvement of mapreduce based on hadoop. Comput. Eng. Des. 33(8), 3110–3116 (2012)
37. Lu, J., Chen, Y., Herodotou, H., Babu, S.: Speedup your analytics: automatic parameter tuning for databases and big data systems. Proc. VLDB Endow. 12(12), 1970–1973 (2019)
38. Nair, V., Menzies, T., Siegmund, N., Apel, S.: Using bad learners to find good configurations. In: ESEC/SIGSOFT FSE, pp. 257–267. ACM (2017)
39. Ousterhout, K., Rasti, R., Ratnasamy, S., Shenker, S., Chun, B.: Making sense of performance in data analytics frameworks. In: NSDI, pp. 293–307. USENIX Association (2015)
40. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Golden section search in one dimension. Numerical Recipes in C: The Art of Scientific Computing (1992)
41. Royall, R.M.: On finite population sampling theory under certain linear regression models. Biometrika 57(2), 377–387 (1970)
42. Shi, J., Qiu, Y., Minhas, U.F., Jiao, L., Wang, C., Reinwald, B., Özcan, F.: Clash of the titans: mapreduce vs. spark for large scale data analytics. PVLDB 8(13), 2110–2121 (2015)
43. Shi, J., Zou, J., Lu, J., Cao, Z., Li, S., Wang, C.: Mrtuner: a toolkit to enable holistic optimization for mapreduce jobs. PVLDB 7(13), 1319–1330 (2014)
44. Singhal, R., Singh, P.: Performance assurance model for applications on SPARK platform. In: TPCTC, Lecture Notes in Computer Science, vol. 10661, pp. 131–146. Springer (2017)
45. Soror, A.A., Minhas, U.F., Aboulnaga, A., Salem, K., Kokosielis, P., Kamath, S.: Automatic virtual machine configuration for database workloads. In: SIGMOD Conference, pp. 953–966. ACM (2008)
46. Tan, J., Zhang, T., Li, F., Chen, J., Zheng, Q., Zhang, P., Qiao, H., Shi, Y., Cao, W., Zhang, R.: ibtune: individualized buffer tuning for large-scale cloud databases. PVLDB 12(10), 1221–1234 (2019)
47. Tous, R., Gounaris, A., Tripiana, C., Torres, J., Girona, S., Ayguadé, E., Labarta, J., Becerra, Y., Carrera, D., Valero, M.: Spark deployment and performance evaluation on the marenostrum supercomputer. In: Big Data, pp. 299–306. IEEE (2015)
48. Venkataraman, S., Yang, Z., Franklin, M.J., Recht, B., Stoica, I.: Ernest: efficient performance prediction for large-scale advanced analytics. In: NSDI, pp. 363–378. USENIX Association (2016)
49. Wang, G., Xu, J., He, B.: A novel method for tuning configuration parameters of spark based on machine learning. In: HPCC/SmartCity/DSS, pp. 586–593. IEEE (2016)
50. Wang, K., Khan, M.M.H.: Performance prediction for apache spark platform. In: HPCC/CSS/ICESS, pp. 166–173. IEEE (2015)
51. Wirtz, T., Ge, R.: Improving mapreduce energy efficiency for computation intensive workloads. In: IGCC, pp. 1–8. IEEE Computer Society (2011)
52. Wu, D., Gokhale, A.S.: A self-tuning system based on application profiling and performance analysis for optimizing hadoop mapreduce cluster configuration. In: HiPC, pp. 89–98. IEEE Computer Society (2013)
53. Ye, T., Kalyanaraman, S.: A recursive random search algorithm for large-scale network parameter configuration. In: SIGMETRICS, pp. 196–205. ACM (2003)
54. Yigitbasi, N., Willke, T.L., Liao, G., Epema, D.H.J.: Towards machine learning-based auto-tuning of mapreduce. In: MASCOTS, pp. 11–20. IEEE Computer Society (2013)
55. Yu, Z., Bei, Z., Qian, X.: Datasize-aware high dimensional configurations auto-tuning of in-memory cluster computing. In: ASPLOS, pp. 564–577. ACM (2018)
56. Zhang, J., Liu, Y., Zhou, K., Li, G., Xiao, Z., Cheng, B., Xing, J., Wang, Y., Cheng, T., Liu, L., Ran, M., Li, Z.: An end-to-end automatic cloud database tuning system using deep reinforcement learning. In: SIGMOD Conference, pp. 415–432. ACM (2019)
57. Zhu, Y., Liu, J., Guo, M., Bao, Y., Ma, W., Liu, Z., Song, K., Yang, Y.: BestConfig: tapping the performance potential of systems via automatic configuration tuning. In: SoCC, pp. 338–350. ACM (2017)