Introduction
Objectives and contributions
Document organization
Concepts and definitions
Singh and Chana
Jennings and Stadler
Manvi and Shyam
-
Provisioning: Assignment of resources to a workload.
-
Allocation: Distribution of resources among competing workloads.
-
Adaptation: Ability to dynamically adjust resources to fulfill workload requirements.
-
Mapping: Correspondence between resources required by the workload and resources provided by the cloud infrastructure.
-
Modeling: Framework that helps to predict the resource requirements of a workload by representing the most important attributes of resource management, such as states, transitions, inputs, and outputs within a given environment.
-
Estimation: Guess of the actual resources required for executing a workload.
-
Discovery: Identification of a list of resources that are available for workload execution.
-
Brokering: Negotiation of resources through an agent to ensure their availability at the right time to execute the workload.
-
Scheduling: A timetable of events and resources, determining when a workload should start or end depending on its duration, predecessor activities, predecessor relationships, and resources allocated.
Other definitions
Intercorrelation and consolidation
Work | Summary | Actors | QoS | SLA |
---|---|---|---|---|
[108] | Three phases: provisioning, scheduling, monitoring | Cloud provider (infra. and workload mgmt.) and cloud consumer (end user) | Yes | Yes |
[52] | Organized in three tiers (one per role) and a total of 15 different stages, including scheduling, provisioning, pricing, profiling, and monitoring. | Cloud provider (infra. mgmt.), cloud user (broker), end user (execute workload) | Yes | Yes |
[72] | Nine components: provisioning, allocation, adaptation, mapping, modeling, estimation, discovery, brokering, and scheduling | No specific actors are identified; implicit assumption of at least two roles: cloud provider and cloud user | Yes | No |
[80] | Two tasks: procurement and release of resources. Two objectives: performance isolation and efficient use of hardware. Seven metrics: energy, SLA, load, network load, profit, hybrid clouds, and mobile clouds. | No specific actors are identified; implicit assumption of at least two roles: cloud provider and cloud user | No | Yes |
[75] | Three aspects affected: performance, functionality, and cost. Five policy classes: admission control, capacity allocation, load balancing, energy optimization, and QoS. | Cloud provider and cloud user. | Yes | No |
[125] | Predicting workload to enable workload consolidation while meeting SLAs. | No specific actors | Yes | Yes |
[69] | Two main activities: matching and scheduling. | No specific actors | Yes | Yes |
Work | Explicit phases or steps |
---|---|
[108] | Provisioning, Scheduling, Monitoring |
[52] | Profiling, Pricing, Provisioning, Estimation, Scheduling, Monitoring |
[72] | Provisioning, Allocation, Adaptation, Mapping, Modeling, Estimation, Discovery, Brokering, Scheduling |
-
Enable task execution; and
-
Optimize infrastructural efficiency based on a set of specified objectives.
Resource management taxonomy
Relevant work
-
Best-effort: Optimize one objective while ignoring other factors such as QoS requirements.
-
Deadline-constrained: Scheduling based on the trade-off between execution time and monetary cost under a deadline constraint.
-
Budget-constrained: The objective is to finish a workflow as fast as possible at given budget.
-
Multi-criteria: Several objectives are taken into account.
-
Workflow-as-a-service: Multiple workflow instances submitted to the resource manager.
-
Robust scheduling: Able to absorb uncertainties such as performance fluctuation and failure.
-
Hybrid environment: Able to address requirements of hybrid clouds.
-
Data-intensive: Data-aware workflow scheduling.
-
Energy-aware: Able to save energy while optimizing execution.
-
Cost-based: Organized in multi-QoS, virtualization-based, application-based, and scalability-based.
-
Time-based: Organized in deadline-based and combination of deadline and budget.
-
Compromised Cost-Time: Based either on workflows or workloads.
-
Bargaining-based: Organized in market-oriented, auction, and negotiation.
-
QoS-based: Based on several QoS aspects, including security and resource utilization.
-
SLA-based: Based on several SLA types, including workload and autonomic aspects.
-
Energy-based: Combined with deadlines and SLAs.
-
Optimization-based: Optimization of several combinations of parameters.
-
Nature Inspired and Bio-Inspired: Including genetic algorithms and ant colony approaches.
-
Dynamic: Several combinations of aspects with dynamic management.
-
Rule-based: Special cases for failures and hybrid clouds.
-
Adaptive-based: Prediction-based and Bin-Packing strategies.
Proposed taxonomy
-
Makespan/Time: encompasses all aspects related to run time and time-based optimization.
-
Deadline: encompasses aspects also related to time but associated to predefined limits to finish a workflow – the central idea is not to finish the execution of a workflow as fast as possible, but simply to address a specific deadline and possibly save resources (i.e., reduce resource allocation) as long as the deadline is met.
-
Cost/Budget: encompasses all aspects related to financial cost and benefits, such as cost minimization and budget limitation.
-
Data-Intensive: works that effectively encompass one or more aspects inherent to data-intensive workflows.
-
Dynamic: works that employ some form of dynamic mechanism to continuously adjust the scheduling decision. This is a typical method to address issues related to unpredictability, such as performance fluctuation.
-
Reliability: works that encompass some form of reliability-related aspect, such as selecting nodes in a way to minimize the chances of failure, or providing mechanisms to circumvent failures.
-
Security: works that consider any aspect of security (in the sense of confidentiality).
-
Energy: energy-aware scheduling mechanisms.
-
Hybrid/Multicloud: works that address requirements of hybrid clouds and multicloud scenarios.
-
Workload/Workflow: works that address requirements for scheduling workflows on clouds.
Survey
-
Fully addressed: The work provides a solution that focuses on addressing the specific aspect, with clear mechanisms to cover it and potentially with experiments showing the effectiveness. For instance, [15] explicitly defines mechanisms to address the requirements of hybrid clouds.Table 4Summary of identified related work classified using the consolidated taxonomyWorkHighlightsMKDLCTDTDYRLSCENHMWL[105]Dynamic level scheduling (DLS)x..ox....x[126]Wide-area scheduling with dynamic load balancing...xx...o.[57]Dynamic Critical Path (DCP)o...x....x[99]Integration to conventional schedulers.........o.[2]ELISA, decentralized dynamic algorithm....x...o.[51]Hierarchical schedulingo........o[27]Federation of resource traders........o.[117]HEFT (Heterogeneous Earliest Finish Time)x.........[113]Redundantly distribute job to multiple sites to increase backfilling........o.[30]Performance and reliability optimizationx.x.xx...x[16]Reduce maximum job waiting time in the queuex.......o.[3]Community of peers for brokering....o...o.[49]Fault-tolerant scheduling.....x...x[79]Dynamic, deadline, energy.x..x..x.x[97]Rescheduling policiesx.o.x....x[58]Auction-based scheduling.........o.[138]Deadline partitioning.xo......x[121]Dynamic voltage scaling.xo.o..x.x[137]Genetic algorithm to optimize cost with deadline constraint.xx......x[149]Merge multiple DAGsx........x[104]Makespan and robustnessx....x...x[102]Load balancing on arrival....o...o.[98]LOSS and GAIN approachesx.x......x[43]Performance and reliability optimizationx....x...x[31]Reliable HEFTx....x...x[139]Minimize execution time and costxxx......x[71]Dynamic scheduling....xo...x[90]Dynamic storage mgmt.o..x.....x[55]Energy and deadline.x..o..x.x[145]Forecast prototype and SLA compensation..x.......[146]Historical information, forecasting..x.......[47]Delegated matchmaking, local vs remote usage.o..o...o.[29]Improve average response timeo.......o.[142]Float time amortization.xo......x[142]Based on HEFTx........x[83]Bandwidth speedup, data-intensiveo..x.....x[89]Makespan and energyx......x.x[133]MQMW (Multiple QoS scheduling of Multi-Workflows)x.x.x....x[84]RASA (Resource-Aware Scheduling Algorithmx.........[59]Decentralized model that improves makespanx.......o.[35]Fuzzy approach for decentralized gridso.......o.[93]Backfilling strategy based on dynamic informationx...x...o.[23]Ant Colony Optimizationxxx..x...x[140]Path-based deadline partition.x.......x[141]Greedy time-cost distribution..x......x[61]Optimize makespan and resource utilizationx...x....x[114]Similar to YU et al., 2007xxx......x[13]Data stagingxo.x.....x[96]QoS-aware, cost and execution timex.x.......[153]Based on genetic algorithm; increase resource utilization....x.....[100]Cost-based..x.......[67]Time-cost-based, instance-intensive workflowsxxx......x[82]Particle swarm optimization heuristic;...xx....x[94]Brokering for multiple grids.........o.[122]Bidding system for resource selectiono.........[128]PSO to minimize cost with deadline constraintoxx......x[39]Optimize makespan and costx.x......x[88]Dynamic programmingx.x..x...x[25]Dynamic scheduling....xo...x[11]Energy efficiency....o..x..[131]Reputation-based QoS provisioningoox.......[74]Deadline, budget, auto-scaling.xx.o.....[64]SHEFT (Scalable HEFT)x..xx....x[119]OWS (Optimal Workflow Scheduling);x........x[132]Justice-based schedulingx...o...o.[151]Budget-constrained HEFT..x.x....x[62]CCSH to minimize makespan and costx.x......x[19]Deadline optimization based on delaying.x..x....x[73]Multiple DAGs; deadline-basedoxo.x....x[15]Hybrid clouds; iteratively resch. tasks until mksp.; deadlinexxo.x...xx[60]Makespan and energyx......x.x[78]Makespan and energyx...o..x.x[116]MapReduce on public clouds.xx.......[50]Multi-tier applicationso.o.......[143]Auction-based, cloud-provider viewpoint..........[40]Heterogeneous workloads.x........[54]SLA management, improve resource utilization.oo......o[107]Multi-cloud, cost optimization..x.....x.[37]Multi-objective, cost constraints..x.....x.[144]Backtracking and continuous cost evaluationo.x.x....x[33]Multi-objective schedulingx.xo.x.x.x[12]Pareto-based; execution time and costx.x......x[118]Combination of DAG merging techniquesx........x[70]Auto-scaling of resourcesoxxo.....x[124]Fault-tolerant scheduling.xx.ox...x[120]Deadline-driven, scientific applications, hybrid clouds.x......xo[148]Energy-aware, scheduling delay.o..o..x..[20]Aneka platform; QoS-driven, hybrid.x..o...x.[48]Cost minimization, deadline.xx.......[26]Negotiation/bargaining.xx.......[129]Market orientedx.x......x[46]Community-aware decentralized dynamic schedulingo...o...o.[1]Partial Critical Path (PCP)oxo......x[65]Minimize end-to-end delayo.x......x[152]Monte Carlo approachx.o.x....x[103]Power aware schedulingx...o..x.x[134]Particle swarm optimizationx.x.o..x.x[130]Data-intensive, energy-awarex..xo..x.x[42]Rule-based.oo.....x.[38]Energy, deadline.xo....x..[41]Bag of tasks, time and costx.x.......[106]SLA-based cost model; powero.x....o..[136]Cost management..o.......[5]Predict Earliest Finish Time (PEFT)x..o.....x[14]Cat Swarm Optimization..xx.....x[95]PSO considering performance variation and VM boot time.xx.x....x[4]Aggregation-based budget distribution..x......x[87]Critical-path heuristicxxx..x...x[86]Spot instances.xx.xx...x[10]Fault-tolerance..x.ox....[56]Behavioral-based estimation..o.x.....[154]Multiple workflows, optimize time and costxxx......x[68]Multi-cloud, enhanced workflow modeloxxx....xx
-
Partially addressed: The work provides mechanisms that could be used to address the specific aspect, even if not explicitly mentioned in the work. For instance, [42] does not directly address deadline and cost aspects, but the solution proposed could be used to cover them with slight operational modifications.
-
Not addressed: The work does not address the aspect.
Further analysis
Work | Data transfers and imbalance | Dynamic scheduling | Hybrid and Multicloud | Workflow support |
---|---|---|---|---|
PANDEY et al., 2010 | Transfers are evaluated via workflow DAG and resource allocation; transfer imbalance is not addressed. | Only addresses fluctuations in the transfer costs. Other aspects such as performance fluctuations and reliability are not mentioned. | No explicit support or experiments. | Modeled as DAGs; richer characterizations are not supported. |
LIN; LU, 2011 | Transfer capacity of nodes in the same network are assumed to be uniform. Transfer imbalance is discarded. | Not addressed. | No explicit support or experiments | Supported; no details included. |
XU et al., 2009 | Transfers and data properties are not explicitly addressed. | Not addressed. | No explicit support or experiments. | Multiple workflows supported via common merging point; simple DAG modeling. |
WEISSMAN; GRIMSHAW, 1996 | Data locality is a scheduling constraint; worker must be assigned closer to data. | Two levels: local and global. Rescheduling is first handled on local level. Details are not provided. | Design for wide-area systems (pre-dates cloud computing). | No explicit support. |
CHEN; ZHANG, 2009 | Data communication and transfers are not explicitly addressed. | Not addressed. | No explicit support or experiments. | Simplified DAG model without edge costs. |
RODRIGUEZ; BUYYA, 2014 | Rigidly modeled; fixed costs for transfers and no cost for local I/O. | Not addressed. | Not addressed. | DAG with fixed transfer costs and computation costs based on FLOPS. |
FARD et al., 2012 | Transfers are considered but contention effects are not. Energy calculations ignore transfer times. | Not addressed. | No explicit support or experiments. | DAG with fixed transfer costs; not details on task costs. |
MALAWSKI et al., 2012 | Algorithm does not consider the size of input data; transfer time is part of computation. | Initial scheduling plus periodic adjusting depending on amount of idle resources. | No explicit support or experiments. | DAG with fixed transfer costs and computation costs with slight variability. |
SAKELLARIOU; ZHAO, 2004 | Linear variation to amount of input data size. | Immediately before execution of tasks and bound to a condition to minimize number of reschedules. | Not addressed; solution originally designed for grids. | DAG with computation and transfer costs modeled with linear variation w.r.t. amount of input. |
WANG; CHEN, 2012 | Not addressed. DAG does not specify transfer costs. | Not addressed. | No explicit support. | DAG with tasks and implicit costs. No transfer costs and no more complex characterization. |
POOLA et al., 2014a | Based on data size and one value for network bandwidth. | Not addressed. | No explicit support. | DAG with task cost based on number of instructions. |
BITTENCOURT; MADEIRA, 2011 | Based on data size and fixed network bandwidth values among nodes. | Two-step scheduling: static, then including public cloud to address deadline. | Initial scheduling step considers private resources; public resources are used if necessary. | DAG with compute cost based on number of instructions. |
VECCHIOLA et al., 2012 | Not specified. | Not addressed. | Public resources used if necessary. | Supported, but no details provided. |
Gaps and challenges
Data-intensive loads
Hybrid and multicloud scenarios
Rescheduling and performance fluctuations
Reliability
Conclusion
-
Lack of mechanisms to address the particularities of data-intensive workflows, especially considering that future trends point to the direction of I/O workflows with intensive data movement and with reliability-related mechanisms highly dependent on I/O as well.
-
Lack of mechanisms to address the particularities of large-scale cloud setups with more complex environments in terms of resource heterogeneity and distribution, such as hybrid and multicloud scenarios, which are expected to be the main drivers for large-scale utilization of cloud – scientific workflows being one important instance.
-
Lack of mechanisms to address the fluctuations in workflow progress due to performance variation and reliability, both phenomena that can be partially or even fully addressed by implementing controlled rescheduling policies.
-
Lack of reliability mechanisms based on actual and measurable metrics that can be derived from documentation and from collecting information of the system.