Modeling and performance analysis of large scale IaaS Clouds

https://doi.org/10.1016/j.future.2012.06.005

Abstract

For Cloud based services to support enterprise class production workloads, mainframe-like predictable performance is essential. However, the scale, complexity, and inherent resource sharing across workloads make it difficult to manage a Cloud for predictable performance. As a first step towards designing Cloud based systems that achieve such performance and meet their service level objectives, we develop a scalable stochastic analytic model for performance quantification of an Infrastructure-as-a-Service (IaaS) Cloud. Specifically, we model a class of IaaS Clouds that offer tiered services by configuring physical machines into three pools with different provisioning delay and power consumption characteristics. Performance behaviors in such IaaS Clouds are affected by a large set of parameters, e.g., workload, system characteristics and management policies. Thus, traditional analytic models for such systems tend to be intractable. To overcome this difficulty, we propose a multi-level interacting stochastic sub-models approach where the overall model solution is obtained iteratively over individual sub-model solutions. By comparing with a single-level monolithic model, we show that our approach is scalable and tractable, yet retains high fidelity. Since the dependencies among the sub-models are resolved via fixed-point iteration, we prove the existence of a solution. Results from our analysis show the impact of workload and system characteristics on two performance measures: mean response delay and job rejection probability.

Highlights

► We propose an interacting sub-models approach to model a Cloud with multiple pools.
► Dependencies among the analytic sub-models are resolved via fixed-point iteration.
► Closed-form solutions of sub-models are shown whenever feasible.
► Our approach is shown to be scalable and accurate compared to a monolithic model.
► We discuss how developed models can be used for what-if analysis in a Cloud.

Introduction

An Infrastructure-as-a-Service (IaaS) Cloud, such as Amazon EC2 [1], IBM SmartCloud Enterprise, IBM SmartCloud Enterprise+ [2], [3], and IBM Smart Business Desktop Cloud [4], delivers on-demand operating system (OS) instances, provisioning computational resources in the form of virtual machines deployed in the Cloud provider’s data center. Such Cloud based services are gaining popularity, leading to increasing business competition; hence, performance and dependability guarantees are becoming critical. Providers of IaaS Clouds (e.g., IBM and Amazon) offer service level agreements (SLAs) to Cloud users. Violations of such SLAs can cause loss of revenue and business reputation. We observe that, for most IaaS Cloud providers, the offered SLAs are in terms of guaranteed availability. As the technology and business models for Cloud services mature, users will also expect SLAs on Cloud performance in addition to availability. However, the performance of an IaaS Cloud depends on a large number of factors including: (i) nature of physical infrastructure (e.g., CPU, RAM, disk characteristics), (ii) nature of virtual infrastructure (e.g., hypervisor characteristics), (iii) nature of management and automation tools (e.g., request provisioning steps), (iv) workload (e.g., arrival rate, request types) and (v) available capacity (e.g., number of physical machines). Hence, systematic performance assessment of Cloud infrastructure is difficult.

This paper presents a scalable analytic approach for model driven performance analysis of an IaaS Cloud. Traditional measurement based performance evaluation requires extensive experimentation with each workload and system configuration and may not be feasible in terms of cost due to the sheer scale of the Cloud. A simulation model can be developed and solved but, in contrast with an analytic model, it might be time consuming, as the generation of statistically significant results may require many simulation runs [5]. A stochastic model is a more attractive alternative because of the lower relative cost of solving the model while covering a large parameter space. However, such stochastic analytic models are presumed not to scale well when dealing with the rising complexity associated with Cloud service architectures. Simplifying the model to make it more tractable can result in lower fidelity and, in the process, the effects of important parameters affecting the service performance metrics may not be captured [6]. To overcome this difficulty, in our recent research [7], we described a scalable approach using interacting stochastic sub-models where an overall solution is composed by iteration over individual sub-model solutions. We quantified the effects of changes in workload (e.g., job arrival rate, job service rate) and system capacity (e.g., number of physical machines, number of virtual machines per physical machine) on service quality as measured by mean response delay and job rejection probability. In this paper, we extend our previous research in several directions:

(1) A monolithic Cloud performance model is constructed using our variant of stochastic Petri net (SPN) called stochastic reward net (SRN) [8]. We compare the scalability and accuracy of the interacting stochastic sub-models approach proposed in [7] with those of the monolithic model. Our analysis shows that the monolithic model becomes intractable and fails to produce results as the scale of the Cloud increases, while the interacting sub-models approach quickly provides model solutions without significantly compromising accuracy. Thus, we provide an analytic verification and validation of the proposed approach.

(2) Closed-form solutions of sub-models are shown whenever feasible. Such closed-form expressions can complement the use of analytic modeling software packages such as SHARPE [9] and SPNP [10] when dealing with a large number of model states.

(3) Since dependencies among the sub-models are resolved via fixed-point iteration, in this paper, we prove the existence of a solution for the associated fixed-point equation.

(4) Numerical results are expanded and discussions are presented on how the proposed model can be extended to include different Cloud management aspects. Our developed model can be applied in capacity planning, forecasting, sensitivity analysis to find bottlenecks, what-if analysis or in an overall design optimization problem during design, development, testing and operational phases of an IaaS Cloud.
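As an illustration of how dependencies among sub-models can be resolved by fixed-point iteration, the following is a minimal Python sketch under stated assumptions. The two sub-models are hypothetical stand-ins and are not the sub-models developed in this paper: `decision_submodel` is an M/M/1/K queue whose mean service time is assumed to grow by a factor of (1 + p_pool_full), and `pool_submodel` is an Erlang loss system. All parameter values are arbitrary. The output of each sub-model is the input of the other, so the pool-full probability is obtained as the fixed point of their composition.

```python
def erlang_b(offered_load, servers):
    """Blocking probability of an M/M/c/c loss system (standard recursion)."""
    b = 1.0
    for k in range(1, servers + 1):
        b = offered_load * b / (k + offered_load * b)
    return b


def mm1k_full(lam, mu, capacity):
    """Probability that an M/M/1/K queue holds `capacity` jobs (is full)."""
    rho = lam / mu
    if abs(rho - 1.0) < 1e-12:
        return 1.0 / (capacity + 1)
    return (1.0 - rho) * rho**capacity / (1.0 - rho**(capacity + 1))


def decision_submodel(p_pool_full, lam=10.0, delta=20.0, capacity=5):
    """Hypothetical sub-model 1: rate at which jobs are handed to the pool.

    Assumption (illustrative only): the mean decision time grows by a
    factor (1 + p_pool_full) when the pool is more likely to be full.
    """
    delta_eff = delta / (1.0 + p_pool_full)
    return lam * (1.0 - mm1k_full(lam, delta_eff, capacity))


def pool_submodel(lam_eff, mu=1.0, servers=8):
    """Hypothetical sub-model 2: probability that all VM slots are busy."""
    return erlang_b(lam_eff / mu, servers)


def solve_fixed_point(x0=0.0, tol=1e-10, max_iter=1000):
    """Iterate x <- pool_submodel(decision_submodel(x)) until successive values agree."""
    x = x0
    for iteration in range(1, max_iter + 1):
        x_next = pool_submodel(decision_submodel(x))
        if abs(x_next - x) < tol:
            return x_next, iteration
        x = x_next
    raise RuntimeError("fixed-point iteration did not converge")


p_full, iterations = solve_fixed_point()
print(f"pool-full probability {p_full:.6f} after {iterations} iterations")
```

Successive substitution is only one way to solve the fixed-point equation; it is used here because it is the simplest, and it converges for this toy example because the composed map changes slowly with its argument.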

The rest of the paper is organized as follows. In Section 2, we present a system description and assumptions, and formulate the problems of interest. Section 3 describes the interacting sub-models approach for performance analysis of an IaaS Cloud. An equivalent monolithic model is described in Section 4. Numerical results are presented in Section 5. Generalizations of the interacting sub-models approach and future avenues of research are outlined in Section 6. Related research is highlighted in Section 7. Finally, we conclude this work in Section 8. Appendix A summarizes the symbols used in the paper, and detailed steps of the closed-form solutions of the interacting sub-models are shown in Appendix B.

System description, assumptions, and problem formulation

In an IaaS Cloud, when a request is processed, a pre-built image is used to deploy one or more Virtual Machine (VM) instances or a pre-deployed VM may be customized and made available to the requester. VMs are deployed on Physical Machines (PMs) each of which may be shared by multiple VMs. The deployed VMs are provisioned with request specific CPU, RAM, and disk capacity. This process of provisioning and deploying VMs involves delays which may be reduced by various optimization techniques. One

Proposed interacting stochastic sub-models approach

Our main motivation behind using an interacting sub-models approach is the following. A global monolithic model to capture all the details of a Cloud service tends to be complex. As we will show in Section 4, even by using methods for the automated generation of models such as SRNs, such models become intractable and may not scale to large sized Clouds. Hence, we resort to an interacting sub-models approach to reduce the complexity of analysis without significantly affecting the accuracy. A

Monolithic model for IaaS cloud

We use SRN to construct a monolithic model for IaaS Cloud. SRNs [26] are extensions of GSPNs [27]. Key features of SRNs are: (1) each transition may have a guard function; the transition is enabled in a marking only if its guard function evaluates to true; (2) marking dependent arc multiplicities are allowed; (3) marking dependent firing rates are allowed; (4) transitions can be assigned different priorities; (5) besides traditional output measures obtained from a GSPN, such as throughput of a

Numerical results and discussions

We exercise the interacting sub-models described earlier to compute two QoS metrics: (1) job rejection probability and (2) mean response delay. Using SHARPE [9], we solve the models and show the effects of changing job arrival rates, mean job service times, and system capacity (number of PMs in each pool, number of VMs on each PM) on QoS measures.
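To make the two QoS metrics concrete, the sketch below computes a job rejection probability and a mean response delay for a single M/M/c/K queue and sweeps the job arrival rate. This queue is a simplified stand-in chosen for illustration, not the model solved with SHARPE in this paper; the function name `mmck_metrics` and all parameter values are assumptions.

```python
def mmck_metrics(lam, mu, servers, capacity):
    """Return (rejection probability, mean response delay) for an M/M/c/K queue.

    lam: job arrival rate, mu: per-server service rate,
    servers: number of servers (c), capacity: maximum jobs in the system (K).
    """
    # Unnormalized steady-state probabilities of the birth-death chain.
    weights = [1.0]
    for n in range(1, capacity + 1):
        weights.append(weights[-1] * lam / (min(n, servers) * mu))
    total = sum(weights)
    probs = [w / total for w in weights]

    # A job is rejected when it arrives to a full system.
    p_reject = probs[capacity]

    # Little's law applied to accepted jobs only.
    mean_jobs = sum(n * p for n, p in enumerate(probs))
    mean_delay = mean_jobs / (lam * (1.0 - p_reject))
    return p_reject, mean_delay


# Sweep the arrival rate with the service rate and capacity held fixed.
for lam in (2.0, 4.0, 6.0, 8.0):
    p_reject, delay = mmck_metrics(lam, mu=1.0, servers=4, capacity=10)
    print(f"arrival rate {lam:4.1f}: rejection {p_reject:.4f}, mean delay {delay:.3f}")
```

The same sweep can be repeated over the service rate or the number of servers to mimic the kind of what-if analysis described in the text.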

Values of key parameters. We assume a wide range of values for our model parameters so that the model can represent a large variety of IaaS Cloud

Model generalizations and future research

Model generalizations. We discuss three system management aspects w.r.t. which the proposed scalable stochastic models can be generalized.

(1) Dealing with larger number of pools. In this paper, we described performance analysis of a specific type of IaaS Cloud with three pools of PMs. The PMs were grouped into pools to maintain a balance between power consumption and performance behavior. In general, to offer a differentiated service, Cloud providers might create a larger number of PM pools. We

Related research

Ostermann et al. [34] carried out a measurement-based performance evaluation of Amazon EC2 [1] in the context of scientific computing. Through experiments, they showed that the performance of Amazon EC2 is insufficient for scientific computing. In [35], Deelman et al. presented a similar study of the performance and cost of executing the Montage workflow on Clouds. Yigitbasi et al. [36] proposed an experimental framework to analyze the resource acquisition and resource release times in Amazon

Conclusions

In this paper, we quantify the effects of variations in workload (e.g., job arrival rate, job service rate) and system capacity (PMs per pool, VMs per PM) on the performance of a class of IaaS Clouds where PMs are organized in a set of pools to reduce management and power consumption costs. Using interacting stochastic sub-models, we propose a fast method suitable for analyzing the service quality of large sized IaaS Clouds. We discuss how our approach can reduce model complexity both in

References (54)

  • G. Ciardo et al., A decomposition approach for stochastic reward net models, Performance Evaluation (1993)
  • H. Choi et al., Markov Regenerative Stochastic Petri Nets, Performance Evaluation (1994)
  • Amazon EC2, 2006....
  • IBM SmartCloud Enterprise, 2011....
  • IBM cloud computing from Wikipedia: 2012....
  • IBM Smart Business Desktop Cloud, 2012....
  • R. Buyya, R. Ranjan, R.N. Calheiros, Modeling and simulation of scalable cloud computing environments and the cloudsim...
  • N. Sato, K.S. Trivedi, Stochastic modeling of composite web services for closed-form analysis of their performance and...
  • R. Ghosh, K.S. Trivedi, V.K. Naik, D.S. Kim, Performability analysis for infrastructure-as-a-service cloud: an...
  • G. Ciardo, J. Muppala, K.S. Trivedi, SPNP: stochastic Petri net package, in: Third Int. Workshop on Petri Nets and...
  • K.S. Trivedi et al., SHARPE at the age of twenty two, ACM Sigmetrics Performance Evaluation Review (2009)
  • C. Hirel et al., SPNP: stochastic Petri nets. Version 6, Lecture Notes in Computer Science (2000)
  • W. Jansen, T. Grance, Guidelines on security and privacy in public cloud computing, Technical Report, NIST Special...
  • D. Faatz, L. Pizette, Information Security in the Clouds, MITRE Corporation Technical Report Case # 10-3208,...
  • A. Celesti, F. Tusa, M. Villari, A. Puliafito, How to enhance cloud architectures to enable cross-federation, in: IEEE...
  • A. Ganapathi, Y. Chen, A. Fox, R. Katz, D. Patterson, Statistics-driven workload modeling for the cloud, in: SMDB,...
  • P. Garbacki, V. Naik, Efficient resource virtualization and sharing strategies for heterogeneous grid environments, in:...
  • K.S. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications (2001)
  • D. Wang, R.M. Fricks, K.S. Trivedi, Dealing with non-exponential distributions in dependability models, in: G. Kotsis...
  • C.H. Sauer et al., Computer Systems Performance Modeling (1981)
  • N. Sato, K.S. Trivedi, Accurate and efficient stochastic reliability analysis of composite services using their compact...
  • P.J. Burke, Output of a queuing system, Operations Research (1956)
  • Virtual Computing Lab at North Carolina State University,...
  • L. Tomek et al., Fixed-Point Iteration in Availability Modeling
  • V. Mainkar et al., Sufficient conditions for existence of a fixed point in stochastic reward net-based iterative models, IEEE Transactions on Software Engineering (1996)
  • J.M. Ortega et al., Iterative Solution of Nonlinear Equations in Several Variables (1970)
  • G. Ciardo, Automated generation and analysis of Markov reward models using stochastic reward nets


    Rahul Ghosh is a Ph.D. candidate in Electrical and Computer Engineering at Duke University, USA. He received his M.S. in Computer Engineering from Duke University in 2009. Prior to this, he received his B.E. in Electronics and Telecommunication from Jadavpur University, India, in 2007. His research interests include stochastic processes, queuing systems, Markov chains, performance and dependability analysis of large scale computer systems. Rahul’s Ph.D. Thesis research is focused on developing scalable analytic models for Infrastructure-as-a-Service Cloud. During his Ph.D., he worked as a research intern at IBM T.J. Watson Research Center. At IBM Research, his work was focused on cost optimization, capacity planning and risk analysis of Cloud systems.

    Francesco Longo was born in Messina on November 16, 1982. He received his Degree in Computer Engineering from the University of Messina (Italy) in November 2007, with a final score of 110/110 summa cum laude. The title of his thesis was “Symbolic representation of the reachability graph of non-Markovian stochastic Petri net”. From September 2007 to June 2008 he worked at the University of Messina within the PON Project “Progetto per l’Implementazione e lo Sviluppo di una e-Infrastruttura in Sicilia basata sul paradigma della grid (PI2S2)” with the aim of designing and implementing a QoS management system in a Grid environment. In June 2008 he received a Master’s degree in Open Source and Computer Security. He received his Ph.D. in “Advanced Technologies for Information Engineering” at the University of Messina in April 2011. Between May 2010 and October 2010 he was a visiting scholar in the United States at Duke University (Durham, NC), where he had the opportunity to collaborate with Prof. Kishor S. Trivedi on the modeling of Cloud systems and the quantitative evaluation of their performance and availability. Since 2010 he has been a teaching assistant for the subject “Valutazione delle prestazioni” (Performance evaluation) at the Faculty of Engineering, University of Messina. He is now a postdoctoral researcher within the Vision Cloud European project at the University of Messina. The project aims to build a new storage cloud infrastructure and framework. His main research interests include performance and reliability evaluation of distributed systems (mainly Grid and Cloud), with particular attention to the use of non-Markovian stochastic Petri nets.

    Vijay K. Naik is a Research Staff Member at IBM T. J. Watson Research Center. He has been an active researcher in the area of distributed & fault tolerant computing and service computing. Over the past few years, he has been providing leadership in developing innovative technologies for cloud computing that are now incorporated in IBM provided solutions. Currently, he is leading research teams to advance the frontiers of hybrid cloud computing, workload migration to cloud systems, transformation of enterprise IT to cloud computing, performance modeling and analysis of cloud based services & systems, and quantifying the economics of cloud computing. He received his Ph.D. degree in computer science from Duke University in 1988. Previously he has worked at Google and ICASE, NASA Langley Research Center.

    Kishor S. Trivedi holds the Hudson Chair in the Department of Electrical and Computer Engineering at Duke University. He has been on the Duke faculty since 1975. He is the author of a well-known text entitled Probability and Statistics with Reliability, Queuing and Computer Science Applications, published by John Wiley. He has also published two other books, entitled Performance and Reliability Analysis of Computer Systems, published by Kluwer Academic Publishers, and Queueing Networks and Markov Chains, published by John Wiley. He is a Fellow of the Institute of Electrical and Electronics Engineers. He has published over 470 articles and has supervised 43 Ph.D. dissertations. He is the recipient of the IEEE Computer Society Technical Achievement Award for his research on Software Aging and Rejuvenation. He works closely with industry in carrying out reliability/availability analysis, providing short courses, and in the development and dissemination of software packages such as SHARPE and SPNP.

    This research was supported in part by IBM Research and the US National Science Foundation under grant NSF-CNS-08-31325.