Optimizing the Cost of Executing Mixed Interactive and Batch Workloads on Transient VMs

Authors:
Pradeep Ambati, University of Massachusetts, Amherst, Amherst, MA, USA
David Irwin, University of Massachusetts, Amherst, Amherst, MA, USA

Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 3, Issue 2, Article No. 28, pp. 1–24
https://doi.org/10.1145/3341617.3326143

Published: 19 June 2019

Abstract

Container Orchestration Platforms (COPs), such as Kubernetes, are increasingly used to manage large-scale clusters by automating resource allocation among applications encapsulated in containers. The resources underlying COPs are, in turn, increasingly virtual machines (VMs) dynamically acquired from cloud platforms. COPs may choose from many different types of VMs offered by cloud platforms, which differ in their cost, performance, and availability. In particular, while transient VMs cost significantly less than on-demand VMs, platforms may revoke them at any time, causing them to become unavailable. Although the price of transient VMs is attractive, their unreliability is a problem for COPs designed to support mixed workloads composed not only of delay-tolerant batch jobs but also of long-lived interactive services with high availability requirements. To address the problem, we design TR-Kubernetes, a COP that optimizes the cost of executing mixed interactive and batch workloads on cloud platforms using transient VMs. TR-Kubernetes enforces arbitrary availability requirements specified by interactive services, despite transient VM unavailability, by acquiring many more transient VMs than are necessary most of the time; it then uses the excess resources, when available, to opportunistically execute batch jobs. When cloud platforms revoke transient VMs, TR-Kubernetes relies on existing Kubernetes functions to internally revoke resources from batch jobs to maintain interactive services' availability requirements. We show that TR-Kubernetes requires minimal extensions to Kubernetes, and is capable of lowering the cost (by 53%) and improving the availability (99.999%) of a representative interactive/batch workload on Amazon EC2 when using transient compared to on-demand VMs.
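
The over-provisioning idea in the abstract can be illustrated with a small sketch. It is not the algorithm used by TR-Kubernetes, which the abstract does not specify. It assumes, purely for illustration, that each transient VM is independently available with the same probability `p`, and it searches for the smallest number of VMs such that at least `k` of them are available with probability no lower than the interactive service's target. The function names `prob_at_least` and `vms_needed`, and all numeric values, are hypothetical.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p).

    Assumption: each of the n transient VMs is available independently
    with probability p. Real revocations may be correlated.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def vms_needed(k: int, p: float, target: float, max_vms: int = 10_000) -> int:
    """Smallest n >= k such that at least k VMs are up with probability >= target."""
    for n in range(k, max_vms + 1):
        if prob_at_least(k, n, p) >= target:
            return n
    raise ValueError("availability target not reachable within max_vms")


if __name__ == "__main__":
    k = 10  # VMs the interactive services need (hypothetical)
    n = vms_needed(k, p=0.95, target=0.99999)
    # When no VM is revoked, the n - k extra VMs are free for batch jobs.
    print(f"acquire {n} transient VMs; up to {n - k} usable for batch jobs")
```

Under these assumptions, the gap between the returned count and `k` is the excess capacity that would usually sit idle, which is the capacity the abstract describes as being used opportunistically for batch jobs.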

Index Terms

1. Software and its engineering
  1. Software organization and properties
    1. Software system structures
      1. Distributed systems organizing principles
        Cloud computing

Published in
Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 3, Issue 2
June 2019, 683 pages
EISSN: 2476-1249
DOI: 10.1145/3341617
Editors: Augustin Chaintreau (Columbia University), Thomas Bonald (Telecom ParisTech), Nick Duffield (Texas A&M University)
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher: Association for Computing Machinery, New York, NY, United States

Author Tags
cloud computing
resource allocation
run time management and scheduling
Qualifiers
- research-article