research-article

Optimizing the Cost of Executing Mixed Interactive and Batch Workloads on Transient VMs

Published: 19 June 2019

Abstract

Container Orchestration Platforms (COPs), such as Kubernetes, are increasingly used to manage large-scale clusters by automating resource allocation between applications encapsulated in containers. More and more often, the resources underlying COPs are virtual machines (VMs) acquired dynamically from cloud platforms. COPs may choose from many different types of VMs offered by cloud platforms, which differ in their cost, performance, and availability. In particular, while transient VMs cost significantly less than on-demand VMs, platforms may revoke them at any time, causing them to become unavailable. Although the price of transient VMs is attractive, their unreliability is a problem for COPs designed to support mixed workloads composed not only of delay-tolerant batch jobs but also of long-lived interactive services with high availability requirements. To address the problem, we design TR-Kubernetes, a COP that optimizes the cost of executing mixed interactive and batch workloads on cloud platforms using transient VMs. To do so, TR-Kubernetes enforces arbitrary availability requirements specified by interactive services, despite transient VM unavailability, by acquiring many more transient VMs than are necessary most of the time; it then uses the excess resources, when available, to opportunistically execute batch jobs. When cloud platforms revoke transient VMs, TR-Kubernetes relies on existing Kubernetes functions to internally revoke resources from batch jobs to maintain interactive services' availability requirements. We show that TR-Kubernetes requires minimal extensions to Kubernetes, and is capable of lowering the cost (by 53%) and improving the availability (99.999%) of a representative interactive/batch workload on Amazon EC2 when using transient rather than on-demand VMs.
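
The over-provisioning idea in the abstract can be illustrated with a small sketch. It is not the paper's provisioning algorithm: it assumes, purely for illustration, that each transient VM is available independently with the same probability, and the function names (prob_at_least, min_vms_for_availability) and parameters are hypothetical. Under those assumptions it finds the smallest number of transient VMs such that at least the number required by the interactive services is available with the target probability; the VMs beyond that requirement are the excess capacity that batch jobs could use opportunistically.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability that at least k of n VMs are available, assuming each VM
    is available independently with probability p (a simplifying assumption
    that is not taken from the paper)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def min_vms_for_availability(required: int, vm_availability: float,
                             target: float, max_vms: int = 10_000) -> int:
    """Smallest number of transient VMs to acquire so that at least
    `required` of them are available with probability >= `target`."""
    if required < 1:
        raise ValueError("required must be at least 1")
    if not 0.0 < vm_availability <= 1.0:
        raise ValueError("vm_availability must be in (0, 1]")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    for n in range(required, max_vms + 1):
        if prob_at_least(required, n, vm_availability) >= target:
            return n
    raise ValueError("target not reachable within max_vms")


if __name__ == "__main__":
    # Hypothetical inputs: 10 VMs needed by interactive services, each
    # transient VM available 90% of the time, "five nines" target.
    needed = 10
    acquired = min_vms_for_availability(needed, 0.90, 0.99999)
    print(f"acquire {acquired} VMs; {acquired - needed} are usually spare "
          "and could run batch jobs")
```

In a real deployment revocations of VMs of the same type are likely to be correlated, so the independence assumption would tend to underestimate the number of VMs needed; the sketch only shows why meeting a high availability target leaves excess capacity most of the time.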


      Published in

      Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 3, Issue 2
      June 2019
      683 pages
      EISSN: 2476-1249
      DOI: 10.1145/3341617

      Copyright © 2019 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States
