
Banking on decoupling: budget-driven sustainability for HPC applications on auction-based clouds

Published: 23 July 2013

Abstract

Cloud providers auction their excess capacity as dynamically priced virtual instances. These spot instances offer significant savings over on-demand, fixed-price instances. Users of these resources provide a maximum bid price per hour, and the cloud provider runs each instance as long as the market price stays below the user's bid. In exchange, users are explicitly exposed to failures and must adapt their applications to provide some level of fault tolerance. In this paper, we examine the effect of bidding on virtual HPC clusters composed of spot instances. We show how uniform versus non-uniform bidding changes both the failure rate and the failure model. We present an initial approach to predicting the runtime of a parallel application under various bidding strategies and system parameters. We relate bidding strategies to programming models, and we build a preliminary optimization model that takes as inputs real price traces from Amazon Web Services together with instrumented measurements of the processing and network capacities of EC2 cluster instances. Our results offer preliminary insights into the relationship between non-uniform bidding and application scaling strategies.
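To make the out-of-bid termination rule concrete, here is a minimal sketch, not the authors' model: the price trace, bid values, and node names below are hypothetical stand-ins for the real AWS price histories used in the paper. It counts the hours in which each instance of a small virtual cluster is revoked (market price exceeds its bid) under uniform and non-uniform bidding.

```python
# Minimal sketch (illustrative only): out-of-bid revocations for a
# virtual cluster of spot instances, given a hypothetical hourly
# price trace. An instance is revoked in any hour in which the
# market price exceeds its bid.

from typing import Dict, List


def count_revocations(price_trace: List[float], bid: float) -> int:
    """Count hours in which an instance is revoked (price > bid)."""
    return sum(1 for price in price_trace if price > bid)


def cluster_failures(price_trace: List[float],
                     bids: List[float]) -> Dict[str, int]:
    """Revocation hours per node. Uniform bidding (all bids equal)
    makes every node fail in the same hours, a mass failure;
    non-uniform bids stagger failures across nodes."""
    return {f"node-{i}": count_revocations(price_trace, bid)
            for i, bid in enumerate(bids)}


if __name__ == "__main__":
    # Hypothetical spot prices in $/hour, one sample per hour.
    trace = [0.035, 0.040, 0.120, 0.038, 0.095, 0.036]

    uniform = cluster_failures(trace, [0.05] * 4)
    nonuniform = cluster_failures(trace, [0.04, 0.06, 0.10, 0.15])

    print("uniform bidding:", uniform)        # all nodes fail together
    print("non-uniform bidding:", nonuniform)  # failures are staggered
```

In this toy trace, a uniform bid of $0.05 loses all four nodes simultaneously in the two high-price hours, while the non-uniform bids lose only the low-bid nodes in those hours. This is the distinction the paper explores: uniform bidding concentrates risk into mass failures, whereas non-uniform bidding trades a different failure rate for partial, staggered failures.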

