
Banking on decoupling: budget-driven sustainability for HPC applications on auction-based clouds

Published: 23 July 2013

Abstract

Cloud providers auction their excess capacity as dynamically priced virtual instances. These spot instances offer significant savings over on-demand, fixed-price instances. Users of these resources provide a maximum bid price per hour, and the cloud provider runs each instance as long as the market price stays below the user's bid. In exchange, users are explicitly exposed to failures and must adapt their applications to provide some level of fault tolerance. In this paper, we examine the effect of bidding on virtual HPC clusters composed of spot instances. We show how uniform versus non-uniform bidding changes both the failure rate and the failure model. We present an initial approach to predicting the runtime of a parallel application under various bidding strategies and system parameters. We relate bidding strategies to programming models, and we build a preliminary optimization model that takes as inputs real price traces from Amazon Web Services together with instrumented measurements of the processing and network capacities of EC2 cluster instances. Our results offer preliminary insights into the relationship between non-uniform bidding and application scaling strategies.
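To make the out-of-bid termination rule concrete, here is a minimal sketch, not the authors' model: the price trace, bid values, and node names below are hypothetical stand-ins for the real AWS price histories used in the paper. It counts the hours in which each instance of a small virtual cluster is revoked (market price exceeds its bid) under uniform and non-uniform bidding.

```python
# Minimal sketch (illustrative only): out-of-bid revocations for a
# virtual cluster of spot instances, given a hypothetical hourly
# price trace. An instance is revoked in any hour in which the
# market price exceeds its bid.

from typing import Dict, List


def count_revocations(price_trace: List[float], bid: float) -> int:
    """Count hours in which an instance is revoked (price > bid)."""
    return sum(1 for price in price_trace if price > bid)


def cluster_failures(price_trace: List[float],
                     bids: List[float]) -> Dict[str, int]:
    """Revocation hours per node. Uniform bidding (all bids equal)
    makes every node fail in the same hours, a mass failure;
    non-uniform bids stagger failures across nodes."""
    return {f"node-{i}": count_revocations(price_trace, bid)
            for i, bid in enumerate(bids)}


if __name__ == "__main__":
    # Hypothetical spot prices in $/hour, one sample per hour.
    trace = [0.035, 0.040, 0.120, 0.038, 0.095, 0.036]

    uniform = cluster_failures(trace, [0.05] * 4)
    nonuniform = cluster_failures(trace, [0.04, 0.06, 0.10, 0.15])

    print("uniform bidding:", uniform)        # all nodes fail together
    print("non-uniform bidding:", nonuniform)  # failures are staggered
```

In this toy trace, a uniform bid of $0.05 loses all four nodes simultaneously in the two high-price hours, while the non-uniform bids lose only the low-bid nodes in those hours. This is the distinction the paper explores: uniform bidding concentrates risk into mass failures, whereas non-uniform bidding trades a different failure rate for partial, staggered failures.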

