DOI: 10.1145/3190508.3190517

Optimus: an efficient dynamic resource scheduler for deep learning clusters

Published: 23 April 2018

ABSTRACT

Deep learning workloads are common in today's production clusters, due to the proliferation of deep-learning-driven AI services (e.g., speech recognition, machine translation). A deep learning training job is resource-intensive and time-consuming, so efficient resource scheduling is key to the performance of a deep learning cluster. Existing cluster schedulers are largely not tailored to deep learning jobs and typically assign a fixed amount of resources to each job, which limits resource efficiency and job performance. This paper proposes Optimus, a customized job scheduler for deep learning clusters that minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training, and builds performance models that accurately estimate training speed as a function of the resources allocated to each job. Based on these models, a simple yet effective method dynamically allocates resources and places deep learning tasks to minimize job completion time. We implement Optimus on top of Kubernetes, a cluster manager for container orchestration, and experiment on a deep learning cluster with 7 CPU servers and 6 GPU servers, running 9 training jobs using the MXNet framework. Results show that Optimus outperforms representative cluster schedulers by about 139% and 63% in terms of job completion time and makespan, respectively.
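The abstract names two model-driven mechanisms: online fitting of each job's convergence curve, and a resource-performance model that guides allocation. As a loose illustration of the first, here is a minimal sketch, assuming an SGD-style loss curve of the form 1/(b0*k + b1) + b2 and fitting it with SciPy's curve_fit; the model form, the function names, and the target-loss inversion are illustrative assumptions, not the paper's exact method.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_model(k, b0, b1, b2):
    # Assumed convergence model: training loss decays roughly as 1/k
    # under SGD, approaching an asymptotic floor b2.
    return 1.0 / (b0 * k + b1) + b2

def fit_convergence(steps, losses):
    # Refit the curve on all (step, loss) observations so far; the
    # non-negativity bounds keep the fitted curve decreasing toward b2.
    params, _ = curve_fit(loss_model, steps, losses,
                          p0=[0.1, 1.0, 0.0], bounds=(0, np.inf))
    return params

def steps_to_target(params, target_loss):
    # Invert the fitted curve to predict at which step the loss first
    # reaches target_loss (None if the target lies below the floor).
    b0, b1, b2 = params
    if b0 <= 0 or target_loss <= b2:
        return None
    return (1.0 / (target_loss - b2) - b1) / b0

# Example: fit on 50 observed steps, then extrapolate.
steps = np.arange(1.0, 51.0)
losses = 1.0 / (0.05 * steps + 2.0) + 0.1
print(steps_to_target(fit_convergence(steps, losses), 0.15))  # ~360 steps
```

For the second mechanism, a hedged sketch of a marginal-gain allocation policy in the spirit of "dynamically allocating resources ... to minimize job completion time": spare workers are handed out one at a time to whichever job's estimated completion time would drop the most. The estimate_time callback (which would be derived from the fitted convergence and speed models) and the greedy policy itself are assumptions for illustration, not the scheduler's published algorithm.

```python
import heapq

def allocate_workers(jobs, total_workers, estimate_time):
    # jobs: list of job ids; estimate_time(job, w): predicted completion
    # time of `job` when running with `w` workers. Assumes
    # total_workers >= len(jobs) so every job gets at least one worker.
    alloc = {j: 1 for j in jobs}
    spare = total_workers - len(jobs)
    # Max-heap keyed on the time saved by giving the job one extra worker.
    heap = [(-(estimate_time(j, 1) - estimate_time(j, 2)), j) for j in jobs]
    heapq.heapify(heap)
    while spare > 0 and heap:
        neg_gain, j = heapq.heappop(heap)
        if -neg_gain <= 0:
            break  # no remaining job benefits from another worker
        alloc[j] += 1
        spare -= 1
        w = alloc[j]
        heapq.heappush(heap, (-(estimate_time(j, w) - estimate_time(j, w + 1)), j))
    return alloc
```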


      Published in

        EuroSys '18: Proceedings of the Thirteenth EuroSys Conference
        April 2018
        631 pages
        ISBN: 978-1-4503-5584-1
        DOI: 10.1145/3190508

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 23 April 2018


        Qualifiers

        • research-article

        Acceptance Rates

        EuroSys '18 Paper Acceptance Rate: 43 of 262 submissions, 16%. Overall Acceptance Rate: 241 of 1,308 submissions, 18%.
