ABSTRACT
Deep learning workloads are common in today's production clusters due to the proliferation of deep-learning-driven AI services (e.g., speech recognition, machine translation). A deep learning training job is resource-intensive and time-consuming, so efficient resource scheduling is key to maximizing the performance of a deep learning cluster. Existing cluster schedulers are largely not tailored to deep learning jobs and typically assign a fixed amount of resources to each job, which limits both resource efficiency and job performance. This paper proposes Optimus, a customized job scheduler for deep learning clusters that minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training, and builds performance models that accurately estimate training speed as a function of the resources allocated to each job. Based on these models, a simple yet effective method dynamically allocates resources and places deep learning tasks to minimize job completion time. We implement Optimus on top of Kubernetes, a cluster manager for container orchestration, and evaluate it on a deep learning cluster with 7 CPU servers and 6 GPU servers, running 9 training jobs on the MXNet framework. Results show that Optimus outperforms representative cluster schedulers by about 139% in job completion time and 63% in makespan.
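To make the "online fitting to predict model convergence" idea concrete, here is a minimal sketch of fitting a training-loss curve and extrapolating steps-to-target. It assumes a simple convergence model of the form l(k) = 1/(b0*k + b1), which becomes linear after the transform 1/l = b0*k + b1 and can then be solved with non-negative least squares (`scipy.optimize.nnls`). The model form, function names, and synthetic data here are illustrative, not the paper's exact method.

```python
# Illustrative sketch: predict a job's convergence point by fitting
# observed training loss online, then extrapolating to a target loss.
import numpy as np
from scipy.optimize import nnls


def fit_loss_curve(steps, losses):
    """Fit l(k) ~= 1/(b0*k + b1).

    Transforming to 1/l = b0*k + b1 makes the problem linear in
    (b0, b1); NNLS keeps both coefficients non-negative so the
    fitted loss is positive and decreasing.
    """
    steps = np.asarray(steps, dtype=float)
    A = np.column_stack([steps, np.ones_like(steps)])
    y = 1.0 / np.asarray(losses, dtype=float)
    coeffs, _residual = nnls(A, y)
    return coeffs  # (b0, b1)


def predict_steps_to_loss(coeffs, target_loss):
    """Invert the fitted curve: step k where predicted loss hits target."""
    b0, b1 = coeffs
    return (1.0 / target_loss - b1) / b0


if __name__ == "__main__":
    # Synthetic loss observations from known coefficients plus mild noise.
    rng = np.random.default_rng(0)
    steps = np.arange(1, 201, dtype=float)
    true_loss = 1.0 / (0.05 * steps + 2.0)
    observed = true_loss * (1.0 + 0.01 * rng.standard_normal(steps.size))

    b = fit_loss_curve(steps, observed)
    print("fitted coefficients:", b)
    print("predicted steps to reach loss 0.05:", predict_steps_to_loss(b, 0.05))
```

A scheduler in this style would refit the curve as new loss values stream in, so the estimate of remaining work (and hence the value of giving the job more resources) improves as training progresses.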