ABSTRACT
Deep learning workloads are common in today's production clusters due to the proliferation of deep-learning-driven AI services (e.g., speech recognition, machine translation). A deep learning training job is resource-intensive and time-consuming, so efficient resource scheduling is key to maximizing the performance of a deep learning cluster. Existing cluster schedulers are largely not tailored to deep learning jobs and typically assign a fixed amount of resources to each job, which limits both resource efficiency and job performance. This paper proposes Optimus, a customized job scheduler for deep learning clusters that minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training, and builds performance models that accurately estimate training speed as a function of the resources allocated to each job. Based on these models, a simple yet effective method dynamically allocates resources and places deep learning tasks to minimize job completion time. We implement Optimus on top of Kubernetes, a cluster manager for container orchestration, and evaluate it on a deep learning cluster with 7 CPU servers and 6 GPU servers, running 9 training jobs on the MXNet framework. Results show that Optimus outperforms representative cluster schedulers by about 139% in job completion time and 63% in makespan.
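To make the "online fitting to predict model convergence" idea concrete, here is a minimal sketch of fitting a training-loss curve and extrapolating steps-to-target. It assumes a simple convergence model of the form l(k) = 1/(b0*k + b1), which becomes linear after the transform 1/l = b0*k + b1 and can then be solved with non-negative least squares (`scipy.optimize.nnls`). The model form, function names, and synthetic data here are illustrative, not the paper's exact method.

```python
# Illustrative sketch: predict a job's convergence point by fitting
# observed training loss online, then extrapolating to a target loss.
import numpy as np
from scipy.optimize import nnls


def fit_loss_curve(steps, losses):
    """Fit l(k) ~= 1/(b0*k + b1).

    Transforming to 1/l = b0*k + b1 makes the problem linear in
    (b0, b1); NNLS keeps both coefficients non-negative so the
    fitted loss is positive and decreasing.
    """
    steps = np.asarray(steps, dtype=float)
    A = np.column_stack([steps, np.ones_like(steps)])
    y = 1.0 / np.asarray(losses, dtype=float)
    coeffs, _residual = nnls(A, y)
    return coeffs  # (b0, b1)


def predict_steps_to_loss(coeffs, target_loss):
    """Invert the fitted curve: step k where predicted loss hits target."""
    b0, b1 = coeffs
    return (1.0 / target_loss - b1) / b0


if __name__ == "__main__":
    # Synthetic loss observations from known coefficients plus mild noise.
    rng = np.random.default_rng(0)
    steps = np.arange(1, 201, dtype=float)
    true_loss = 1.0 / (0.05 * steps + 2.0)
    observed = true_loss * (1.0 + 0.01 * rng.standard_normal(steps.size))

    b = fit_loss_curve(steps, observed)
    print("fitted coefficients:", b)
    print("predicted steps to reach loss 0.05:", predict_steps_to_loss(b, 0.05))
```

A scheduler in this style would refit the curve as new loss values stream in, so the estimate of remaining work (and hence the value of giving the job more resources) improves as training progresses.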