ABSTRACT
Large deep neural network (DNN) models trained on vast amounts of data have recently achieved state-of-the-art accuracy on hard tasks, such as image and speech recognition. Because training is time-consuming and compute-intensive, training these DNNs on a cluster of commodity machines is a promising approach. To enable training of extremely large DNNs, models are partitioned across machines. To expedite training on very large data sets, multiple model replicas are trained in parallel on different subsets of the training examples, with a global parameter server maintaining the weights shared across these replicas. The right choice of model and data partitioning and of overall system provisioning depends strongly on the characteristics of the DNN and of the distributed system hardware. Making these decisions currently requires significant domain expertise and time-consuming empirical exploration of the state space.
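For context, the following is a minimal sketch, not the paper's system, of the data-parallel pattern described above: several model replicas train on disjoint subsets of the examples while a global parameter server holds the shared weights. All class names, the learning rate, and the linear least-squares update are illustrative assumptions.

```python
import numpy as np

class ParameterServer:
    """Toy global parameter server holding the weights shared by all replicas."""

    def __init__(self, num_weights, learning_rate=0.05):
        self.weights = np.zeros(num_weights)
        self.learning_rate = learning_rate

    def pull(self):
        # Each replica fetches the latest shared weights before a step.
        return self.weights.copy()

    def push(self, gradient):
        # Each replica sends back its gradient; the server applies an SGD update.
        self.weights -= self.learning_rate * gradient


def run_replica(server, shard):
    """One model replica training on its own subset of the examples."""
    for x, y in shard:
        w = server.pull()
        gradient = (w @ x - y) * x   # gradient of a linear least-squares model
        server.push(gradient)


# Illustrative usage: two replicas, each training on half of a synthetic data set.
rng = np.random.default_rng(0)
true_w = rng.normal(size=4)
data = [(x, true_w @ x) for x in rng.normal(size=(200, 4))]
server = ParameterServer(num_weights=4)
for shard in (data[:100], data[100:]):
    run_replica(server, shard)
```

In the distributed setting the paper targets, each replica may itself be partitioned across machines (model parallelism), and the replicas' pushes and pulls go over the network, which is exactly why the partitioning and provisioning choices matter for training time.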
This paper develops performance models that quantify the impact of these partitioning and provisioning decisions on overall distributed system performance and scalability. We then use these performance models to build a scalability optimizer that efficiently determines the system configuration minimizing DNN training time. We evaluate our performance models and scalability optimizer using a state-of-the-art distributed DNN training framework on two benchmark applications. The results show that our performance models estimate DNN training time with high accuracy and that our scalability optimizer correctly chooses the best configurations, minimizing the training time of distributed DNNs.
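Below is a minimal sketch of the kind of search such a scalability optimizer performs. The cost terms in `estimate_epoch_time` and the configuration dimensions (model partitions per replica, number of replicas, parameter-server shards) are illustrative assumptions only; the paper's actual performance models capture the DNN and hardware characteristics in far more detail.

```python
from itertools import product
from collections import namedtuple

DNN = namedtuple("DNN", ["flops_per_example", "num_weights", "num_examples"])
Cluster = namedtuple("Cluster", ["machines", "flops_per_machine", "bandwidth"])

def estimate_epoch_time(dnn, cluster, partitions, replicas, ps_shards):
    """Hypothetical performance model: per-epoch time as compute plus
    communication. Compute shrinks with more model partitions and replicas;
    weight exchange shrinks with more parameter-server shards."""
    examples_per_replica = dnn.num_examples / replicas
    compute = examples_per_replica * dnn.flops_per_example / (
        partitions * cluster.flops_per_machine)
    communication = examples_per_replica * dnn.num_weights / (
        ps_shards * cluster.bandwidth)
    return compute + communication

def best_configuration(dnn, cluster, max_dim=8):
    """Enumerate feasible (partitions, replicas, ps_shards) configurations and
    return the one with the lowest estimated training time."""
    best, best_time = None, float("inf")
    for partitions, replicas, ps_shards in product(range(1, max_dim + 1), repeat=3):
        if partitions * replicas + ps_shards > cluster.machines:
            continue  # configuration does not fit on the available machines
        t = estimate_epoch_time(dnn, cluster, partitions, replicas, ps_shards)
        if t < best_time:
            best, best_time = (partitions, replicas, ps_shards), t
    return best, best_time

if __name__ == "__main__":
    dnn = DNN(flops_per_example=2e9, num_weights=5e7, num_examples=1e6)
    cluster = Cluster(machines=32, flops_per_machine=1e11, bandwidth=1e9)
    print(best_configuration(dnn, cluster))
```

Exhaustive enumeration is shown here only because the toy configuration space is tiny; the point of an analytical performance model is precisely to make this search cheap compared with empirically measuring each candidate configuration.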