DOI: 10.1145/2783258.2783270

Performance Modeling and Scalability Optimization of Distributed Deep Learning Systems

Published: 10 August 2015

ABSTRACT

Big deep neural network (DNN) models trained on large amounts of data have recently achieved the best accuracy on hard tasks, such as image and speech recognition. Training these DNNs using a cluster of commodity machines is a promising approach, since training is time-consuming and compute-intensive. To enable training of extremely large DNNs, models are partitioned across machines. To expedite training on very large data sets, multiple model replicas are trained in parallel on different subsets of the training examples, with a global parameter server maintaining shared weights across these replicas. The correct choice of model and data partitioning and of overall system provisioning depends heavily on the DNN and on the hardware characteristics of the distributed system. Making these decisions currently requires significant domain expertise and time-consuming empirical exploration of the state space.
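To make the data-parallel setup described above more concrete, below is a minimal, self-contained sketch of synchronous SGD with a central parameter server: several model replicas compute gradients on disjoint shards of the training data and push them to a server that holds the shared weights. This is only an illustration of the general architecture the abstract describes, not the paper's system; the toy linear model, learning rate, shard layout, and step count are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

class ParameterServer:
    """Holds the shared weights; replicas pull weights and push gradients."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull_weights(self):
        return self.w.copy()

    def push_gradient(self, grad):
        # Synchronous SGD update applied by the server.
        self.w -= self.lr * grad

def replica_gradient(w, X, y):
    # Mean-squared-error gradient for one replica's data shard.
    return 2.0 * X.T @ (X @ w - y) / len(y)

# Synthetic data, split into one shard per model replica.
dim, n, n_replicas = 5, 1200, 4
true_w = rng.normal(size=dim)
X = rng.normal(size=(n, dim))
y = X @ true_w
shards = np.array_split(np.arange(n), n_replicas)

ps = ParameterServer(dim)
for step in range(200):
    for shard in shards:                  # each replica trains on its own shard
        w_local = ps.pull_weights()       # replica fetches current shared weights
        grad = replica_gradient(w_local, X[shard], y[shard])
        ps.push_gradient(grad)            # replica sends its gradient to the server

print("distance from true weights:", np.linalg.norm(ps.w - true_w))
```

The model-parallel dimension the abstract also mentions, in which the weights themselves are partitioned across machines, is omitted here to keep the sketch short.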

This paper develops performance models that quantify the impact of these partitioning and provisioning decisions on overall distributed system performance and scalability. We also use these performance models to build a scalability optimizer that efficiently determines the system configuration that minimizes DNN training time. We evaluate our performance models and scalability optimizer using a state-of-the-art distributed DNN training framework on two benchmark applications. The results show that our performance models estimate DNN training time with high accuracy and that our scalability optimizer correctly chooses the best configurations, minimizing the training time of distributed DNNs.
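The abstract does not reproduce the paper's actual performance models, so the sketch below only illustrates the overall workflow they enable: evaluate an analytic estimate of training time for each candidate configuration, then pick the configuration with the minimum estimate. The cost formula and every constant in it are placeholders assumed for the example, not the paper's model.

```python
def estimated_epoch_time(n_workers,
                         compute_time_single_worker=600.0,
                         sync_cost_per_worker=4.0,
                         fixed_overhead=10.0):
    # Placeholder analytic model: per-epoch compute is divided across
    # data-parallel workers, while parameter-server synchronization cost
    # grows with the number of workers. All constants are made up.
    return (compute_time_single_worker / n_workers
            + sync_cost_per_worker * n_workers
            + fixed_overhead)

# "Scalability optimizer" in miniature: score every candidate configuration
# (here just the worker count) and keep the one with the lowest estimate.
candidates = range(1, 65)
best = min(candidates, key=estimated_epoch_time)
print(f"best worker count: {best}, "
      f"estimated epoch time: {estimated_epoch_time(best):.1f} s")
```

With this placeholder cost function, adding workers first helps (compute is divided) and then hurts (synchronization grows), so the search settles on an intermediate worker count; the paper's optimizer searches a richer configuration space that also covers model partitioning and system provisioning.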


Supplemental Material

p1355.mp4 (MP4 video, 150.2 MB)


      • Published in

        KDD '15: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
        August 2015
        2378 pages
        ISBN: 9781450336642
        DOI: 10.1145/2783258

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 10 August 2015


        Qualifiers

        • research-article

        Acceptance Rates

        KDD '15 Paper Acceptance Rate: 160 of 819 submissions, 20%. Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%.

