
2017 | OriginalPaper | Chapter

Distributed Training Large-Scale Deep Architectures

Authors: Shang-Xuan Zou, Chun-Yen Chen, Jui-Lin Wu, Chun-Nan Chou, Chia-Chin Tsao, Kuan-Chieh Tung, Ting-Wei Lin, Cheng-Lung Sung, Edward Y. Chang

Published in: Advanced Data Mining and Applications

Publisher: Springer International Publishing


Abstract

Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this paper, we focus on employing the system approach to speed up large-scale training. Taking both the algorithmic and system aspects into consideration, we develop a procedure for setting mini-batch size and choosing computation algorithms. We also derive lemmas for determining the quantity of key components such as the number of GPUs and parameter servers. Experiments and examples show that these guidelines help effectively speed up large-scale deep learning training.
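To make the mini-batch-size consideration mentioned above concrete, the sketch below estimates the largest mini-batch that fits on a single GPU from the model's parameter count and a per-sample activation footprint. This is an illustrative calculation only, not the paper's procedure; the function name, the assumption of float32 storage, and the 5M-activations-per-sample figure in the example are ours.

# Illustrative estimate (not the paper's procedure): the largest mini-batch
# size that fits in one GPU's memory, assuming float32 (4 bytes) and that
# weights, weight gradients, and per-sample activations dominate memory use.
def max_minibatch_size(param_count: int,
                       activations_per_sample: int,
                       gpu_memory_bytes: int,
                       bytes_per_value: int = 4) -> int:
    """Return the largest mini-batch size that fits in GPU memory."""
    # Memory that does not scale with batch size: weights + weight gradients.
    fixed = 2 * param_count * bytes_per_value
    # Memory that grows linearly with batch size: activations per sample.
    per_sample = activations_per_sample * bytes_per_value
    if gpu_memory_bytes <= fixed:
        return 0
    return (gpu_memory_bytes - fixed) // per_sample

# Example: a ~60M-parameter AlexNet-like model on a 12 GB GPU, with a
# hypothetical 5M activation values per sample -> roughly 620 samples.
print(max_minibatch_size(60_000_000, 5_000_000, 12 * 1024**3))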

Footnotes
1
GPU instances on Google Compute Engine (GCE) do not support GPU peer-to-peer access, and hence we defer our GCE experiments until such support is available.
 
2
For each training instance, we need to store the gradients of all model parameters. The gradients aggregated over the entire mini-batch are also required.
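As an illustration of the memory requirement this footnote describes, the hypothetical sketch below counts the bytes needed for one gradient buffer per example in a mini-batch plus one aggregated-gradient buffer, assuming float32 parameters; the function name and the example figures are ours, not the paper's.

# Hypothetical illustration of footnote 2: per-instance gradients for each of
# the batch_size examples, plus one aggregated-gradient buffer, all sized like
# the model's parameter vector (float32 assumed).
def gradient_memory_bytes(param_count: int,
                          batch_size: int,
                          bytes_per_value: int = 4) -> int:
    per_instance = batch_size * param_count * bytes_per_value  # one gradient per example
    aggregated = param_count * bytes_per_value                 # gradients summed over the batch
    return per_instance + aggregated

# Example: ~60M parameters and a mini-batch of 256 examples -> about 57 GiB.
print(gradient_memory_bytes(60_000_000, 256) / 1024**3)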
 
3
AlexNet achieved an 18.2% top-5 error rate in the ILSVRC-2012 competition, whereas we obtained 21% in our experiments because we did not apply all the data-augmentation and fine-tuning tricks. We chose 25% as the termination criterion to demonstrate convergence behavior under different mini-batch sizes.
 
4
nvprof profiles only GPU activities, so CPU activities cannot be analyzed.
 
Metadata
Title
Distributed Training Large-Scale Deep Architectures
Authors
Shang-Xuan Zou
Chun-Yen Chen
Jui-Lin Wu
Chun-Nan Chou
Chia-Chin Tsao
Kuan-Chieh Tung
Ting-Wei Lin
Cheng-Lung Sung
Edward Y. Chang
Copyright Year
2017
DOI
https://doi.org/10.1007/978-3-319-69179-4_2
