
2017 | OriginalPaper | Chapter

Distributed Training Large-Scale Deep Architectures

Authors: Shang-Xuan Zou, Chun-Yen Chen, Jui-Lin Wu, Chun-Nan Chou, Chia-Chin Tsao, Kuan-Chieh Tung, Ting-Wei Lin, Cheng-Lung Sung, Edward Y. Chang

Published in: Advanced Data Mining and Applications

Publisher: Springer International Publishing


Abstract

Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this paper, we focus on employing the system approach to speed up large-scale training. Taking both the algorithmic and system aspects into consideration, we develop a procedure for setting mini-batch size and choosing computation algorithms. We also derive lemmas for determining the quantity of key components such as the number of GPUs and parameter servers. Experiments and examples show that these guidelines help effectively speed up large-scale deep learning training.
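To make the mini-batch-size consideration mentioned above concrete, the sketch below estimates the largest mini-batch that fits on a single GPU from the model's parameter count and a per-sample activation footprint. This is an illustrative calculation only, not the paper's procedure; the function name, the assumption of float32 storage, and the 5M-activations-per-sample figure in the example are ours.

# Illustrative estimate (not the paper's procedure): the largest mini-batch
# size that fits in one GPU's memory, assuming float32 (4 bytes) and that
# weights, weight gradients, and per-sample activations dominate memory use.
def max_minibatch_size(param_count: int,
                       activations_per_sample: int,
                       gpu_memory_bytes: int,
                       bytes_per_value: int = 4) -> int:
    """Return the largest mini-batch size that fits in GPU memory."""
    # Memory that does not scale with batch size: weights + weight gradients.
    fixed = 2 * param_count * bytes_per_value
    # Memory that grows linearly with batch size: activations per sample.
    per_sample = activations_per_sample * bytes_per_value
    if gpu_memory_bytes <= fixed:
        return 0
    return (gpu_memory_bytes - fixed) // per_sample

# Example: a ~60M-parameter AlexNet-like model on a 12 GB GPU, with a
# hypothetical 5M activation values per sample -> roughly 620 samples.
print(max_minibatch_size(60_000_000, 5_000_000, 12 * 1024**3))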

Footnotes
1
GPU instances on Google Compute Engine (GCE) do not support GPU peer-to-peer access, and hence we defer our GCE experiments until such support is available.
 
2
For each training instance, we need to store the gradients of all model parameters. The gradients aggregated over the entire mini-batch are also required.
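As an illustration of the memory requirement this footnote describes, the hypothetical sketch below counts the bytes needed for one gradient buffer per example in a mini-batch plus one aggregated-gradient buffer, assuming float32 parameters; the function name and the example figures are ours, not the paper's.

# Hypothetical illustration of footnote 2: per-instance gradients for each of
# the batch_size examples, plus one aggregated-gradient buffer, all sized like
# the model's parameter vector (float32 assumed).
def gradient_memory_bytes(param_count: int,
                          batch_size: int,
                          bytes_per_value: int = 4) -> int:
    per_instance = batch_size * param_count * bytes_per_value  # one gradient per example
    aggregated = param_count * bytes_per_value                 # gradients summed over the batch
    return per_instance + aggregated

# Example: ~60M parameters and a mini-batch of 256 examples -> about 57 GiB.
print(gradient_memory_bytes(60_000_000, 256) / 1024**3)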
 
3
AlexNet achieved an 18.2% top-5 error rate in the ILSVRC-2012 competition, whereas we obtained 21% in our experiments because we did not apply all the data-augmentation and fine-tuning tricks. We chose 25% as the termination criterion to demonstrate convergence behavior under different mini-batch sizes.
 
4
nvprof profiles only GPU activities, so CPU activities cannot be analyzed.
 
Metadata
Title
Distributed Training Large-Scale Deep Architectures
Authors
Shang-Xuan Zou
Chun-Yen Chen
Jui-Lin Wu
Chun-Nan Chou
Chia-Chin Tsao
Kuan-Chieh Tung
Ting-Wei Lin
Cheng-Lung Sung
Edward Y. Chang
Copyright Year
2017
DOI
https://doi.org/10.1007/978-3-319-69179-4_2
