
2020 | Original Paper | Book Chapter

Rise the Momentum: A Method for Reducing the Training Error on Multiple GPUs

Authors: Yu Tang, Lujia Yin, Zhaoning Zhang, Dongsheng Li

Published in: Algorithms and Architectures for Parallel Processing

Publisher: Springer International Publishing


Abstract

Deep neural network training has received increasing attention in recent years and is typically performed with Stochastic Gradient Descent (SGD) or its variants. Distributed training increases training speed significantly but, at the same time, introduces a loss of accuracy. Increasing the batch size improves parallelism in distributed training; however, if the batch size is too large, training becomes harder and the training error grows. In this paper, we keep the total batch size fixed and lower the per-GPU batch size by increasing the number of GPUs in distributed training. We train ResNet-50 [4] on the CIFAR-10 dataset with different optimizers, such as SGD, Adam, and NAG. The experimental results show that a large batch size speeds up convergence to some degree, but if the per-GPU batch size is too small, the training process fails to converge. In other words, a large number of GPUs, which implies a small batch size on each GPU, degrades training performance in distributed training. We tried several ways to reduce the training error on multiple GPUs. According to our results, increasing the momentum is a well-behaved method for improving training performance when many GPUs are used with a constant, large total batch size.
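The remedy described in the abstract (raising the momentum coefficient when the per-GPU batch size shrinks while the total batch size stays fixed) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy least-squares objective, the worker counts, and the momentum values below are illustrative assumptions. The sketch only shows the classical momentum update, v ← μv − η∇L, applied to gradients averaged across simulated workers, which mirrors synchronous data-parallel SGD.

```python
import numpy as np

def momentum_sgd_step(w, v, grad, lr=0.1, mu=0.9):
    """Classical (heavy-ball) momentum update: v <- mu*v - lr*grad; w <- w + v."""
    v = mu * v - lr * grad
    return w + v, v

def simulated_distributed_step(w, v, X, y, num_gpus, per_gpu_batch,
                               lr, mu, rng):
    """One synchronous data-parallel step on a toy least-squares objective.
    Each simulated 'GPU' draws its own mini-batch and computes a local gradient;
    the gradients are averaged (as an all-reduce would) before one momentum-SGD update."""
    grads = []
    for _ in range(num_gpus):
        idx = rng.integers(0, len(X), size=per_gpu_batch)
        Xb, yb = X[idx], y[idx]
        # Gradient of 0.5 * ||Xb @ w - yb||^2 / per_gpu_batch
        grads.append(Xb.T @ (Xb @ w - yb) / per_gpu_batch)
    g = np.mean(grads, axis=0)
    return momentum_sgd_step(w, v, g, lr=lr, mu=mu)

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.normal(size=(4096, 20))
    w_true = rng.normal(size=20)
    y = X @ w_true + 0.01 * rng.normal(size=4096)

    # Same total batch size (256) split over more workers -> smaller per-GPU batch.
    # The chapter's suggestion is to raise the momentum coefficient in that regime.
    for num_gpus, mu in [(2, 0.9), (8, 0.9), (8, 0.95)]:
        w, v = np.zeros(20), np.zeros(20)
        for _ in range(200):
            w, v = simulated_distributed_step(
                w, v, X, y, num_gpus=num_gpus,
                per_gpu_batch=256 // num_gpus, lr=0.05, mu=mu, rng=rng)
        err = np.linalg.norm(w - w_true)
        print(f"GPUs={num_gpus}, per-GPU batch={256 // num_gpus}, mu={mu}: |w - w*| = {err:.4f}")
```

The printed errors only demonstrate the mechanics of the update; the actual effect of raising the momentum on ResNet-50/CIFAR-10 under multi-GPU training is what the chapter measures experimentally.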

Appendices
Accessible only with authorization
References
1. Li, D., et al.: HPDL: towards a general framework for high-performance distributed deep learning. In: Proceedings of the 39th IEEE International Conference on Distributed Computing Systems (IEEE ICDCS) (2019)
2. Szegedy, C., Ioffe, S., Vanhoucke, V., et al.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: AAAI, vol. 4, p. 12 (2017)
3. Chollet, F.: Xception: deep learning with depthwise separable convolutions. arXiv preprint (2016)
4. He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
5. Huang, G., Liu, Z., Weinberger, K.Q., et al.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 2, p. 3 (2017)
7. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
8. Dai, J., Li, Y., He, K., Sun, J.: R-FCN: object detection via region-based fully convolutional networks. In: NIPS, pp. 379–387 (2016)
9. Qin, Z., Zhang, Z., Chen, X., et al.: FD-MobileNet: improved MobileNet with a fast downsampling strategy. arXiv preprint arXiv:1802.03750 (2018)
10. Li, M., et al.: Scaling distributed machine learning with the parameter server. In: Proceedings of OSDI, pp. 583–598 (2014)
11. Chen, T., et al.: MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015)
12. Smith, S.L., Le, Q.V.: A Bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451 (2017)
13. Smith, S.L., Kindermans, P.-J., Le, Q.V.: Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489 (2017)
14. Krizhevsky, A.: One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997 [cs.NE] (2014)
15. Nitish, S.K., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large-batch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836 (2016)
17. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
18. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
19.
20. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pp. 448–456 (2015)
21.
22.
23. Akiba, T., Suzuki, S., Fukuda, K.: Extremely large minibatch SGD: training ResNet-50 on ImageNet in 15 minutes. arXiv preprint arXiv:1711.04325 (2017)
24. Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y.: Entropy-SGD: biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838 (2016)
25. You, Y., Zhang, Z., Hsieh, C.-J., Demmel, J., Keutzer, K.: ImageNet training in minutes. CoRR, abs/1709.05011 (2017)
26.
27. Li, Q., Tai, C., Weinan, E.: Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251 (2017)
29. Chen, J., Pan, X., Monga, R., Bengio, S., Jozefowicz, R.: Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981 [cs.LG] (2016)
30. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838 [stat.ML] (2016)
32. Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. 155(1–2), 267–305 (2014)
33. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
34. Tieleman, T., Hinton, G.: Lecture 6.5-RMSProp, COURSERA: Neural Networks for Machine Learning. Technical report, University of Toronto (2012)
35. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(Jul), 2121–2159 (2011)
36. Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN SSSR (Transl. Soviet Math. Dokl.) 269, 543–547 (1983)
37. Qian, N.: On the momentum term in gradient descent learning algorithms. Neural Netw. 12(1), 145–151 (1999)
Metadata
Title
Rise the Momentum: A Method for Reducing the Training Error on Multiple GPUs
Authors
Yu Tang
Lujia Yin
Zhaoning Zhang
Dongsheng Li
Copyright Year
2020
DOI
https://doi.org/10.1007/978-3-030-38961-1_4