Published in: The Journal of Supercomputing 1/2020

16.04.2019

BOA: batch orchestration algorithm for straggler mitigation of distributed DL training in heterogeneous GPU cluster

Authors: Eunju Yang, Dong-Ki Kang, Chan-Hyun Youn


Abstract

Training a deep learning model is a time-consuming job since it usually involves a large amount of data. To reduce the training time, most practitioners train their models on a GPU cluster in a distributed fashion. Synchronous stochastic gradient descent (SGD), one of the most widely used distributed training algorithms, converges quickly when multiple GPU workers are used, but its speed is tied to the slowest worker, i.e., the straggler. In a heterogeneous environment, a static straggler, which has received little attention so far, affects performance more than a randomly occurring straggler. However, most existing studies on straggler mitigation assume a homogeneous environment, so their approaches are of limited use in practice. In this paper, we scrutinize the straggler problem in a heterogeneous environment and define static and dynamic stragglers from empirical results. Based on this, we propose a novel approach called the batch orchestration algorithm (BOA) for straggler mitigation. It adaptively balances the amount of mini-batch data according to the speed of each worker, so BOA can mitigate both static and dynamic stragglers in a modern GPU cluster. BOA uses min–max integer programming, together with hardware-agnostic performance models, to find the optimal mini-batch sizes. For verification, several experiments are conducted on a cluster with up to six GPUs of three types: GTX 1080, GTX 1060 and Quadro M2000. The results show that BOA mitigates both types of stragglers and accelerates synchronous SGD training compared to another straggler mitigation method.
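The core idea described in the abstract is a min–max integer program: each worker receives a local mini-batch size matched to its measured speed, the local sizes sum to a fixed global batch size (preserving synchronous SGD semantics), and the objective minimizes the time of the slowest worker. Below is a minimal, hypothetical sketch of such a formulation in CVXPY. The linear per-worker performance model time_i(b) = a_i * b + c_i, the coefficient values, and the solver choice are illustrative assumptions, not taken from the paper.

```python
import cvxpy as cp
import numpy as np

# Illustrative per-worker performance models (assumed, not from the paper):
# iteration time of worker i with local batch size b is a[i] * b + c[i].
a = np.array([0.9, 1.3, 2.1])  # per-sample compute time per GPU (ms/sample)
c = np.array([5.0, 6.0, 8.0])  # fixed per-iteration overhead (ms), e.g. communication
B = 256                        # global mini-batch size, kept fixed for synchronous SGD

b = cp.Variable(len(a), integer=True)        # local mini-batch size of each worker
iter_time = cp.max(cp.multiply(a, b) + c)    # the straggler bounds each iteration

problem = cp.Problem(
    cp.Minimize(iter_time),                  # min-max objective over workers
    [cp.sum(b) == B,                         # preserve the global batch size
     b >= 1],                                # every worker gets some data
)
problem.solve(solver=cp.ECOS_BB)             # requires a mixed-integer-capable solver

print("local batch sizes:", np.round(b.value).astype(int))
print("predicted iteration time:", problem.value)
```

Under this formulation, faster workers process larger local batches; per-worker gradients would then be averaged with weights proportional to their local batch sizes, so the aggregated update matches a single large-batch SGD step over the full global mini-batch.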


Metadata
Title
BOA: batch orchestration algorithm for straggler mitigation of distributed DL training in heterogeneous GPU cluster
Authors
Eunju Yang
Dong-Ki Kang
Chan-Hyun Youn
Publication date
16.04.2019
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 1/2020
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-019-02845-2
