Published in: International Journal of Machine Learning and Cybernetics 2/2024

Published online: 28.06.2023 | Original Article

2D-THA-ADMM: communication efficient distributed ADMM algorithm framework based on two-dimensional torus hierarchical AllReduce

Authors: Guozheng Wang, Yongmei Lei, Zeyu Zhang, Cunlu Peng


Abstract

Model synchronization is the communication step in large-scale distributed machine learning by which workers exchange and aggregate model parameters. As clusters scale up, synchronizing model parameters across thousands of workers becomes a challenging coordination task. First, this study proposes a hierarchical AllReduce algorithm structured on a two-dimensional torus (2D-THA), which synchronizes model parameters hierarchically to maximize bandwidth utilization. Second, it introduces a distributed consensus algorithm, 2D-THA-ADMM, which combines the 2D-THA synchronization algorithm with the alternating direction method of multipliers (ADMM). Third, we evaluate the parameter-synchronization performance of 2D-THA and the scalability of 2D-THA-ADMM on the Tianhe-2 supercomputing platform using real public datasets. Our experiments demonstrate that 2D-THA reduces synchronization time by 63.447% compared with MPI_Allreduce. Furthermore, the proposed 2D-THA-ADMM algorithm exhibits excellent scalability, training more than 3× faster than state-of-the-art methods while maintaining high accuracy and computational efficiency.
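To make the synchronization pattern concrete, the sketch below implements a torus-style hierarchical AllReduce with mpi4py: ranks are arranged in a logical rows × cols grid, each row sums its members' vectors, and the row sums are then combined across columns so that every rank ends up with the global sum. This is a minimal illustration under assumed names and grid shape, not the paper's 2D-THA implementation; in particular, 2D-THA partitions the data hierarchically to maximize bandwidth utilization, whereas this sketch collapses those phases into plain Allreduce calls for brevity.

import numpy as np
from mpi4py import MPI

def hierarchical_allreduce(local_vec, rows, cols):
    # Sum `local_vec` over a logical rows x cols process grid.
    # Phase 1: sum inside each row; Phase 2: sum the row results
    # across columns. Every rank then holds the global sum.
    world = MPI.COMM_WORLD
    rank = world.Get_rank()
    assert world.Get_size() == rows * cols

    row_id, col_id = divmod(rank, cols)
    row_comm = world.Split(color=row_id, key=col_id)   # peers in the same row
    col_comm = world.Split(color=col_id, key=row_id)   # peers in the same column

    row_sum = np.empty_like(local_vec)
    row_comm.Allreduce(local_vec, row_sum, op=MPI.SUM)    # intra-row reduction

    global_sum = np.empty_like(local_vec)
    col_comm.Allreduce(row_sum, global_sum, op=MPI.SUM)   # inter-row reduction

    row_comm.Free()
    col_comm.Free()
    return global_sum

Within a consensus-ADMM iteration, such a routine would stand in for the single MPI_Allreduce that averages the workers' local models: each worker computes its local x-update, calls the hierarchical AllReduce on its model vector, divides by the number of workers to obtain the consensus variable, and then updates its dual variable. Only the synchronization primitive changes, not the ADMM updates themselves.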

Metadata
Title
2D-THA-ADMM: communication efficient distributed ADMM algorithm framework based on two-dimensional torus hierarchical AllReduce
Authors
Guozheng Wang
Yongmei Lei
Zeyu Zhang
Cunlu Peng
Publication date
28.06.2023
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 2/2024
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-023-01903-9
