Introduction
- Increased scalability: Distributed deep learning is increasingly necessary as neural networks and datasets grow. Training can be scaled to handle larger datasets and models, and more machines can be added to the distributed system to enable quicker training cycles.
- Resource utilization: By spreading the work across several machines, we can exploit already available resources and achieve a higher degree of parallelism, resulting in shorter training times and better utilization of existing hardware.
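As a much-simplified illustration of the data-parallel training these points refer to, the sketch below simulates synchronous distributed gradient descent in a single process: each "worker" computes a gradient on its own shard of the data, and the averaged gradient updates a shared model. The model (least squares), the number of workers, and the learning rate are all invented for the example.

```python
import numpy as np

def local_gradient(w, X, y):
    """Least-squares gradient on one worker's shard: d/dw ||Xw - y||^2 / n."""
    n = len(y)
    return 2.0 / n * X.T @ (X @ w - y)

def data_parallel_step(w, shards, lr=0.1):
    """One synchronous step: every worker computes a gradient on its shard
    (in parallel on real hardware), and the average is applied once."""
    grads = [local_gradient(w, X, y) for X, y in shards]
    return w - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

# Split the dataset across 4 simulated workers of equal size; with equal
# shards, the averaged gradient equals the exact full-batch gradient.
shards = [(X[i::4], y[i::4]) for i in range(4)]
w = np.zeros(3)
for _ in range(200):
    w = data_parallel_step(w, shards)
```

Because the shards are equally sized, this recovers exactly the centralized full-batch update; the distributed benefit is that each worker touches only a quarter of the data per step.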
Related works
Distributed machine learning algorithms
Distributed classification
Distributed boosting
Distributed SVM
Algorithm | Articles | Year | No. of references | Simulation/dataset | Evaluation metrics
---|---|---|---|---|---
Distributed boosting | [41] | 2022 | 26 | — | • Accuracy • Correctness • Communication complexity
 | [44] | 2017 | 20 | • ocr17 • ocr49 • forestcover12 • particle • ringnorm • twonorm • Yahoo! | • Error
 | [43] | 2014 | 19 | • Reuters-21578 • Medline | • Time
 | [40] | 2002 | 31 | • Covertype • Pen-based digits • Waveform • LED | • Accuracy • Speedup
Distributed SVM | [51] | 2019 | 32 | • Optical satellite images | • RMSE
 | [50] | 2015 | 14 | • Spiral dataset • MNIST • COVERTYPE | • Iterations • Parallel speedup
 | [49] | 2011 | 40 | • Images from the Corel database | • Accuracy • Training time
 | [48] | 2008 | 17 | • MNIST | • CPU seconds • Number of iterations • Communication overhead
 | [47] | 2003 | 16 | • Handwritten Chinese database ETL9B | • Error rate
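To give a flavor of the partition-based style of distributed classification surveyed above (this is an illustrative sketch, not the method of any cited paper), the code below trains one weak learner per data partition and combines them by majority vote; the decision-stump learner, the data, and the number of workers are all invented for the example.

```python
import numpy as np

def fit_stump(X, y):
    """Brute-force fit of a decision stump (one feature, one threshold)."""
    best = (0, 0.0, 1)
    best_acc = -1.0
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for s in (1, -1):
                acc = np.mean(np.where(X[:, f] > t, s, -s) == y)
                if acc > best_acc:
                    best, best_acc = (f, t, s), acc
    return best

def predict_stump(stump, X):
    f, t, s = stump
    return np.where(X[:, f] > t, s, -s)

def distributed_vote(X_train, y_train, X_test, n_workers=4):
    """Each simulated worker trains on its own partition; the ensemble
    classifies by unweighted majority vote over the local models."""
    stumps = [fit_stump(X_train[i::n_workers], y_train[i::n_workers])
              for i in range(n_workers)]
    votes = sum(predict_stump(s, X_test) for s in stumps)
    return np.sign(votes)

# Synthetic, linearly separable data: label depends on the first feature.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] > 0.2, 1, -1)
X_test = rng.normal(size=(100, 2))
y_test = np.where(X_test[:, 0] > 0.2, 1, -1)
acc = np.mean(distributed_vote(X, y, X_test) == y_test)
```

Distributed boosting schemes in the table are more sophisticated (they reweight examples and exchange statistics between sites), but the communication pattern — ship small models rather than raw data — is the same.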
Distributed clustering
Consensus-based algorithm
Distributed k-means
Algorithm | Articles | Year | No. of references | Simulation/dataset | Evaluation metrics
---|---|---|---|---|---
Consensus-based algorithm | [53] | 2016 | 35 | • Wireless sensor networks (WSNs) | • Within-cluster sum of squares (WCSS) • Iteration time
 | [54] | 2011 | 15 | • Two data sites | • Xie-Beni (XB) fuzzy clustering validity index
Distributed k-means | [59] | 2021 | 25 | • MRI image segmentation | • Number of iterations
 | [61] | 2016 | 39 | • YearPredictionMSD | • Communication costs
 | [57] | 2013 | 33 | • Mammal’s Milk • River dataset • Water treatment dataset | • Communication overhead • Computation overhead
 | [58] | 2013 | 42 | • Wireless sensor networks | • Time complexity • Memory complexity
 | [60] | 2008 | 38 | • P2P network | • Accuracy • Scalability • Communication
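The key idea behind many distributed k-means variants in the table is that Lloyd's algorithm only needs per-cluster sums and counts from each site, not the raw points. The sketch below simulates that exchange in one process; the sites, data, and initialization are invented for the example.

```python
import numpy as np

def local_stats(X, centroids):
    """Each site assigns its points to the nearest centroid and returns
    only per-cluster sums and counts -- never the raw data."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    k = len(centroids)
    sums = np.array([X[labels == j].sum(axis=0) for j in range(k)])
    counts = np.bincount(labels, minlength=k)
    return sums, counts

def distributed_kmeans(sites, centroids, n_iter=10):
    """Exact Lloyd iterations with only O(k*d) communication per site."""
    for _ in range(n_iter):
        stats = [local_stats(X, centroids) for X in sites]
        sums = sum(s for s, _ in stats)
        counts = sum(c for _, c in stats)
        centroids = sums / np.maximum(counts, 1)[:, None]
    return centroids

# Two well-separated Gaussian blobs, scattered across 3 simulated sites.
rng = np.random.default_rng(2)
data = np.vstack([rng.normal([0, 0], 1.0, size=(150, 2)),
                  rng.normal([5, 5], 1.0, size=(150, 2))])
rng.shuffle(data)
sites = [data[i::3] for i in range(3)]
centroids = distributed_kmeans(sites, np.array([[1.0, 1.0], [4.0, 4.0]]))
```

Because the combined sums and counts are identical to those a central server would compute, this converges to the same centroids as centralized k-means; consensus-based variants replace the central combination step with neighbor-to-neighbor averaging.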
Distributed deep learning
Data parallelism
Model parallelism
Pipelining parallelism
Hybrid parallelization
Algorithm | Articles | Year | No. of references | Simulation/dataset | Evaluation metrics
---|---|---|---|---|---
Data parallelism | [75] | 2022 | 61 | • ResNet110 and AlexNet models on CIFAR10 | • Train loss • Test accuracy
 | [72] | 2022 | 24 | • Matrix Classification • MovieLens • Avazu-CTR | • Convergence time per epoch • Disk I/O • Network communication
 | [65] | 2021 | 138 | • ResNet-50 on ImageNet dataset • ALBERT-large on WikiText-103 dataset | • Training time
 | [71] | 2020 | 37 | • ResNet101 on CIFAR10 dataset | • Convergence • Robustness
 | [69] | 2019 | 53 | • LeNet-5 on MNIST dataset | • Accuracy
 | [70] | 2019 | 46 | • ResNet-50 and Inception-v3 on ImageNet • LM model on One Billion Word Benchmark • NMT model on WMT English-German dataset | • Validation error • Test perplexity • BLEU
 | [73] | 2018 | 20 | • Inception V3 • ResNet-101 • VGG-16 | • Images processed per second
 | [68] | 2015 | 31 | • CNN on CIFAR and ImageNet datasets | • Test loss • Test error
 | [67] | 2012 | 29 | • ImageNet | • Accuracy
Model parallelism | [77] | 2021 | 72 | • GNN model on OGB-Product, OGB-Paper, UK-2006-05, UK-Union, Facebook datasets | • ROC
 | [79] | 2021 | 29 | • ResNet and WRN models on CIFAR-10 dataset • ResNet-18 and MobileNet v2 on Tiny-ImageNet | • Error rate
 | [76] | 2019 | 30 | • AlexNet, Inception-v3 and ResNet-101 on ImageNet dataset • RNNTC on Movie Reviews dataset • RNNLM on Penn Treebank dataset • NMT on WMT English-German dataset | • Accuracy
 | [80] | 2018 | 25 | • ResNet on CIFAR | • Accuracy
Pipelining parallelism | [81] | 2020 | 29 | • AmoebaNet-D • U-Net | • Throughput • Speedup
 | [82] | 2019 | 57 | • VGG-16 and ResNet-50 on ImageNet • AlexNet on synthetic data • GNMT-16 and GNMT-8 on WMT16 En-De • AWD LM on Penn Treebank • S2VT on MSVD | • Accuracy • Speedup
 | [83] | 2018 | 50 | • VGG16, ResNet-152, Inception v4 and SNN on CIFAR-10 • Transformer on IMDb Movie Review Sentiment Dataset • Residual LSTM on IMDb Dataset | • Speedup
 | [84] | 2017 | 25 | • VGG-A model on ImageNet | • Speedup
Hybrid parallelization | [88] | 2023 | 64 | • MATCHNET, CTRDNN, 2EMB and NCE models | • Scheduling performance • Throughput
 | [64] | 2022 | 57 | • 3D-ResAttNet on Alzheimer’s Disease Neuroimaging Initiative (ADNI) database | • Speedup • Accuracy • Training time
 | [91] | 2020 | 64 | • CosmoFlow and 3D U-Net models | • MSE
 | [86] | 2019 | 23 | • Seq2Seq RNN MT with attention on WMT14 and WMT17 datasets | • BLEU scores
 | [87] | 2019 | 120 | • SFC, SCONV, Lenet-c, Cifar-c, AlexNet, VGG-A, VGG-B, VGG-C, VGG-D and VGG-E models on MNIST, CIFAR-10 and ImageNet datasets | • Energy efficiency • Performance • Total communication
 | [85] | 2018 | 67 | • AlexNet and VGG models | • Communication overhead • Training time • Speedup
 | [90] | 2017 | 33 | • CNN on ImageNet LSVRC-2010 dataset | • Error rate
 | [89] | 2013 | 12 | • ImageNet dataset | • Error rate
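The advantage that pipelining parallelism (e.g. GPipe-style schedules) has over naive model parallelism can be captured with simple back-of-the-envelope arithmetic: with S stages and M micro-batches, a pipelined forward pass takes S + M - 1 "ticks" instead of S * M, leaving a bubble of (S - 1) idle ticks per stage. The helper below is an idealization that assumes every stage takes exactly one tick per micro-batch.

```python
def pipeline_ticks(num_stages, num_microbatches):
    """Idealized GPipe-style schedule length, in unit 'ticks'.

    Assumes every stage costs one tick per micro-batch and ignores
    communication latency and the backward pass.
    """
    sequential = num_stages * num_microbatches     # one stage active at a time
    pipelined = num_stages + num_microbatches - 1  # stages overlap
    bubble_fraction = (num_stages - 1) / pipelined # idle share per stage
    return sequential, pipelined, bubble_fraction

# With 4 stages and 8 micro-batches: 32 ticks sequentially vs 11 pipelined,
# and a bubble fraction of 3/11 -- hence the advice to use many micro-batches.
seq, pipe, bubble = pipeline_ticks(4, 8)
```

This also explains a trade-off visible in the table: more micro-batches shrink the bubble but reduce the per-micro-batch work, so real systems balance M against per-stage efficiency.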
Distributed deep reinforcement learning
Algorithm | Articles | Year | No. of references | Simulation/dataset | Evaluation metrics
---|---|---|---|---|---
Distributed deep reinforcement learning | [107] | 2022 | 42 | • Atari games | • Convergence rate • Convergence time • Running time • GPU usage • Memory usage • Bandwidth consumption
 | [106] | 2020 | 127 | • 5 Atari games: Asterix, Breakout, MsPacman, Pong and SpaceInvaders • Arcade Learning Environment • DeepMind Control Suite • Gym environments | • Mean and standard deviation • Speed
 | [105] | 2019 | 53 | • Atari-57 • DeepMind Lab • Google Research Football | • Training cost • Speed
 | [101] | 2018 | 41 | • Atari-57 • DMLab-30 | • Median and mean human-normalized scores
 | [104] | 2018 | 40 | • Atari games | • Median and mean human-normalized scores
 | [100] | 2016 | 43 | • Atari games • TORCS 3D • Mujoco • Labyrinth | • Median and mean human-normalized scores
 | [98] | 2015 | 19 | • 49 games from Atari 2600 | • Human score
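Most of the systems in the table share an actor-learner decomposition: many actors generate experience in parallel while a central learner consumes it. The sketch below simulates that pattern in one process on a toy chain MDP with tabular Q-learning; the environment, the random-exploration actors, and all hyperparameters are invented for the example (real systems such as those cited use neural networks, replay buffers, and actors running a stale copy of the learner's policy).

```python
import random

def actor_rollout(n_states, rng):
    """One episode from a simulated actor on a chain MDP: actions move
    left/right, reward 1 only at the right end. For simplicity the
    actors here explore uniformly at random."""
    s, traj = 0, []
    for _ in range(3 * n_states):
        a = rng.randint(0, 1)                       # 0 = left, 1 = right
        s2 = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
        r = 1.0 if s2 == n_states - 1 else 0.0
        traj.append((s, a, r, s2))
        s = s2
        if r > 0:                                   # reached the goal
            break
    return traj

# Central learner: off-policy Q-learning over transitions gathered from
# several actors per round (synchronous, single-process simulation).
n_states, gamma, lr = 5, 0.9, 0.5
q = {(s, a): 0.0 for s in range(n_states) for a in (0, 1)}
rng = random.Random(0)
for _ in range(200):                                # 200 learner rounds
    batch = [t for _ in range(4) for t in actor_rollout(n_states, rng)]
    for s, a, r, s2 in batch:
        q[(s, a)] += lr * (r + gamma * max(q[(s2, 0)], q[(s2, 1)]) - q[(s, a)])

# Greedy policy recovered by the learner: move right toward the reward.
greedy = [max((0, 1), key=lambda a: q[(s, a)]) for s in range(n_states - 1)]
```

Because Q-learning is off-policy, the learner can use experience from arbitrarily-behaving actors — the property that makes this decomposition scale so well in the distributed RL systems surveyed above.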
Conclusions and research directions
- Lack of attention to distributed traditional machine learning: Recent studies have focused heavily on distributed deep learning, while distributed traditional machine learning has received far less attention. Although traditional machine learning algorithms have their own advantages and have shown promising results in a number of areas, they have not been studied as extensively as deep learning in distributed settings.
- Lack of benchmarks: Most studies evaluate their proposed methods on MNIST and ImageNet, but there is no common benchmark for evaluating and comparing existing approaches. Researchers use a wide range of models, datasets, and evaluation metrics; even in distributed RL, each study evaluates its method on a different subset of Atari games. Standard benchmarks are therefore needed to compare the results of different methods.
- Interpretability: Even though DNNs perform well in many areas, understanding their results, particularly in distributed systems, can be challenging. A model’s interpretability can provide insight into the relationship between the input data and the trained model, which is especially valuable in critical domains such as healthcare. The interpretability of distributed algorithms remains an open problem.
- New issues: Distributing an algorithm raises new questions, including how the data and model are partitioned, optimality of the resulting solution, delays caused by the slowest node (stragglers), communication overhead, scalability, and aggregation of results. These issues must be addressed for distributed training to succeed and to become more accessible to data scientists and researchers.