Introduction
In the synchronous case, an AllReduce operation is executed for gradient reduction, while in the asynchronous case, a single central parameter server receives all gradient updates from the workers and performs the optimization step. In the asynchronous case, the overall performance is limited by the network bandwidth of the parameter server and depends on the amount of data to transmit per worker. Alternative strategies, such as asynchronous ring communications or employing MPI, can alleviate this bottleneck. In general, with an increasing number of computational nodes, the communication required to share the gradients becomes the main bottleneck. To minimize this communication overhead, the number of gradient reductions has to be reduced.
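To make the synchronous scheme concrete, the following minimal sketch (an illustration, not code from this work) shows how a gradient AllReduce can be issued manually with torch.distributed; the frameworks benchmarked below automate exactly this step.

```python
import torch
import torch.distributed as dist

# Initialize the process group (NCCL backend for GPU communication).
# Assumes launch via a tool such as torchrun, which sets the rank and
# world-size environment variables.
dist.init_process_group(backend="nccl")
world_size = dist.get_world_size()
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Each worker holds its own local gradient tensor.
grad = torch.randn(1000, device="cuda")

# Sum the gradients of all workers in-place, then average them.
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
grad /= world_size  # every worker now holds the averaged gradient
```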
The main contributions of this work are:

- A comprehensive scalability analysis and comparison of training ResNet50, ResNet101, and ResNet152 on ImageNet with Horovod, PyTorch-DDP, and DeepSpeed, using the DALI and native PyTorch data loaders on up to 1024 GPUs. The results demonstrate that, in combination with the native PyTorch data loader, DeepSpeed shows the best performance on a small number of GPUs, while Horovod and PyTorch-DDP outperform it on a larger number of GPUs. However, the use of DALI leads to higher throughput and improved scalability for all ResNet architectures.
- An assessment of the influence of step-wise, cosine, and exponential learning rate annealing on the validation accuracy for batch sizes ranging from \(BS=256\) to \(BS=65{,}536\). These findings reveal that cosine annealing delivers superior performance for small and medium batch sizes, while exponential annealing achieves the highest accuracy for the largest batch size.
Related work
Framework | Parallelism | Communication |
---|---|---|
DistBelief [18] | Model + Data | Asynchronous |
FireCaffe [21] | Data | Synchronous |
Horovod [5] | Model + Data | Synchronous |
MXNet [23] | Model + Data | Bounded Asynchronous |
Petuum [19] | Model + Data | Bounded Asynchronous |
TensorFlow [22] | Model + Data | Bounded Asynchronous |
PyTorch-DDP [6] | Model + Data | Synchronous |
DeepSpeed [7] | Model + Data | Synchronous |
Overview of the benchmark setup
Horovod
Horovod uses the Ring-AllReduce approach [34] to compute the gradient reduction instead of a single parameter server receiving all the updates, cf. the "Introduction" section. It relies on low-level communication libraries such as MPI, the NVIDIA Collective Communications Library (NCCL) [35], or Facebook Gloo [36]. It has been observed that the NCCL AllReduce yields superior performance on NVIDIA GPUs [6]. In addition, Horovod can batch several small reductions into a single AllReduce operation. This batching of communication operations is known as tensor fusion [5]: smaller data volumes are transferred across different workers by locally fusing the data that are ready to be reduced. Hence, fewer AllReduce operations are required. In large neural networks with a large number of parameters, this technique is expected to yield substantial parallel performance gains.
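A minimal sketch of how Horovod is typically integrated into a PyTorch training script (an illustration based on the standard Horovod API, with a placeholder model; not the exact snippet from this work) looks as follows:

```python
import torch
import horovod.torch as hvd

hvd.init()  # initialize Horovod and the underlying MPI/NCCL communicator
torch.cuda.set_device(hvd.local_rank())  # pin each process to one GPU

model = torch.nn.Linear(1024, 1000).cuda()  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1 * hvd.size())

# Wrap the optimizer: gradients are averaged via Ring-AllReduce, with
# tensor fusion batching small tensors into fewer messages.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Ensure all workers start from identical parameters and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```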
PyTorch-distributed data parallel

PyTorch-DDP likewise builds on the AllReduce paradigm (with the communication libraries NCCL, Gloo, or MPI) for updating the gradients used in deep neural networks. To trigger the communication operation, a custom ‘hook’ is registered in the internal automatic differentiation engine that is integrated into the backward pass operation of deep neural networks [6]. Separate code for managing the communication is hence not required.
Instead of launching an AllReduce operation immediately, the algorithm waits for a few processor cycles once a batch of gradients is complete and buckets (or ‘fuses’ in the sense of Horovod) multiple gradient parameters into a single parallel operation. Hence, computation and communication are overlapped, skipping frequent gradient synchronizations. A drawback of this method is a possible mismatch in the AllReduce operation if the reduction order is not the same across all processes, resulting in an incorrect reduction or data inconsistencies. This issue is addressed by bucketing the gradients in the reverse of the order obtained during the forward pass, motivated by the fact that the last layers of a network are likely the first ones to finish computation during the backward pass. Another issue is bucketed gradients that are skipped and never enter the AllReduce operation. PyTorch-DDP handles this with a participation algorithm, which checks the output tensors during the forward pass to find all non-participating parameters (i.e., parameters whose gradients have not been updated) in the current iteration, so that they are included in the next iteration.
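The corresponding PyTorch-DDP integration is similarly compact. The following is a minimal sketch (illustrative, with a placeholder model; it assumes the environment variables set by a launcher such as torchrun):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Process group setup; rank and world size come from the launcher.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1000).cuda()  # placeholder model

# DDP registers autograd hooks that bucket gradients (in reverse
# registration order) and overlap AllReduce with the backward pass.
ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

inputs = torch.randn(32, 1024, device="cuda")
labels = torch.randint(0, 1000, (32,), device="cuda")

loss = criterion(ddp_model(inputs), labels)
loss.backward()   # gradient buckets are all-reduced as they become ready
optimizer.step()
```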
DeepSpeed
In DeepSpeed, an AllReduce communication step is likewise necessary to ensure consistency. It is performed as a two-step process: first, different parts of the data are distributed to different workers with a ReduceScatter command; then, each worker gathers the different chunks of data with an AllGather operation [26]. DeepSpeed integrates directly with PyTorch, which is currently the only supported DL backend; a sketch of this integration is shown below.
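A minimal sketch of this integration (illustrative; the placeholder model and the contents of ds_config.json are assumptions, and the script is launched with the deepspeed launcher):

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 1000)  # placeholder model

# deepspeed.initialize wraps the model and sets up data parallelism;
# batch size, optimizer, ZeRO stage, etc. are read from a JSON config.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",  # assumed config file with training settings
)

inputs = torch.randn(32, 1024, device=model_engine.device)
labels = torch.randint(0, 1000, (32,), device=model_engine.device)

loss = torch.nn.functional.cross_entropy(model_engine(inputs), labels)
model_engine.backward(loss)  # handles gradient reduction internally
model_engine.step()          # optimizer step + learning rate schedule
```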
Data loaders and dataset compression
The first data loader is the native PyTorch data loader, which reads the ImageNet dataset as raw images in JPG format. This data loader only supports raw image data and performs all pre-processing steps on the CPUs. The other data loader is the NVIDIA Data Loading Library (DALI) [37], where a compressed version (TFRecord) of the ImageNet dataset is used. DALI is an open-source framework that accelerates the data-loading process in DL applications by involving the GPU, following a pipeline-based approach. Usually, the GPU runs computations much faster than the CPU can load data. The idea of DALI is to prevent the GPU from starving by moving the data-loading process to the GPU at an early stage. The GPU then performs the data pre-processing, such as image resizing, cropping, and normalization, on the fly. By pipelining these operations and executing them directly on the GPU, DALI minimizes the amount of data that needs to be transferred between the CPU and GPU, which reduces the associated overhead. DALI supports multiple data formats, and with its unified interface, it is easy to integrate into all common DL frameworks. With this seamless integration, developers can exploit the full potential of their GPU-based systems without significantly modifying their existing workflows or switching between different data-loading libraries. While the main focus of DALI is the GPU-based approach, it also offers the possibility to run all pipeline steps, including the pre-processing, on the CPU. Initial benchmarks show a speed-up of 20 to \(40\%\) in throughput compared to the original PyTorch data loader [38]. The difference in size between the TFRecord version of the ImageNet dataset (144 GB) and the raw JPG data (154 GB) is marginal. However, the file structure of the TFRecord dataset is much better suited for data loading than the over one million single image files of the raw dataset.
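As an illustration of the pipeline-based approach, a minimal DALI training pipeline for PyTorch might look as follows (a sketch using the raw-JPG file reader for brevity; the TFRecord variant would use fn.readers.tfrecord instead, and the path, batch size, and normalization values are assumptions):

```python
from nvidia.dali import pipeline_def, fn, types
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def
def imagenet_pipeline(data_dir):
    # Read encoded JPEGs on the CPU ...
    jpegs, labels = fn.readers.file(
        file_root=data_dir, random_shuffle=True, name="Reader"
    )
    # ... decode them with the hybrid CPU/GPU decoder ...
    images = fn.decoders.image(jpegs, device="mixed")
    # ... and run augmentation plus normalization on the GPU.
    images = fn.random_resized_crop(images, size=224)
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )
    return images, labels.gpu()

# Assumed batch size and dataset path, for illustration only.
pipe = imagenet_pipeline(batch_size=256, num_threads=4, device_id=0,
                         data_dir="/data/imagenet/train")
pipe.build()
loader = DALIGenericIterator(pipe, ["data", "label"], reader_name="Reader")
```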
GPU scaling issues

When scaling to a large number of GPUs, the communication overhead can be reduced by efficient AllReduce methods [39, 40]. Load balancing is crucial for ensuring an even distribution of the computational workload across all GPUs, maximizing resource utilization. An uneven workload distribution can lead to idle GPUs, wasting resources and increasing runtime. Dynamic load balancing algorithms and data or model parallelism techniques can help distribute tasks and data efficiently across multiple GPUs. Memory limitations pose a challenge when large models (or datasets) exceed the memory capacity of a single GPU, causing out-of-memory errors or forcing smaller batch sizes, which can negatively impact performance and convergence.

Residual neural networks
Learning rate scheduling
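The three learning rate annealing strategies compared in this work (step-wise, cosine, and exponential) are all available as standard PyTorch schedulers. A minimal sketch with assumed hyperparameters (step size, decay factors, epoch count):

```python
import torch

model = torch.nn.Linear(1024, 1000)  # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Step-wise annealing: multiply the learning rate by gamma
# every step_size epochs.
step = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)

# Cosine annealing: decay the learning rate along a cosine
# curve over T_max epochs.
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=90)

# Exponential annealing: multiply the learning rate by gamma every epoch.
exponential = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.95)

# One scheduler is chosen per run; scheduler.step() is called once per epoch.
```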
JUWELS HPC system and software stack
The benchmarks employ the following software stack:

- GCC 11.2.0
- OpenMPI 4.1.2
- Python 3.9.6
- CUDA 11.5
- PyTorch 1.11.0
- Horovod 0.24.3
- DeepSpeed 0.6.3
- DALI 1.12.0 (virtual environment)
Benchmark results
Efficiency
For a small number of GPUs, the relative time spent in AllReduce commands is similar for the DALI and native data loader methods. However, the difference increases rapidly with more workers. For example, in the ResNet50 case with 512 GPUs, the native data loader spends \(82.61\%\) of its time on communication, while the DALI data loader spends only \(43.48\%\). A similar trend can be observed for the trainings with ResNet101 and ResNet152, where the native data loader spends \(59.18\%\) and \(51.76\%\) of its time on communication, compared to \(37.73\%\) and \(40.36\%\) for the DALI data loader, respectively. This substantial discrepancy could explain the poor scaling behavior of the native data loader. Regarding the computation time in the cuDNN library, it decreases for both data loaders as the number of GPUs increases, which is expected since the overall computational workload is distributed across a larger number of GPUs. For all three ResNet cases, the DALI data loader consistently exhibits higher computation percentages than the native data loader, suggesting that it utilizes the GPU resources more effectively. The time spent on data loading also decreases as the number of GPUs increases for both data loaders. Although the relative data-loading time is comparable between the two data loaders, it is important to emphasize that the DALI data loader is much faster in absolute terms. For example, in the ResNet152 case on 64 GPUs, the DALI data loader is responsible for \(16.9\%\) of the total runtime, which amounts to \(\approx 25\) s in absolute timing. For the native data loader, the relative value is roughly the same at \(16.42\%\) of the total runtime, which, however, amounts to \(\approx 47\) s in absolute timing. As expected, when comparing the three ResNet models, the communication overhead slightly reduces for smaller ResNet architectures, while the computation time increases as the size of the ResNet grows. Due to the low scaling performance of the native data loader, no evaluations on 1024 GPUs were performed for this case.
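One way to collect such per-category time shares is the built-in PyTorch profiler; the following is an illustrative sketch only (an assumption, not necessarily the measurement setup used in this work):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1000).cuda()  # placeholder model
inputs = torch.randn(32, 1024, device="cuda")
labels = torch.randint(0, 1000, (32,), device="cuda")

# Record CPU and CUDA activity for one training step; AllReduce kernels,
# cuDNN kernels, and data-loading ops appear as separate entries.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    loss.backward()

# Aggregate per-operator timings, e.g., to compute relative shares.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```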
Relative time spent in AllReduce (communication), data loading (I/O), and cuDNN (computation) functions for PyTorch-DDP with the DALI and native data loaders

No. GPUs | DALI: AllReduce [\(\%\)] | DALI: data [\(\%\)] | DALI: cuDNN [\(\%\)] | Native: AllReduce [\(\%\)] | Native: data [\(\%\)] | Native: cuDNN [\(\%\)] |
---|---|---|---|---|---|---|
(a) Training of ResNet50 on ImageNet | ||||||
4 | 15.40 | 22.00 | 32.50 | 22.40 | 21.00 | 30.80 |
8 | 19.00 | 21.40 | 31.75 | 23.95 | 20.05 | 29.20 |
16 | 21.00 | 20.95 | 30.70 | 27.15 | 18.83 | 27.35 |
32 | 27.09 | 18.98 | 28.14 | 31.30 | 17.26 | 25.11 |
64 | 30.87 | 17.76 | 26.35 | 32.75 | 16.30 | 23.55 |
128 | 33.61 | 17.03 | 24.99 | 49.48 | 11.77 | 17.33 |
256 | 37.08 | 15.78 | 23.26 | 76.77 | 5.06 | 7.14 |
512 | 43.48 | 13.57 | 20.02 | 82.61 | 3.66 | 5.52 |
1024 | 46.18 | 11.56 | 17.31 | – | – | – |
(b) Training of ResNet101 on ImageNet | ||||||
4 | 13.30 | 23.00 | 46.00 | 28.65 | 22.50 | 38.12 |
8 | 20.55 | 21.25 | 41.45 | 30.15 | 18.28 | 35.52 |
16 | 24.08 | 20.37 | 39.65 | 35.67 | 16.76 | 32.71 |
32 | 25.36 | 18.71 | 36.99 | 35.46 | 14.59 | 28.43 |
64 | 37.17 | 16.69 | 33.39 | 37.69 | 15.31 | 29.88 |
128 | 36.29 | 16.74 | 34.02 | 42.32 | 13.39 | 26.38 |
256 | 39.31 | 15.54 | 31.56 | 56.43 | 11.38 | 22.83 |
512 | 37.73 | 15.40 | 31.59 | 59.18 | 11.87 | 24.45 |
1024 | 49.18 | 11.87 | 24.45 | – | – | – |
(c) Training of ResNet152 on ImageNet | ||||||
4 | 16.20 | 22.40 | 44.60 | 18.41 | 21.97 | 44.17 |
8 | 20.55 | 21.75 | 42.35 | 20.65 | 21.95 | 40.75 |
16 | 25.90 | 20.05 | 39.07 | 24.77 | 20.70 | 38.62 |
32 | 29.16 | 18.72 | 37.15 | 30.31 | 18.77 | 35.32 |
64 | 33.56 | 16.90 | 33.82 | 38.34 | 16.42 | 30.80 |
128 | 36.16 | 16.66 | 33.73 | 45.75 | 14.02 | 26.60 |
256 | 38.33 | 15.51 | 31.60 | 49.39 | 15.05 | 28.46 |
512 | 40.36 | 14.41 | 29.43 | 51.76 | 11.16 | 25.36 |
1024 | 43.21 | 13.08 | 26.99 | – | – | – |