Published in: The Journal of Supercomputing 11/2021

12.04.2021

HGP4CNN: an efficient parallelization framework for training convolutional neural networks on modern GPUs

Authors: Hao Fu, Shanjiang Tang, Bingsheng He, Ce Yu, Jizhou Sun


Abstract

Graphics Processing Units (GPUs) have evolved into powerful accelerators for developing Convolutional Neural Network (CNN) models. Most existing GPU-based frameworks adopt a kernel-based execution approach and focus only on optimizing individual kernels for better performance and resource utilization. With this approach, the kernels involved are launched sequentially, which may leave GPU resources underutilized because the optimization space of a single kernel is limited. In this paper, we propose an efficient software parallelization framework, called HGP4CNN, that accelerates the training of CNN models by considering the characteristics of workloads from both the same layer and adjacent layers, as well as new GPU features such as concurrent kernel execution. At the intra-layer level, to improve the training performance of a single network layer, we design a novel model-based lightweight parallelization module that makes better use of the concurrent kernel execution feature of modern GPUs. An asynchronous resource tracker collects kernel information at runtime, and a kernel analyzer computes the number of kernels that can be dispatched concurrently. Moreover, to avoid consuming too many CPU threads or process resources, we integrate a runtime scheduler module for kernel launch and a pool-based stream manager for GPU work-queue management. At the inter-layer level, we present a pipeline execution strategy that overlaps the processing of workloads from adjacent layers; the analysis result from the intra-layer module is used to determine the number of samples processed by a single pipeline stage. Finally, we implement a prototype of the proposed framework on top of Caffe, a well-known deep learning framework, and conduct experiments with four off-the-shelf CNN models on three NVIDIA GPUs. Results show that HGP4CNN achieves better performance than the original implementation while preserving the convergence properties of the networks. We achieve a speedup of up to 6.X for a single convolutional layer and 2.X for multiple layers within pipelines of a network model.
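To make the intra-layer idea concrete, the sketch below shows the general pattern of dispatching independent per-sample kernels of one layer onto a small pool of CUDA streams, with an occupancy-based guess at how many instances can overlap. This is a minimal illustration of the technique, not HGP4CNN's actual code: the kernel `layer_kernel`, the helper `estimate_concurrency`, and the pool size are assumptions made for the example; only standard CUDA runtime calls are used.

```cpp
// Minimal sketch (hypothetical, not the paper's implementation): round-robin
// independent per-sample kernels over a stream pool so the hardware can
// execute them concurrently instead of serializing them on one queue.
#include <cuda_runtime.h>
#include <vector>
#include <algorithm>

// Toy kernel standing in for one layer's work on a single sample (here: ReLU).
__global__ void layer_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] > 0.f ? in[i] : 0.f;
}

// Rough analogue of a kernel analyzer: how many instances of this kernel
// could run at once, bounded by the number of resident blocks on the device.
int estimate_concurrency(int blocks_per_kernel, int threads_per_block) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, layer_kernel, threads_per_block, /*dynamicSMem=*/0);
    int device_blocks = blocks_per_sm * prop.multiProcessorCount;
    return std::max(1, device_blocks / std::max(1, blocks_per_kernel));
}

int main() {
    const int samples = 8, n = 1 << 18, threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Pool-based stream management: size the pool by the concurrency estimate
    // (capped here just for the demo).
    int pool_size = std::min(4, estimate_concurrency(blocks, threads));
    std::vector<cudaStream_t> pool(pool_size);
    for (auto& s : pool) cudaStreamCreate(&s);

    float *in, *out;
    cudaMalloc(&in,  samples * n * sizeof(float));
    cudaMalloc(&out, samples * n * sizeof(float));

    // Dispatch the independent per-sample kernels across the pool.
    for (int s = 0; s < samples; ++s) {
        cudaStream_t stream = pool[s % pool_size];
        layer_kernel<<<blocks, threads, 0, stream>>>(in + s * n, out + s * n, n);
    }
    cudaDeviceSynchronize();

    for (auto& s : pool) cudaStreamDestroy(s);
    cudaFree(in); cudaFree(out);
    return 0;
}
```

The same stream handles would also serve the inter-layer pipeline described in the abstract: once the analyzer has fixed how many samples a pipeline stage processes, a subsequent layer's kernels for an earlier micro-batch can be issued on a different stream than the current layer's kernels, letting adjacent layers overlap.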


Metadata
Title
HGP4CNN: an efficient parallelization framework for training convolutional neural networks on modern GPUs
Authors
Hao Fu
Shanjiang Tang
Bingsheng He
Ce Yu
Jizhou Sun
Publication date
12.04.2021
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 11/2021
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-021-03746-z
