Published in: The Journal of Supercomputing 11/2021

12.04.2021

HGP4CNN: an efficient parallelization framework for training convolutional neural networks on modern GPUs

Authors: Hao Fu, Shanjiang Tang, Bingsheng He, Ce Yu, Jizhou Sun


Abstract

Graphics Processing Units (GPUs) have evolved into powerful accelerators for developing Convolutional Neural Network (CNN) models. Most existing GPU-based frameworks adopt a kernel-based execution approach and focus only on optimizing individual kernels for better performance and resource utilization. With this approach, the kernels involved are launched sequentially, which may leave GPU resources underutilized because the optimization space of a single kernel is limited. In this paper, we propose an efficient software parallelization framework, called HGP4CNN, that accelerates the training of CNN models by considering the characteristics of workloads from both the same layer and adjacent layers, as well as new GPU features such as concurrent kernel execution. At the intra-layer level, to improve the training performance of a single network layer, we design a novel model-based lightweight parallelization module that makes better use of the concurrent kernel execution feature of modern GPUs. An asynchronous resource tracker collects kernel information at runtime, and a kernel analyzer computes the number of kernels that can be dispatched concurrently. Moreover, to avoid consuming too many CPU threads or process resources, we integrate a runtime scheduler module for kernel launch and a pool-based stream manager for GPU work-queue management. At the inter-layer level, we present a pipeline execution strategy that overlaps the processing of workloads from adjacent layers; the analysis result from the intra-layer module is used to determine the number of samples processed by a single pipeline stage. Finally, we implement a prototype of the proposed framework on top of Caffe, a well-known deep learning framework, and conduct experiments with four off-the-shelf CNN models on three NVIDIA GPUs. Results show that HGP4CNN achieves better performance than the original implementation while preserving the convergence properties of the networks. We achieve a speedup of up to 6.X for a single convolutional layer and 2.X for multiple layers within pipelines of a network model.
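To make the intra-layer idea concrete, the sketch below shows the general pattern of dispatching independent per-sample kernels of one layer onto a small pool of CUDA streams, with an occupancy-based guess at how many instances can overlap. This is a minimal illustration of the technique, not HGP4CNN's actual code: the kernel `layer_kernel`, the helper `estimate_concurrency`, and the pool size are assumptions made for the example; only standard CUDA runtime calls are used.

```cpp
// Minimal sketch (hypothetical, not the paper's implementation): round-robin
// independent per-sample kernels over a stream pool so the hardware can
// execute them concurrently instead of serializing them on one queue.
#include <cuda_runtime.h>
#include <vector>
#include <algorithm>

// Toy kernel standing in for one layer's work on a single sample (here: ReLU).
__global__ void layer_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] > 0.f ? in[i] : 0.f;
}

// Rough analogue of a kernel analyzer: how many instances of this kernel
// could run at once, bounded by the number of resident blocks on the device.
int estimate_concurrency(int blocks_per_kernel, int threads_per_block) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, layer_kernel, threads_per_block, /*dynamicSMem=*/0);
    int device_blocks = blocks_per_sm * prop.multiProcessorCount;
    return std::max(1, device_blocks / std::max(1, blocks_per_kernel));
}

int main() {
    const int samples = 8, n = 1 << 18, threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Pool-based stream management: size the pool by the concurrency estimate
    // (capped here just for the demo).
    int pool_size = std::min(4, estimate_concurrency(blocks, threads));
    std::vector<cudaStream_t> pool(pool_size);
    for (auto& s : pool) cudaStreamCreate(&s);

    float *in, *out;
    cudaMalloc(&in,  samples * n * sizeof(float));
    cudaMalloc(&out, samples * n * sizeof(float));

    // Dispatch the independent per-sample kernels across the pool.
    for (int s = 0; s < samples; ++s) {
        cudaStream_t stream = pool[s % pool_size];
        layer_kernel<<<blocks, threads, 0, stream>>>(in + s * n, out + s * n, n);
    }
    cudaDeviceSynchronize();

    for (auto& s : pool) cudaStreamDestroy(s);
    cudaFree(in); cudaFree(out);
    return 0;
}
```

The same stream handles would also serve the inter-layer pipeline described in the abstract: once the analyzer has fixed how many samples a pipeline stage processes, a subsequent layer's kernels for an earlier micro-batch can be issued on a different stream than the current layer's kernels, letting adjacent layers overlap.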


Metadata
Title
HGP4CNN: an efficient parallelization framework for training convolutional neural networks on modern GPUs
Authors
Hao Fu
Shanjiang Tang
Bingsheng He
Ce Yu
Jizhou Sun
Publication date
12.04.2021
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 11/2021
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-021-03746-z
