Published in: Soft Computing 3/2019

06.09.2017 | Methodologies and Application

Dual buffer rotation four-stage pipeline for CPU–GPU cooperative computing

Authors: Tao Li, Qiankun Dong, Yifeng Wang, Xiaoli Gong, Yulu Yang


Abstract

Accelerators such as GPUs have become popular general-purpose computing devices in the field of high-performance computing. With their growing storage and computational capacity, it is increasingly important to solve complex scientific and engineering problems on CPU–GPU heterogeneous systems in the big data era. Compute-intensive problems have already been solved successfully with CPU–GPU cooperative computing. However, large-scale data-intensive problems remain difficult to handle, especially those limited by GPU device memory. In this paper, the dual buffer rotation four-stage pipeline (DBFP) mechanism is proposed for CPU–GPU cooperative computation to efficiently handle data-intensive problems that require more memory than a single GPU provides. A data-block-partition-based pipeline computing strategy is designed on top of the DBFP mechanism. On the one hand, it removes the bottleneck of limited GPU device memory; on the other hand, it exploits the high-performance computing capability of both CPU and GPU by overlapping data transfer with computation. Furthermore, the DBFP mechanism is easy to extend to heterogeneous systems equipped with multiple GPUs while maintaining high resource utilization. The results show that it achieves 99% and 90% of the theoretical performance of dense general matrix multiplication on one GPU and two GPUs, respectively, with Nvidia GTX480 or K40 GPUs. It also enables the K-means and T-nearest-neighbor algorithms to process larger datasets that were previously limited by GPU device memory. Dynamic task scheduling on two GPUs yields nearly 1.9-fold performance gains when the bottleneck is GPU computation or data transmission.
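To make the overlap idea concrete, the sketch below shows a generic double-buffered CUDA pipeline: a large dataset is partitioned into blocks that fit in device memory, and two device buffers with two streams are rotated so that the transfers for one block can overlap the computation on another. This is only an illustrative sketch of the general technique the abstract describes, not the authors' DBFP implementation (which uses a four-stage pipeline and CPU–GPU co-scheduling); names such as scale_kernel, BLOCK_ELEMS, and NUM_BLOCKS are assumptions made for the example.

```cuda
// A minimal sketch, assuming a generic element-wise workload, of double-buffered
// CPU-GPU pipelining with CUDA streams. It is NOT the DBFP code from the paper;
// it only illustrates the overlap idea: partition the data into blocks that fit
// in GPU memory and rotate two device buffers so transfers and kernels overlap.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                                     \
  do {                                                                       \
    cudaError_t err_ = (call);                                               \
    if (err_ != cudaSuccess) {                                               \
      fprintf(stderr, "CUDA error %s at %s:%d\n",                            \
              cudaGetErrorString(err_), __FILE__, __LINE__);                 \
      exit(EXIT_FAILURE);                                                    \
    }                                                                        \
  } while (0)

// Placeholder compute stage (assumption): scales every element of a block.
__global__ void scale_kernel(float *data, size_t n, float alpha) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n) data[i] *= alpha;
}

int main() {
  const size_t BLOCK_ELEMS = 1 << 20;   // elements per data block (assumption)
  const int    NUM_BLOCKS  = 8;         // blocks in the "large" dataset
  const size_t TOTAL       = BLOCK_ELEMS * NUM_BLOCKS;

  // Pinned host memory is required for cudaMemcpyAsync to overlap with kernels.
  float *h_data = nullptr;
  CUDA_CHECK(cudaMallocHost((void **)&h_data, TOTAL * sizeof(float)));
  for (size_t i = 0; i < TOTAL; ++i) h_data[i] = 1.0f;

  // Two device buffers + two streams: while one buffer is being computed on,
  // the other is being filled or drained (the "dual buffer rotation" idea).
  float *d_buf[2];
  cudaStream_t stream[2];
  for (int b = 0; b < 2; ++b) {
    CUDA_CHECK(cudaMalloc((void **)&d_buf[b], BLOCK_ELEMS * sizeof(float)));
    CUDA_CHECK(cudaStreamCreate(&stream[b]));
  }

  const int threads = 256;
  const int blocks  = (int)((BLOCK_ELEMS + threads - 1) / threads);

  for (int k = 0; k < NUM_BLOCKS; ++k) {
    int b = k % 2;                                  // rotate the two buffers
    float *h_block = h_data + (size_t)k * BLOCK_ELEMS;

    // Host -> device copy of block k (asynchronous, in this buffer's stream).
    CUDA_CHECK(cudaMemcpyAsync(d_buf[b], h_block, BLOCK_ELEMS * sizeof(float),
                               cudaMemcpyHostToDevice, stream[b]));
    // GPU computation on block k.
    scale_kernel<<<blocks, threads, 0, stream[b]>>>(d_buf[b], BLOCK_ELEMS, 2.0f);
    // Device -> host copy of the result for block k.
    CUDA_CHECK(cudaMemcpyAsync(h_block, d_buf[b], BLOCK_ELEMS * sizeof(float),
                               cudaMemcpyDeviceToHost, stream[b]));
    // Consecutive blocks go to alternating streams, so the transfers of
    // block k+1 can hide behind the computation of block k.
  }

  for (int b = 0; b < 2; ++b) CUDA_CHECK(cudaStreamSynchronize(stream[b]));
  printf("h_data[0] after pipeline: %f (expected 2.0)\n", h_data[0]);

  for (int b = 0; b < 2; ++b) {
    CUDA_CHECK(cudaStreamDestroy(stream[b]));
    CUDA_CHECK(cudaFree(d_buf[b]));
  }
  CUDA_CHECK(cudaFreeHost(h_data));
  return 0;
}
```

Profiling such a program (for example with Nsight Systems) would show the copies issued on one stream overlapping the kernel running on the other; the paper's DBFP mechanism generalizes this two-buffer, block-wise overlap to a four-stage pipeline and to multiple GPUs.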


Footnotes
1
TOP500.org.
 
2
CUDA_C_Programming_Guide.
 
6
NVIDIA. NVIDIA's next generation CUDA™ compute architecture whitepaper, V1.0 edition.
 
8
CUDA_C_Programming_Guide.
 
Metadata
Title
Dual buffer rotation four-stage pipeline for CPU–GPU cooperative computing
Authors
Tao Li
Qiankun Dong
Yifeng Wang
Xiaoli Gong
Yulu Yang
Publication date
06.09.2017
Publisher
Springer Berlin Heidelberg
Published in
Soft Computing / Issue 3/2019
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI
https://doi.org/10.1007/s00500-017-2795-0
