Published in: The Journal of Supercomputing 2/2018

13 September 2017

Out-of-core implementation for accelerator kernels on heterogeneous clouds

Authors: Hamidreza Khaleghzadeh, Ziming Zhong, Ravi Reddy, Alexey Lastovetsky

Abstract

Cloud environments today increasingly feature hybrid nodes that combine multicore CPUs with a diverse mix of accelerators, such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors, and Field-Programmable Gate Arrays (FPGAs), to ease the migration of HPC workloads to the cloud. While virtualization of accelerators in clouds is a leading research challenge, this paper addresses the programming challenges that arise when executing large instances of data-parallel applications on these accelerators. In a typical hybrid node in a cloud, the tight integration of accelerators with multicore CPUs via PCI-E communication links has inherent limitations, namely the limited main memory of the accelerators and the limited bandwidth of the PCI-E links. These limitations pose formidable programming challenges to the execution of large problem sizes on these accelerators. In this paper, we describe a library containing interfaces (HCLOOC) that addresses these challenges. It employs optimal software pipelines to overlap data transfers between the host CPU and the accelerator with computations on the accelerator. It is designed using the fundamental building blocks available on each platform: OpenCL command queues for FPGAs, Intel offload streams for Intel Xeon Phis, and CUDA streams and events, which allow concurrent utilization of the copy and execution engines provided in NVIDIA GPUs. We elucidate the key features of our library using an out-of-core implementation of matrix multiplication of large dense matrices on a hybrid node, an Intel Haswell multicore CPU server hosting three accelerators: an NVIDIA K40c GPU, an Intel Xeon Phi 3120P, and a Xilinx FPGA. Based on experiments with the GPU, we show that our out-of-core implementation achieves 82% of the peak double-precision floating-point performance of the GPU and a speedup of 2.7 times over NVIDIA's out-of-core matrix multiplication implementation (CUBLAS-XT). We also demonstrate that our implementation exhibits a 0% drop in performance when the problem size exceeds the main memory of the GPU. We observe the same 0% drop for our implementations for the Intel Xeon Phi and the Xilinx FPGA.
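
The overlap of transfers and computation described in the abstract can be illustrated with a small timing model. The sketch below is not the HCLOOC interface and does not call CUDA, OpenCL, or Intel offload APIs. It assumes three independent engines (host-to-device copy, kernel execution, device-to-host copy) and a fixed number of device buffers. The function names `serial_time` and `pipelined_time`, the `buffers` parameter, and the stage times are all hypothetical and chosen only for illustration.

```python
def serial_time(chunks):
    """Total time when each chunk is copied in, computed, and copied out
    before the next chunk starts (no overlap between stages)."""
    return sum(h2d + kernel + d2h for h2d, kernel, d2h in chunks)


def pipelined_time(chunks, buffers=2):
    """Total time for a three-stage software pipeline.

    Assumptions (illustrative, not taken from the paper):
    - one copy-in engine, one execution engine, one copy-out engine,
      each processing chunks in order, one at a time;
    - the device holds `buffers` chunk-sized buffers, so chunk i cannot
      start its copy-in until chunk i - buffers has been copied out.
    """
    h2d_free = kernel_free = d2h_free = 0.0  # time each engine becomes idle
    released = []  # time at which each chunk's device buffer is free again
    for i, (h2d, kernel, d2h) in enumerate(chunks):
        buffer_ready = released[i - buffers] if i >= buffers else 0.0
        h2d_done = max(h2d_free, buffer_ready) + h2d
        kernel_done = max(h2d_done, kernel_free) + kernel
        d2h_done = max(kernel_done, d2h_free) + d2h
        h2d_free, kernel_free, d2h_free = h2d_done, kernel_done, d2h_done
        released.append(d2h_done)
    return d2h_free


# Hypothetical per-chunk stage times (arbitrary units): copy in, compute, copy out.
chunks = [(1.0, 3.0, 1.0)] * 4
print(serial_time(chunks))
print(pipelined_time(chunks, buffers=2))
print(pipelined_time(chunks, buffers=1))
```

With these made-up stage times, the serial schedule takes 20 units, while the pipeline with two buffers takes 14: only the first copy-in and the last copy-out remain exposed, and the rest of the transfer time is hidden behind the kernels. With a single buffer the model degenerates to the serial schedule, which is why an out-of-core design needs at least double buffering on the device to overlap anything.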

Metadata
Title
Out-of-core implementation for accelerator kernels on heterogeneous clouds
Authors
Hamidreza Khaleghzadeh
Ziming Zhong
Ravi Reddy
Alexey Lastovetsky
Publication date
13 September 2017
Publisher
Springer US
Published in
The Journal of Supercomputing, Issue 2/2018
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-017-2141-4