Published in: International Journal of Parallel Programming 2/2019

31.01.2018

Heterogeneous parallel_for Template for CPU–GPU Chips

Authors: Angeles Navarro, Francisco Corbera, Andres Rodriguez, Antonio Vilches, Rafael Asenjo

Abstract

Heterogeneous processors, comprising CPU cores and a GPU, are the de facto standard in desktop and mobile platforms. In many cases it is worthwhile to exploit the CPU and the GPU simultaneously. However, workload distribution poses a challenge when running irregular applications. In this paper we present LogFit, a novel adaptive partitioning strategy for parallel loops, specially designed for applications with irregular data accesses running on heterogeneous CPU–GPU architectures. Our algorithm dynamically finds the optimal chunk size that must be assigned to the GPU, and the number of iterations assigned to the CPU cores is computed adaptively to avoid load imbalance. In addition, we strive to increase the programmer's productivity by providing a high-level template that eases the coding of heterogeneous parallel loops. We evaluate LogFit's performance and energy consumption using a set of irregular benchmarks running on a heterogeneous CPU–GPU processor, an Intel Haswell. Our experimental results show that we outperform Oracle-like static and other dynamic state-of-the-art approaches both in performance, by up to 57%, and in energy savings, by up to 31%.
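The adaptive chunk-size search the abstract describes can be sketched in a few lines. This is a minimal host-side illustration under stated assumptions, not the paper's implementation: the function names, the simulated `throughput` callback, and the 5% plateau threshold are all hypothetical; a real scheduler would time actual GPU kernel launches instead of calling a model.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Illustrative sketch (names and threshold are assumptions, not the paper's
// API): exponentially grow the GPU chunk size until measured throughput stops
// improving by more than `threshold`, i.e. find the smallest chunk that
// saturates the GPU. `throughput(n)` stands in for a real timing measurement
// of a chunk of n iterations.
std::size_t find_gpu_chunk(const std::function<double(std::size_t)>& throughput,
                           std::size_t start, std::size_t max_chunk,
                           double threshold = 0.05) {
    std::size_t chunk = start;
    double prev = throughput(chunk);
    while (chunk * 2 <= max_chunk) {
        double next = throughput(chunk * 2);
        if (next < prev * (1.0 + threshold)) break;  // throughput has plateaued
        chunk *= 2;
        prev = next;
    }
    return chunk;
}

// Size each CPU core's chunk so it takes roughly as long as the GPU chunk,
// which is the kind of balancing needed to avoid load imbalance near the end
// of the iteration space. `cpu_gpu_speed_ratio` is the measured ratio of CPU
// to GPU per-iteration throughput (hypothetical input).
std::size_t cpu_chunk(std::size_t gpu_chunk, double cpu_gpu_speed_ratio) {
    auto c = static_cast<std::size_t>(gpu_chunk * cpu_gpu_speed_ratio);
    return c > 0 ? c : 1;  // never hand out an empty chunk
}
```

With a saturating throughput model such as `n / (n + 512)`, the search stops once doubling the chunk yields less than a 5% gain, which is the plateau behavior LogFit exploits to keep GPU chunks as small as load balancing allows.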


Footnotes
2
For instance, the Threading Building Blocks library (TBB) [22] recommends CPU chunk sizes that take at least 100,000 clock cycles.
 
3
RO = Read-Only; WO = Write-Only; RW = Read–Write
 
4
nEU = clGetDeviceInfo(deviceId, CL_DEVICE_MAX_COMPUTE_UNITS)
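Footnote 2's 100,000-cycle recommendation can be made concrete as a grainsize calculation. The helper below is hypothetical (not a TBB API): given an estimated per-iteration cost in clock cycles, it returns the smallest chunk meeting a minimum total cost per chunk.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of TBB: smallest chunk (grainsize) whose total
// cost reaches `min_cycles_per_chunk`, e.g. the ~100,000-cycle rule of thumb.
// Precondition: cycles_per_iter > 0.
std::size_t min_chunk(std::size_t cycles_per_iter,
                      std::size_t min_cycles_per_chunk = 100000) {
    // Ceiling division so the chunk never falls below the cycle budget.
    return (min_cycles_per_chunk + cycles_per_iter - 1) / cycles_per_iter;
}
```

For example, iterations costing about 250 cycles each would need chunks of at least 400 iterations to satisfy the rule.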
 
References
1. Augonnet, C., Clet-Ortega, J., Thibault, S., Namyst, R.: Data-aware task scheduling on multi-accelerator based platforms. In: Proceedings of ICPADS, pp. 291–298 (2010)
2. Belviranli, M., Bhuyan, L., Gupta, R.: A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures. ACM Trans. Archit. Code Optim. 9(4), 57 (2013)
3. Bueno, J., Planas, J., Duran, A., Badia, R., Martorell, X., Ayguade, E., Labarta, J.: Productive programming of GPU clusters with OmpSs. In: Proceedings of IPDPS (2012)
4. Burtscher, M., Nasre, R., Pingali, K.: A quantitative study of irregular programs on GPUs. In: Proceedings of IISWC, pp. 141–151 (2012)
5. Chatterjee, S., Grossman, M., Sbirlea, A., Sarkar, V.: Dynamic task parallelism with a GPU work-stealing runtime system. In: LNCS Series, vol. 7146, pp. 203–217 (2011)
6. Che, S., et al.: A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads. In: IISWC, pp. 1–11 (2010)
7. Danalis, A., Marin, G., McCurdy, C., et al.: The scalable heterogeneous computing (SHOC) benchmark suite. In: GPGPU, pp. 63–74 (2010)
8. Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 1–25 (2011)
10. Gibbon, P., Frings, W., Mohr, B.: Performance analysis and visualization of the n-body tree code PEPC on massively parallel computers. In: PARCO, pp. 367–374 (2005)
11. Hart, A.: The OpenACC programming model. Technical Report, Cray Exascale Research Initiative Europe (2012)
12. Intel: Intel OpenCL N-Body Sample (2014)
14. Kaleem, R., et al.: Adaptive heterogeneous scheduling for integrated GPUs. In: International Conference on Parallel Architectures and Compilation, PACT ’14, pp. 151–162 (2014)
15. Kulkarni, M., Burtscher, M., Cascaval, C., Pingali, K.: Lonestar: a suite of parallel irregular programs. In: ISPASS, pp. 65–76 (2009)
16. Li, D., Rhu, M., et al.: Priority-based cache allocation in throughput processors. In: International Symposium on High Performance Computer Architecture (HPCA) (2015)
17. Lima, J., Gautier, T., Maillard, N., Danjean, V.: Exploiting concurrent GPU operations for efficient work stealing on multi-GPUs. In: SBAC-PAD’12, pp. 75–82 (2012)
18. Luk, C.K., Hong, S., Kim, H.: Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In: Proceedings of Microarchitecture, pp. 45–55 (2009)
19. Navarro, A., Vilches, A., Corbera, F., Asenjo, R.: Strategies for maximizing utilization on multi-CPU and multi-GPU heterogeneous architectures. J. Supercomput. 70, 756–771 (2014)
20. NVidia: CUDA Toolkit 5.0 Performance Report (2013)
21. Pandit, P., Govindarajan, R.: Fluidic kernels: cooperative execution of OpenCL programs on multiple heterogeneous devices. In: CGO (2014)
22. Reinders, J.: Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O’Reilly Media, Inc. (2007)
23. Rogers, T.G., O’Connor, M., Aamodt, T.M.: Cache-conscious wavefront scheduling. In: IEEE/ACM International Symposium on Microarchitecture, MICRO-45 (2012)
24. Russel, S.: Leveraging GPGPU and OpenCL technologies for natural user interfaces. Technical Report, You i Labs Inc. (2012)
25. Sbirlea, A., Zou, Y., Budimlic, Z., Cong, J., Sarkar, V.: Mapping a data-flow programming model onto heterogeneous platforms. In: Proceedings of LCTES, pp. 61–70 (2012)
26. Wang, Z., Zheng, L., Chen, Q., Guo, M.: CPU + GPU scheduling with asymptotic profiling. Parallel Comput. 40(2), 107–115 (2014)
Metadata
Title
Heterogeneous parallel_for Template for CPU–GPU Chips
Authors
Angeles Navarro
Francisco Corbera
Andres Rodriguez
Antonio Vilches
Rafael Asenjo
Publication date
31.01.2018
Publisher
Springer US
Published in
International Journal of Parallel Programming / Issue 2/2019
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-018-0555-0
