Published in: International Journal of Parallel Programming 2/2019

31.01.2018

Heterogeneous parallel_for Template for CPU–GPU Chips

Authors: Angeles Navarro, Francisco Corbera, Andres Rodriguez, Antonio Vilches, Rafael Asenjo

Abstract

Heterogeneous processors, comprising CPU cores and a GPU, are the de facto standard in desktop and mobile platforms. In many cases it is worthwhile to exploit the CPU and the GPU simultaneously. However, workload distribution poses a challenge when running irregular applications. In this paper we present LogFit, a novel adaptive partitioning strategy for parallel loops, specially designed for applications with irregular data accesses running on heterogeneous CPU–GPU architectures. Our algorithm dynamically finds the optimal chunk size that must be assigned to the GPU, and the number of iterations assigned to the CPU cores is computed adaptively to avoid load imbalance. In addition, we strive to increase the programmer's productivity by providing a high-level template that eases the coding of heterogeneous parallel loops. We evaluate LogFit's performance and energy consumption using a set of irregular benchmarks running on a heterogeneous CPU–GPU processor, an Intel Haswell. Our experimental results show that we outperform Oracle-like static and other dynamic state-of-the-art approaches both in performance, by up to 57%, and in energy savings, by up to 31%.
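The adaptive chunk-size search the abstract describes can be sketched in a few lines. This is a minimal host-side illustration under stated assumptions, not the paper's implementation: the function names, the simulated `throughput` callback, and the 5% plateau threshold are all hypothetical; a real scheduler would time actual GPU kernel launches instead of calling a model.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Illustrative sketch (names and threshold are assumptions, not the paper's
// API): exponentially grow the GPU chunk size until measured throughput stops
// improving by more than `threshold`, i.e. find the smallest chunk that
// saturates the GPU. `throughput(n)` stands in for a real timing measurement
// of a chunk of n iterations.
std::size_t find_gpu_chunk(const std::function<double(std::size_t)>& throughput,
                           std::size_t start, std::size_t max_chunk,
                           double threshold = 0.05) {
    std::size_t chunk = start;
    double prev = throughput(chunk);
    while (chunk * 2 <= max_chunk) {
        double next = throughput(chunk * 2);
        if (next < prev * (1.0 + threshold)) break;  // throughput has plateaued
        chunk *= 2;
        prev = next;
    }
    return chunk;
}

// Size each CPU core's chunk so it takes roughly as long as the GPU chunk,
// which is the kind of balancing needed to avoid load imbalance near the end
// of the iteration space. `cpu_gpu_speed_ratio` is the measured ratio of CPU
// to GPU per-iteration throughput (hypothetical input).
std::size_t cpu_chunk(std::size_t gpu_chunk, double cpu_gpu_speed_ratio) {
    auto c = static_cast<std::size_t>(gpu_chunk * cpu_gpu_speed_ratio);
    return c > 0 ? c : 1;  // never hand out an empty chunk
}
```

With a saturating throughput model such as `n / (n + 512)`, the search stops once doubling the chunk yields less than a 5% gain, which is the plateau behavior LogFit exploits to keep GPU chunks as small as load balancing allows.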


Footnotes
2
For instance, the Threading Building Blocks library (TBB) [22] recommends CPU chunk sizes that take at least 100,000 clock cycles.
 
3
RO = Read-Only; WO = Write-Only; RW = Read–Write
 
4
nEU = clGetDeviceInfo(deviceId, CL_DEVICE_MAX_COMPUTE_UNITS)
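Footnote 2's 100,000-cycle recommendation can be made concrete as a grainsize calculation. The helper below is hypothetical (not a TBB API): given an estimated per-iteration cost in clock cycles, it returns the smallest chunk meeting a minimum total cost per chunk.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of TBB: smallest chunk (grainsize) whose total
// cost reaches `min_cycles_per_chunk`, e.g. the ~100,000-cycle rule of thumb.
// Precondition: cycles_per_iter > 0.
std::size_t min_chunk(std::size_t cycles_per_iter,
                      std::size_t min_cycles_per_chunk = 100000) {
    // Ceiling division so the chunk never falls below the cycle budget.
    return (min_cycles_per_chunk + cycles_per_iter - 1) / cycles_per_iter;
}
```

For example, iterations costing about 250 cycles each would need chunks of at least 400 iterations to satisfy the rule.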
 
References
1. Augonnet, C., Clet-Ortega, J., Thibault, S., Namyst, R.: Data-aware task scheduling on multi-accelerator based platforms. In: Proceedings of ICPADS, pp. 291–298 (2010)
2. Belviranli, M., Bhuyan, L., Gupta, R.: A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures. ACM Trans. Archit. Code Optim. 9(4), 57 (2013)
3. Bueno, J., Planas, J., Duran, A., Badia, R., Martorell, X., Ayguade, E., Labarta, J.: Productive programming of GPU clusters with OmpSs. In: Proceedings of IPDPS (2012)
4. Burtscher, M., Nasre, R., Pingali, K.: A quantitative study of irregular programs on GPUs. In: Proceedings of IISWC, pp. 141–151 (2012)
5. Chatterjee, S., Grossman, M., Sbirlea, A., Sarkar, V.: Dynamic task parallelism with a GPU work-stealing runtime system. In: LNCS Series, vol. 7146, pp. 203–217 (2011)
6. Che, S., et al.: A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads. In: IISWC, pp. 1–11 (2010)
7. Danalis, A., Marin, G., McCurdy, C., et al.: The scalable heterogeneous computing (SHOC) benchmark suite. In: GPGPU, pp. 63–74 (2010)
8. Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 1–25 (2011)
10. Gibbon, P., Frings, W., Mohr, B.: Performance analysis and visualization of the n-body tree code PEPC on massively parallel computers. In: PARCO, pp. 367–374 (2005)
11. Hart, A.: The OpenACC programming model. Technical Report, Cray Exascale Research Initiative Europe (2012)
12. Intel: Intel OpenCL N-Body Sample (2014)
14. Kaleem, R., et al.: Adaptive heterogeneous scheduling for integrated GPUs. In: International Conference on Parallel Architectures and Compilation, PACT ’14, pp. 151–162 (2014)
15. Kulkarni, M., Burtscher, M., Cascaval, C., Pingali, K.: Lonestar: a suite of parallel irregular programs. In: ISPASS, pp. 65–76 (2009)
16. Li, D., Rhu, M., et al.: Priority-based cache allocation in throughput processors. In: International Symposium on High Performance Computer Architecture (HPCA) (2015)
17. Lima, J., Gautier, T., Maillard, N., Danjean, V.: Exploiting concurrent GPU operations for efficient work stealing on multi-GPUs. In: SBAC-PAD’12, pp. 75–82 (2012)
18. Luk, C.K., Hong, S., Kim, H.: Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In: Proceedings of Microarchitecture, pp. 45–55 (2009)
19. Navarro, A., Vilches, A., Corbera, F., Asenjo, R.: Strategies for maximizing utilization on multi-CPU and multi-GPU heterogeneous architectures. J. Supercomput. 70, 756–771 (2014)
20. NVidia: CUDA Toolkit 5.0 Performance Report (2013)
21. Pandit, P., Govindarajan, R.: Fluidic kernels: cooperative execution of OpenCL programs on multiple heterogeneous devices. In: CGO (2014)
22. Reinders, J.: Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O’Reilly Media, Inc. (2007)
23. Rogers, T.G., O’Connor, M., Aamodt, T.M.: Cache-conscious wavefront scheduling. In: IEEE/ACM International Symposium on Microarchitecture, MICRO-45 (2012)
24. Russel, S.: Leveraging GPGPU and OpenCL technologies for natural user interfaces. Technical Report, You i Labs Inc. (2012)
25. Sbirlea, A., Zou, Y., Budimlic, Z., Cong, J., Sarkar, V.: Mapping a data-flow programming model onto heterogeneous platforms. In: Proceedings of LCTES, pp. 61–70 (2012)
26. Wang, Z., Zheng, L., Chen, Q., Guo, M.: CPU + GPU scheduling with asymptotic profiling. Parallel Comput. 40(2), 107–115 (2014)
Metadata
Title
Heterogeneous parallel_for Template for CPU–GPU Chips
Authors
Angeles Navarro
Francisco Corbera
Andres Rodriguez
Antonio Vilches
Rafael Asenjo
Publication date
31.01.2018
Publisher
Springer US
Published in
International Journal of Parallel Programming / Issue 2/2019
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-018-0555-0
