Published in: International Journal of Parallel Programming 1/2019

30.12.2017

Compiler Optimization of Accelerator Data Transfers

Authors: Matthew B. Ashcraft, Alexander Lemon, David A. Penry, Quinn Snell

Abstract

Accelerators such as GPUs, FPGAs, and many-core processors can provide significant performance improvements, but their effectiveness depends on the programmer's skill in managing their complex architectures. One area of difficulty is determining which data to transfer to and from the accelerator, and when. Poorly placed data transfers can result in overheads that completely dwarf the benefits of using accelerators. To know what data to transfer, and when, the programmer must understand the data flow of the transferred memory locations throughout the program and how the accelerator region fits into the program as a whole. We argue that compilers should take on the responsibility of data transfer scheduling, thereby reducing the demands on the programmer and improving program performance and efficiency by reducing the number of bytes transferred. We show that by performing whole-program transfer scheduling on accelerator data transfers, we are able to automatically eliminate up to 99% of the bytes transferred to and from the accelerator, compared to transferring all data involved immediately before and after kernel execution. The analysis and optimization are language- and accelerator-agnostic, but for our examples and testing they have been implemented in an OpenMP to LLVM-IR to CUDA workflow.
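
To make the comparison in the abstract concrete, the following is a minimal byte-counting sketch, not the authors' analysis. It models a single buffer and a trace of accesses, and compares the baseline named in the abstract (copy in before and out after every kernel) with a schedule that moves the buffer only when the side about to use it lacks the latest copy. The names `naive_bytes` and `scheduled_bytes`, the trace encoding, and the assumption that every access both reads and writes the whole buffer are all illustrative choices made here.

```python
from typing import Sequence

HOST, DEVICE = "host", "kernel"  # hypothetical trace labels


def naive_bytes(trace: Sequence[str], size: int) -> int:
    """Baseline: copy the buffer in before and out after every kernel launch."""
    return sum(2 * size for op in trace if op == DEVICE)


def scheduled_bytes(trace: Sequence[str], size: int) -> int:
    """Transfer only when the side about to touch the buffer lacks the latest copy.

    Simplifying assumption: every access reads and writes the whole buffer,
    so exactly one side holds the valid copy at any time.
    """
    valid_on = HOST  # the program starts with the data in host memory
    moved = 0
    for op in trace:
        if op != valid_on:
            moved += size      # one transfer across the bus
            valid_on = op      # the valid copy now lives on the other side
    return moved


if __name__ == "__main__":
    size = 4096
    # Host initializes the data, 100 kernels run back to back, host reads the result.
    trace = [HOST] + [DEVICE] * 100 + [HOST]
    naive = naive_bytes(trace, size)
    smart = scheduled_bytes(trace, size)
    print(f"naive: {naive} bytes, scheduled: {smart} bytes")
    print(f"bytes eliminated: {100 * (naive - smart) / naive:.1f}%")
```

In this toy trace the host never touches the buffer between kernel launches, so all intermediate copies are redundant. When host and kernel accesses strictly alternate, both schedules move the same number of bytes, which is why the achievable reduction depends on the program's data flow.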

Metadata
Title
Compiler Optimization of Accelerator Data Transfers
Authors
Matthew B. Ashcraft
Alexander Lemon
David A. Penry
Quinn Snell
Publication date
30.12.2017
Publisher
Springer US
Published in
International Journal of Parallel Programming, Issue 1/2019
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-017-0549-3
