2019 | Original Paper | Book Chapter

Enhancing the Programmability and Performance Portability of GPU Tensor Operations

Authors: Arya Mazaheri, Johannes Schulte, Matthew W. Moskewicz, Felix Wolf, Ali Jannesari

Published in: Euro-Par 2019: Parallel Processing

Publisher: Springer International Publishing

Abstract

Deep-learning models with convolutional networks are widely used for many artificial-intelligence tasks, thanks to the increasing adoption of high-throughput GPUs, even in mobile phones. CUDA and OpenCL are the two most widely used programming interfaces for accessing the computing power of GPUs. However, code portability remained a challenge until the introduction of the Vulkan API, and even with Vulkan, performance portability is not guaranteed. In this paper, we investigate the unique characteristics of CUDA, OpenCL, and Vulkan kernels and propose a method for abstracting away their syntactic differences. This abstraction yields a single-source kernel from which we generate code for each GPU programming interface. In addition, we expose auto-tuning parameters to further enhance performance portability. We implemented a selection of convolution operations, covering the core operations needed for deploying three common image-processing neural networks, and tuned them for NVIDIA, AMD, and ARM Mali GPUs. Our experiments show that we can generate deep-learning kernels for new platforms with minimal effort and achieve reasonable performance. Specifically, our Vulkan backend provides performance competitive with vendor deep-learning libraries.
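The idea of a single-source kernel with exposed tuning parameters can be illustrated with a minimal sketch. This is not the authors' implementation: the placeholder syntax `%(NAME)`, the token names, the `render` function, and the `STRIDE` parameter are all hypothetical, chosen only for illustration. The backend spellings themselves (for example `threadIdx.x`, `get_local_id(0)`, `gl_LocalInvocationID.x`) are the standard built-ins of CUDA, OpenCL C, and GLSL compute shaders as used with Vulkan.

```python
# Illustrative sketch only: the token names and the %(NAME) placeholder
# syntax are assumptions, not the notation used in the paper.
BACKEND_TOKENS = {
    "cuda": {
        "GLOBAL_ID_X": "(blockIdx.x * blockDim.x + threadIdx.x)",
        "LOCAL_ID_X": "threadIdx.x",
        "LOCAL_MEM": "__shared__",
        "BARRIER": "__syncthreads()",
    },
    "opencl": {
        "GLOBAL_ID_X": "get_global_id(0)",
        "LOCAL_ID_X": "get_local_id(0)",
        "LOCAL_MEM": "__local",
        "BARRIER": "barrier(CLK_LOCAL_MEM_FENCE)",
    },
    "vulkan": {  # GLSL compute shader, compiled to SPIR-V for Vulkan
        "GLOBAL_ID_X": "gl_GlobalInvocationID.x",
        "LOCAL_ID_X": "gl_LocalInvocationID.x",
        "LOCAL_MEM": "shared",
        "BARRIER": "barrier()",
    },
}


def render(template, backend, tuning):
    """Substitute backend-specific syntax and tuning values into a template."""
    substitutions = dict(BACKEND_TOKENS[backend])
    # Tuning parameters are plain values that an auto-tuner could vary.
    substitutions.update({name: str(value) for name, value in tuning.items()})
    rendered = template
    for name, text in substitutions.items():
        rendered = rendered.replace("%(" + name + ")", text)
    return rendered


# A body fragment only; kernel signatures and buffer declarations also
# differ per backend and are left out of this sketch.
TEMPLATE = "tile[%(LOCAL_ID_X)] = src[%(GLOBAL_ID_X) * %(STRIDE)]; %(BARRIER);"

if __name__ == "__main__":
    for backend in ("cuda", "opencl", "vulkan"):
        print(backend, "->", render(TEMPLATE, backend, {"STRIDE": 4}))
```

An auto-tuner in this style would render the same template with several candidate values of `STRIDE`, compile and time each variant on the target GPU, and keep the fastest one.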

Metadata
Title
Enhancing the Programmability and Performance Portability of GPU Tensor Operations
Authors
Arya Mazaheri
Johannes Schulte
Matthew W. Moskewicz
Felix Wolf
Ali Jannesari
Copyright Year
2019
2019
DOI
https://doi.org/10.1007/978-3-030-29400-7_16