
2019 | OriginalPaper | Chapter

Enhancing the Programmability and Performance Portability of GPU Tensor Operations

Authors: Arya Mazaheri, Johannes Schulte, Matthew W. Moskewicz, Felix Wolf, Ali Jannesari

Published in: Euro-Par 2019: Parallel Processing

Publisher: Springer International Publishing


Abstract

Deep-learning models with convolutional networks are widely used for many artificial-intelligence tasks, thanks to the increasing adoption of high-throughput GPUs, even in mobile phones. CUDA and OpenCL are the two most widely used programming interfaces for accessing the computing power of GPUs. However, code portability remained a challenge until the introduction of the Vulkan API, and even with Vulkan, performance portability is not guaranteed. In this paper, we investigate the unique characteristics of CUDA, OpenCL, and Vulkan kernels and propose a method for abstracting away their syntactic differences. This abstraction yields a single-source kernel from which we generate code for each GPU programming interface. In addition, we expose auto-tuning parameters to further enhance performance portability. We implemented a selection of convolution operations, covering the core operations needed to deploy three common image-processing neural networks, and tuned them for NVIDIA, AMD, and ARM Mali GPUs. Our experiments show that we can generate deep-learning kernels for new platforms with minimal effort and achieve reasonable performance. In particular, our Vulkan backend provides performance competitive with vendor deep-learning libraries.
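
To make the single-source idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a kernel template with hypothetical placeholders (`KERNEL`, `GLOBAL_PTR`, `GLOBAL_ID`) that are replaced by backend-specific spellings, plus one hypothetical tuning parameter (`TILE`, the number of elements each work item processes). The `scale` kernel and the `generate` helper are invented for illustration. Only CUDA and OpenCL are shown; Vulkan compute kernels are typically written as GLSL shaders compiled to SPIR-V, which needs more than token substitution and is left out here.

```python
from string import Template

# Hypothetical per-backend spellings of the same concepts. These mappings are
# illustrative assumptions, not the placeholder set used in the paper.
BACKENDS = {
    "cuda": {
        "KERNEL": 'extern "C" __global__ void',
        "GLOBAL_PTR": "float *",
        "GLOBAL_ID": "(blockIdx.x * blockDim.x + threadIdx.x)",
    },
    "opencl": {
        "KERNEL": "__kernel void",
        "GLOBAL_PTR": "__global float *",
        "GLOBAL_ID": "get_global_id(0)",
    },
}

# Single-source kernel: each work item scales TILE consecutive elements.
# TILE stands in for an auto-tuning parameter fixed at generation time.
SINGLE_SOURCE = Template("""\
$KERNEL scale($GLOBAL_PTR data, float factor, unsigned n) {
    unsigned base = $GLOBAL_ID * $TILE;
    for (unsigned t = 0; t < $TILE; ++t) {
        unsigned i = base + t;
        if (i < n) data[i] *= factor;
    }
}
""")


def generate(backend: str, tile: int) -> str:
    """Return kernel source text for one backend and one tuning value."""
    substitutions = dict(BACKENDS[backend], TILE=str(tile))
    return SINGLE_SOURCE.substitute(substitutions)


if __name__ == "__main__":
    # Emit one variant per (backend, tile) pair; an auto-tuner would compile
    # and time each variant on the target GPU and keep the fastest one.
    for backend in BACKENDS:
        for tile in (1, 4):
            print(f"// backend={backend} TILE={tile}")
            print(generate(backend, tile))
```

In a sketch like this, the generated strings would be handed to the respective runtime compiler, and the tuning loop would select the `TILE` value per device rather than hard-coding it.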


Metadata
Title
Enhancing the Programmability and Performance Portability of GPU Tensor Operations
Authors
Arya Mazaheri
Johannes Schulte
Matthew W. Moskewicz
Felix Wolf
Ali Jannesari
Copyright Year
2019
DOI
https://doi.org/10.1007/978-3-030-29400-7_16
