Fluidic Kernels: Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices

Authors:
Prasanna Pandit, Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore, India
R. Govindarajan, Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore, India

CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, February 2014, Pages 273–283, https://doi.org/10.1145/2544137.2544163

Published: 16 October 2018

ABSTRACT

Programming heterogeneous computing systems with Graphics Processing Units (GPU) and multi-core CPUs in them is complex and time-consuming. OpenCL has emerged as an attractive programming framework for heterogeneous systems. But utilizing multiple devices in OpenCL is a challenge because it requires the programmer to explicitly map data and computation to each device. The problem becomes even more complex if the same OpenCL kernel has to be executed synergistically using multiple devices, as the relative execution time of the kernel on different devices can vary significantly, making it difficult to determine the work partitioning across these devices a priori. Also, after each kernel execution, a coherent version of the data needs to be established.

In this work, we present FluidiCL, an OpenCL runtime that takes a program written for a single device and uses both the CPU and the GPU to execute it. Since we consider a setup with devices having discrete address spaces, our solution ensures that execution of OpenCL work-groups on devices is adjusted by taking into account the overheads for data management. The data transfers and data merging needed to ensure coherence are handled transparently without requiring any effort from the programmer. FluidiCL also does not require prior training or profiling and is completely portable across different machines. Across a set of diverse benchmarks having multiple kernels, our runtime shows a geomean speedup of nearly 64% over a high-end GPU and 88% over a 4-core CPU. In all benchmarks, performance of our runtime comes to within 13% of the best of the two devices.
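
As a rough illustration of the dynamic partitioning idea described above, the following Python sketch simulates two devices that claim work-groups from opposite ends of a kernel's index range until the two ranges meet, and then merges their output buffers by comparing each against the original data. It is a minimal sketch under stated assumptions, not FluidiCL's actual implementation: the function names `cooperative_run` and `merge_outputs`, the per-work-group costs, the chunk size, the opposite-ends claiming order, and the diff-based merge rule are all hypothetical choices made for illustration.

```python
def cooperative_run(num_groups, gpu_cost, cpu_cost, cpu_chunk=2):
    """Simulate two devices consuming work-groups from opposite ends.

    gpu_cost and cpu_cost are hypothetical time units per work-group.
    Returns (gpu_groups, cpu_groups, finish_time).
    """
    lo, hi = 0, num_groups            # unclaimed work-groups are [lo, hi)
    gpu_time = cpu_time = 0.0         # simulated clock of each device
    gpu_groups, cpu_groups = [], []
    while lo < hi:
        # Whichever device becomes free first claims the next piece of work,
        # so the split adapts to the relative speed of the two devices.
        if gpu_time <= cpu_time:
            gpu_groups.append(lo)     # GPU takes one group from the low end
            lo += 1
            gpu_time += gpu_cost
        else:
            take = min(cpu_chunk, hi - lo)
            # CPU takes a chunk of groups from the high end
            cpu_groups.extend(range(hi - take, hi))
            hi -= take
            cpu_time += take * cpu_cost
    return gpu_groups, cpu_groups, max(gpu_time, cpu_time)


def merge_outputs(original, gpu_out, cpu_out):
    """Merge two device copies of a buffer into one coherent result.

    Assumption: an element that differs from `original` in the CPU copy was
    written by the CPU; every other element is taken from the GPU copy.
    """
    return [c if c != o else g for o, g, c in zip(original, gpu_out, cpu_out)]


if __name__ == "__main__":
    gpu_groups, cpu_groups, finish = cooperative_run(
        12, gpu_cost=1.0, cpu_cost=3.0, cpu_chunk=1
    )
    print("GPU work-groups:", gpu_groups)
    print("CPU work-groups:", cpu_groups)
    print("finish time:", finish)
    print(merge_outputs([0, 0, 0, 0], [1, 2, 0, 0], [0, 0, 3, 4]))
```

With these hypothetical costs, the simulated GPU ends up with nine work-groups and the CPU with three, and the combined run finishes in 9 time units compared with 12 for the GPU alone. No split ratio is chosen in advance; it emerges from which device is free at each step, which is the property the abstract highlights when it says no prior training or profiling is needed.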

Published in

CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
February 2014
328 pages
ISBN: 9781450326704
DOI: 10.1145/2581122
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
FluidiCL
GPGPU
Heterogeneous Devices
OpenCL
Runtime
Qualifiers
- tutorial
- Refereed limited
Acceptance Rates
CGO '14 Paper Acceptance Rate: 29 of 100 submissions, 29%
Overall Acceptance Rate: 312 of 1,061 submissions, 29%
Article Metrics
- Total Citations: 36
- Total Downloads: 44
- Downloads (Last 12 months): 2
- Downloads (Last 6 weeks): 1