DOI: 10.1145/1941553.1941591

Achieving a single compute device image in OpenCL for multiple GPUs

Published: 12 February 2011

ABSTRACT

In this paper, we propose an OpenCL framework that combines multiple GPUs and treats them as a single compute device. Providing a single virtual compute device image to the user makes an OpenCL application written for a single GPU portable to a platform that has multiple GPU devices. It also lets the application exploit the full computing power of the multiple GPU devices and the total amount of GPU memory available in the platform. Our OpenCL framework automatically distributes the OpenCL kernel written for a single GPU at run time into multiple CUDA kernels that execute on the multiple GPU devices. It applies a run-time memory access range analysis to the kernel by performing a sampling run and identifies an optimal workload distribution for the kernel. To achieve a single compute device image, the runtime maintains a virtual device memory allocated in the main memory. The OpenCL runtime treats this memory as if it were the memory of a single GPU device and keeps it consistent with the memories of the multiple GPU devices. Our OpenCL-C-to-C translator generates the sampling code from the OpenCL kernel code, and our OpenCL-C-to-CUDA-C translator generates the CUDA kernel code for the distributed OpenCL kernel. We show the effectiveness of our OpenCL framework by implementing the OpenCL runtime and the two source-to-source translators. We evaluate its performance on a system that contains eight GPUs using 11 OpenCL benchmark applications.
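To make the division of labor concrete, below is a minimal sketch, not the authors' implementation, of the two steps the abstract describes: splitting a one-dimensional NDRange evenly across several GPUs, and using an assumed affine memory access pattern, of the kind a sampling run would discover, to bound the buffer slice each device must hold. The GPU count, struct names, and the linear buf[gid] access pattern are all illustrative assumptions.

/* Minimal sketch (illustrative, not the paper's implementation):
 * partition a 1-D NDRange over NUM_GPUS devices and derive, from an
 * assumed affine access pattern, the buffer slice each device needs. */
#include <stdio.h>
#include <stddef.h>

#define NUM_GPUS 4

typedef struct {
    size_t wg_begin;  /* first work-group assigned to this GPU  */
    size_t wg_end;    /* one past the last assigned work-group  */
    size_t buf_lo;    /* lowest byte offset the slice touches   */
    size_t buf_hi;    /* one past the highest byte it touches   */
} slice_t;

/* Split num_groups work-groups evenly over the GPUs, giving the
 * remainder to the first few devices (a uniform distribution; the
 * paper's runtime instead derives the split from its sampling run). */
static void partition(size_t num_groups, slice_t s[NUM_GPUS])
{
    size_t base = num_groups / NUM_GPUS;
    size_t rem  = num_groups % NUM_GPUS;
    size_t at   = 0;
    for (int d = 0; d < NUM_GPUS; d++) {
        size_t n = base + ((size_t)d < rem ? 1 : 0);
        s[d].wg_begin = at;
        s[d].wg_end   = at + n;
        at += n;
    }
}

/* Stand-in for the sampling run: assume each work-item with global id
 * gid accesses buf[gid] (4-byte elements, local_size items per group),
 * so the accessed range of a slice is affine in its group bounds. */
static void access_range(slice_t *s, size_t local_size)
{
    s->buf_lo = s->wg_begin * local_size * 4;
    s->buf_hi = s->wg_end   * local_size * 4;
}

int main(void)
{
    size_t num_groups = 1000, local_size = 256;
    slice_t s[NUM_GPUS];

    partition(num_groups, s);
    for (int d = 0; d < NUM_GPUS; d++) {
        access_range(&s[d], local_size);
        printf("GPU %d: work-groups [%zu, %zu), buffer bytes [%zu, %zu)\n",
               d, s[d].wg_begin, s[d].wg_end, s[d].buf_lo, s[d].buf_hi);
    }
    return 0;
}

With per-device byte ranges like these in hand, the runtime need only copy each device's slice of the virtual device memory to that GPU before launch and merge written ranges back afterwards, which is what keeps the host-side copy consistent with the device memories.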


Published in

PPoPP '11: Proceedings of the 16th ACM symposium on Principles and practice of parallel programming, February 2011, 326 pages. ISBN: 9781450301190. DOI: 10.1145/1941553. General Chair: Calin Cascaval; Program Chair: Pen-Chung Yew.

Also in ACM SIGPLAN Notices, Volume 46, Issue 8 (PPoPP '11), August 2011, 300 pages. ISSN: 0362-1340; EISSN: 1558-1160. DOI: 10.1145/2038037.

            Copyright © 2011 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

            Publisher

            Association for Computing Machinery

            New York, NY, United States



            Qualifiers

            • research-article

Acceptance Rates

Overall acceptance rate: 230 of 1,014 submissions, 23%
