ABSTRACT
In this paper, we propose an OpenCL framework that combines multiple GPUs and treats them as a single compute device. Providing a single virtual compute device image to the user makes an OpenCL application written for a single GPU portable to a platform that has multiple GPU devices. It also lets the application exploit the full computing power of the multiple GPU devices and the total amount of GPU memory available in the platform. Our OpenCL framework automatically distributes, at run time, the OpenCL kernel written for a single GPU into multiple CUDA kernels that execute on the multiple GPU devices. It applies a run-time memory access range analysis to the kernel by performing a sampling run and identifies an optimal workload distribution for the kernel. To achieve a single compute device image, the runtime maintains virtual device memory that is allocated in the main memory. The OpenCL runtime treats this memory as if it were the memory of a single GPU device and keeps it consistent with the memories of the multiple GPU devices. Our OpenCL-C-to-C translator generates the sampling code from the OpenCL kernel code, and our OpenCL-C-to-CUDA-C translator generates the CUDA kernel code for the distributed OpenCL kernel. We show the effectiveness of our OpenCL framework by implementing the OpenCL runtime and the two source-to-source translators. We evaluate its performance on a system that contains 8 GPUs using 11 OpenCL benchmark applications.