DOI: 10.1145/1941553.1941591

Achieving a single compute device image in OpenCL for multiple GPUs

Published: 12 February 2011

ABSTRACT

In this paper, we propose an OpenCL framework that combines multiple GPUs and treats them as a single compute device. Providing a single virtual compute device image to the user makes an OpenCL application written for a single GPU portable to a platform that has multiple GPU devices. It also lets the application exploit the full computing power of the multiple GPU devices and the total amount of GPU memory available in the platform. Our OpenCL framework automatically distributes the OpenCL kernel written for a single GPU at run time into multiple CUDA kernels that execute on the multiple GPU devices. It applies a run-time memory access range analysis to the kernel by performing a sampling run and identifies an optimal workload distribution for the kernel. To achieve a single compute device image, the runtime maintains a virtual device memory allocated in the main memory. The OpenCL runtime treats this memory as if it were the memory of a single GPU device and keeps it consistent with the memories of the multiple GPU devices. Our OpenCL-C-to-C translator generates the sampling code from the OpenCL kernel code, and our OpenCL-C-to-CUDA-C translator generates the CUDA kernel code for the distributed OpenCL kernel. We show the effectiveness of our OpenCL framework by implementing the OpenCL runtime and the two source-to-source translators. We evaluate its performance on a system that contains eight GPUs using 11 OpenCL benchmark applications.
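To make the division of labor concrete, below is a minimal sketch, not the authors' implementation, of the two steps the abstract describes: splitting a one-dimensional NDRange evenly across several GPUs, and using an assumed affine memory access pattern, of the kind a sampling run would discover, to bound the buffer slice each device must hold. The GPU count, struct names, and the linear buf[gid] access pattern are all illustrative assumptions.

/* Minimal sketch (illustrative, not the paper's implementation):
 * partition a 1-D NDRange over NUM_GPUS devices and derive, from an
 * assumed affine access pattern, the buffer slice each device needs. */
#include <stdio.h>
#include <stddef.h>

#define NUM_GPUS 4

typedef struct {
    size_t wg_begin;  /* first work-group assigned to this GPU  */
    size_t wg_end;    /* one past the last assigned work-group  */
    size_t buf_lo;    /* lowest byte offset the slice touches   */
    size_t buf_hi;    /* one past the highest byte it touches   */
} slice_t;

/* Split num_groups work-groups evenly over the GPUs, giving the
 * remainder to the first few devices (a uniform distribution; the
 * paper's runtime instead derives the split from its sampling run). */
static void partition(size_t num_groups, slice_t s[NUM_GPUS])
{
    size_t base = num_groups / NUM_GPUS;
    size_t rem  = num_groups % NUM_GPUS;
    size_t at   = 0;
    for (int d = 0; d < NUM_GPUS; d++) {
        size_t n = base + ((size_t)d < rem ? 1 : 0);
        s[d].wg_begin = at;
        s[d].wg_end   = at + n;
        at += n;
    }
}

/* Stand-in for the sampling run: assume each work-item with global id
 * gid accesses buf[gid] (4-byte elements, local_size items per group),
 * so the accessed range of a slice is affine in its group bounds. */
static void access_range(slice_t *s, size_t local_size)
{
    s->buf_lo = s->wg_begin * local_size * 4;
    s->buf_hi = s->wg_end   * local_size * 4;
}

int main(void)
{
    size_t num_groups = 1000, local_size = 256;
    slice_t s[NUM_GPUS];

    partition(num_groups, s);
    for (int d = 0; d < NUM_GPUS; d++) {
        access_range(&s[d], local_size);
        printf("GPU %d: work-groups [%zu, %zu), buffer bytes [%zu, %zu)\n",
               d, s[d].wg_begin, s[d].wg_end, s[d].buf_lo, s[d].buf_hi);
    }
    return 0;
}

With per-device byte ranges like these in hand, the runtime need only copy each device's slice of the virtual device memory to that GPU before launch and merge written ranges back afterwards, which is what keeps the host-side copy consistent with the device memories.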


Published in

PPoPP '11: Proceedings of the 16th ACM symposium on Principles and practice of parallel programming, February 2011, 326 pages. ISBN: 9781450301190. DOI: 10.1145/1941553. General Chair: Calin Cascaval; Program Chair: Pen-Chung Yew.

Also in ACM SIGPLAN Notices, Volume 46, Issue 8 (PPoPP '11), August 2011, 300 pages. ISSN: 0362-1340; EISSN: 1558-1160. DOI: 10.1145/2038037.

            Copyright © 2011 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

            Publisher

            Association for Computing Machinery

            New York, NY, United States



            Qualifiers

            • research-article

Acceptance Rates

Overall acceptance rate: 230 of 1,014 submissions, 23%
