Research article · DOI: 10.1145/1810085.1810106

Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations

Published: 02 June 2010

ABSTRACT

A trend that has materialized, and has attracted much attention, is the increasing heterogeneity of computing platforms. It is now very common for a desktop or notebook computer to come equipped with both a multi-core CPU and a GPU. Harnessing the full computational power of such architectures, i.e., simultaneously exploiting both the multi-core CPU and the GPU, starting from a high-level API is a critical challenge. We believe it is highly desirable to offer programmers a simple way to realize the full potential of today's heterogeneous machines.

This paper describes a compiler and runtime framework that can map a class of applications, namely those characterized by generalized reductions, to a system with a multi-core CPU and a GPU. Starting from simple C functions with added annotations, we automatically generate the middleware API code for the multi-core CPU, as well as CUDA code to simultaneously exploit the GPU. The runtime system provides efficient schemes for dynamically partitioning the work between the CPU cores and the GPU. Our experimental results from two applications, k-means clustering and Principal Component Analysis (PCA), show that, by effectively harnessing the heterogeneous architecture, we can achieve significantly higher performance than with the GPU or the multi-core CPU alone. For k-means, the heterogeneous version with 8 CPU cores and a GPU achieved a speedup of about 32.09x relative to the 1-thread CPU version, and a performance gain of about 60% over the faster of the CPU-only and GPU-only executions. For PCA, the heterogeneous version attained a speedup of 10.4x relative to the 1-thread CPU version, and a performance gain of about 63.8% over the faster of the CPU-only and GPU-only versions.
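The abstract's starting point, simple C functions with added annotations, can be made concrete with a small sketch. The following hypothetical C fragment shows what the paper's k-means example might look like as a generalized reduction: a loop whose only side effect is an associative, commutative update of a reduction object, which is what allows a runtime to split the iteration range between CPU cores and the GPU and merge the partial results afterwards. The annotation syntax (#pragma reduction_loop), the structure layout, and the function names are illustrative assumptions, not the framework's actual API.

/* Hypothetical sketch -- not the paper's actual API. */
#include <float.h>
#include <stddef.h>

#define K   8    /* number of clusters   */
#define DIM 4    /* point dimensionality */

typedef struct {
    double sum[K][DIM];   /* per-cluster coordinate sums */
    size_t count[K];      /* per-cluster point counts    */
} reduction_object;

/* One k-means pass: assign each point to its nearest centroid and
   accumulate into the reduction object. Updates to *robj are
   associative and commutative, so iterations can be partitioned
   freely between CPU threads and the GPU. */
void kmeans_assign(const double (*points)[DIM], size_t n,
                   const double (*centroids)[DIM],
                   reduction_object *robj)
{
    /* hypothetical annotation marking a generalized reduction loop */
    /* #pragma reduction_loop object(robj) */
    for (size_t i = 0; i < n; i++) {
        size_t best = 0;
        double best_dist = DBL_MAX;
        for (size_t c = 0; c < K; c++) {
            double d = 0.0;
            for (size_t j = 0; j < DIM; j++) {
                double diff = points[i][j] - centroids[c][j];
                d += diff * diff;
            }
            if (d < best_dist) { best_dist = d; best = c; }
        }
        for (size_t j = 0; j < DIM; j++)
            robj->sum[best][j] += points[i][j];
        robj->count[best]++;
    }
}

/* Merging two partial reduction objects -- e.g., one from the CPU
   cores and one copied back from the GPU -- is itself a reduction. */
void merge(reduction_object *dst, const reduction_object *src)
{
    for (size_t c = 0; c < K; c++) {
        for (size_t j = 0; j < DIM; j++)
            dst->sum[c][j] += src->sum[c][j];
        dst->count[c] += src->count[c];
    }
}

Under this structure, a runtime in the spirit of the one described could hand contiguous chunks of the index range [0, n) to CPU worker threads and to a generated CUDA kernel, then combine the per-worker reduction objects with merge before recomputing the centroids.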


Published in

ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing
June 2010, 365 pages
ISBN: 9781450300186
DOI: 10.1145/1810085

Copyright © 2010 ACM

Publisher: Association for Computing Machinery, New York, NY, United States




Acceptance Rates

Overall acceptance rate: 584 of 2,055 submissions, 28%
