ABSTRACT
A trend that has materialized, and has attracted considerable attention, is the increasing heterogeneity of computing platforms. It is now very common for a desktop or notebook computer to be equipped with both a multi-core CPU and a GPU. Exploiting the full computational power of such architectures (i.e., using both the multi-core CPU and the GPU simultaneously) from a high-level API is a critical challenge. We believe it would be highly desirable to give programmers a simple way to realize the full potential of today's heterogeneous machines.
This paper describes a compiler and runtime framework that maps a class of applications, namely those characterized by generalized reductions, to a system with a multi-core CPU and a GPU. Starting from simple C functions with added annotations, we automatically generate middleware API code for the multi-core CPU as well as CUDA code for the GPU, so that both can be exploited simultaneously. The runtime system provides efficient schemes for dynamically partitioning the work between the CPU cores and the GPU. Our experimental results from two applications, namely k-means clustering and Principal Component Analysis (PCA), show that, by effectively harnessing the heterogeneous architecture, we can achieve significantly higher performance than using only the GPU or only the multi-core CPU. For k-means, the heterogeneous version with 8 CPU cores and a GPU achieved a speedup of about 32.09x relative to the 1-thread CPU version, and a performance gain of about 60% over the faster of the CPU-only and GPU-only executions. For PCA, the heterogeneous version attained a speedup of 10.4x relative to the 1-thread CPU version, and a performance gain of about 63.8% over the faster of the CPU-only and GPU-only versions.
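The paper's annotation syntax and runtime API are not reproduced on this page, so the following is a minimal sketch, in plain C, of the kind of generalized reduction the framework targets. The pragma shown in the comment is a hypothetical illustration of "simple C functions with added annotations", not the framework's actual notation. The k-means assignment step updates a reduction object (per-cluster sums and counts) through associative, commutative accumulation, which is what allows the iterations over points to be split freely between CPU threads and the GPU:

```c
#include <float.h>
#include <stddef.h>

/* Hypothetical annotation marking the loop over points as a generalized
   reduction with reduction object {sums, counts}:
   #pragma reduction(points_loop) object(sums, counts)               */
void kmeans_assign(const float *points, size_t n, size_t dim,
                   const float *centroids, size_t k,
                   float *sums /* k * dim */, unsigned *counts /* k */)
{
    for (size_t i = 0; i < n; i++) {            /* points_loop: partitionable */
        const float *p = &points[i * dim];
        size_t best = 0;
        float best_d = FLT_MAX;
        for (size_t c = 0; c < k; c++) {        /* find nearest centroid */
            float d = 0.0f;
            for (size_t j = 0; j < dim; j++) {
                float t = p[j] - centroids[c * dim + j];
                d += t * t;
            }
            if (d < best_d) { best_d = d; best = c; }
        }
        /* Reduction update: associative and commutative, so the order
           in which points are processed does not matter. */
        for (size_t j = 0; j < dim; j++)
            sums[best * dim + j] += p[j];
        counts[best] += 1;
    }
}
```

The dynamic partitioning schemes themselves are likewise not detailed in the abstract; one common realization, sketched here under that assumption, is a shared atomic cursor over the iteration space from which both CPU worker threads and the thread driving the GPU claim chunks, with the GPU side typically claiming larger chunks to amortize data-transfer and kernel-launch overhead:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static atomic_size_t cursor;   /* next unclaimed iteration index */

/* Claim the half-open range [*start, *end) out of an n-iteration space;
   returns false once the space is exhausted. Names and chunk-size policy
   are illustrative assumptions, not the paper's runtime API. */
bool claim_chunk(size_t n, size_t chunk, size_t *start, size_t *end)
{
    size_t s = atomic_fetch_add(&cursor, chunk);
    if (s >= n) return false;
    *start = s;
    *end = (s + chunk < n) ? s + chunk : n;
    return true;
}
```

In such a scheme, each worker would run the reduction over its claimed range into a private copy of the reduction object, and the runtime would merge the per-worker copies (including the GPU's results) once the iteration space is exhausted.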