ABSTRACT
In this paper, we describe a runtime that automatically enhances the performance of applications running on heterogeneous platforms consisting of a multi-core CPU and a throughput-oriented many-core GPU. The CPU and GPU are connected by a non-coherent interconnect such as PCI-E and therefore do not share memory. Heterogeneous platforms available today, such as [9], are of this type. Our goal is to enable the programmer to seamlessly use such a system without rewriting the application and with minimal knowledge of the underlying architectural details. Assuming that applications perform function calls to computational kernels with available CPU and GPU implementations, our runtime achieves this goal by automatically scheduling the kernels and managing data placement. In particular, it intercepts function calls to well-known computational kernels and schedules them on the CPU or the GPU based on their argument size and location. To improve performance, it defers all data transfers between the CPU and the GPU until necessary. By managing data placement transparently to the programmer, it provides a unified memory view despite the underlying separate memory sub-systems.
We experimentally evaluate our runtime on a heterogeneous platform consisting of a 2.5GHz quad-core Xeon CPU and an NVIDIA C870 GPU. Given array sorting, parallel reduction, dense and sparse matrix operations, and ranking as computational kernels, we use our runtime to automatically retarget SSI [25], K-means [32] and two synthetic applications to the above platform with no code changes. We find that, in most cases, performance improves if the computation is moved to the data, and not vice versa. For instance, even if a particular instance of a kernel is slower on the GPU than on the CPU, the overall application may be faster if the kernel is scheduled on the GPU anyway, especially if the kernel's data is already located in GPU memory due to prior scheduling decisions. Our results show that data-aware CPU/GPU scheduling improves performance by up to 25% over the best data-agnostic scheduling on the same platform.
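The data-aware policy described above can be illustrated with a minimal sketch. The class and function names below (`Buffer`, `Runtime`, the `kernels` table) are hypothetical stand-ins, not the paper's actual API; the sketch assumes a cost model in which moving data across the interconnect costs time proportional to the argument size, while dispatching to a device that already holds the data costs nothing.

```python
# Hypothetical sketch of data-aware kernel scheduling with deferred transfers.
# All names here are illustrative; the paper's runtime intercepts real library
# calls, whereas this sketch dispatches through an explicit call() method.

class Buffer:
    """Tracks which device currently holds the authoritative copy of the data."""
    def __init__(self, data):
        self.data = data
        self.location = "cpu"   # data starts in host memory


class Runtime:
    def __init__(self, kernels):
        # kernels maps a kernel name to a dict of per-device implementations,
        # e.g. {"sort": {"cpu": cpu_sort, "gpu": gpu_sort}}.
        self.kernels = kernels

    def _transfer_cost(self, buf, device, nbytes):
        # Zero cost if the data is already resident on the target device;
        # otherwise proportional to the argument size (PCI-E transfer).
        return 0 if buf.location == device else nbytes

    def call(self, name, buf):
        nbytes = len(buf.data)  # stand-in for the real byte count
        # Data-aware policy: move the computation to the data. Ties favor
        # the CPU, so cold data stays on the host until a transfer pays off.
        device = min(("cpu", "gpu"),
                     key=lambda d: self._transfer_cost(buf, d, nbytes))
        # The transfer (modeled here as a location update) happens only now,
        # when a kernel actually needs the data on the chosen device.
        buf.location = device
        return self.kernels[name][device](buf.data)
```

A toy use: with `sorted` standing in for both implementations, `Runtime({"sort": {"cpu": sorted, "gpu": sorted}})` runs the first call on the CPU (where the data starts), but once a buffer's location is GPU-resident, subsequent kernels on that buffer stay on the GPU even if a single instance would be faster on the CPU, which is the effect the abstract describes.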
REFERENCES
- [1] K. Fatahalian et al., "Sequoia: Programming the memory hierarchy," in Proc. of the 2006 ACM/IEEE Conference on Supercomputing, Tampa, FL.
- [2] T. J. Knight et al., "Compilation for Explicitly Managed Memory Hierarchies," in Proc. of PPoPP 2007, San Jose, CA.
- [3] G. Diamos and S. Yalamanchili, "Harmony: an execution model and runtime for heterogeneous many core systems," in Proc. of HPDC 2008, New York, NY.
- [4] C. Luk, S. Hong and H. Kim, "Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping," in Proc. of MICRO 2009, New York, NY.
- [5] B. Saha et al., "Programming model for a heterogeneous x86 platform," in Proc. of PLDI 2009, Dublin, Ireland.
- [6] I. Gelado et al., "CUBA: An Architecture for Efficient CPU/Co-processor Data Communication," in Proc. of ICS'08, Island of Kos, Greece.
- [7] M. Becchi, S. Cadambi and S. T. Chakradhar, "Enabling Legacy Applications on Heterogeneous Platforms," in Proc. of HotPar 2010, Berkeley, CA, June 2010.
- [8] NVIDIA, CUDA documentation: http://www.nvidia.com/object/cuda_develop.html.
- [9] Supermicro, "Shattering the 1U Server Performance Record," white paper: http://www.supermicro.com/products/nfo/files/GPU/GPU_White_Paper.pdf.
- [10] AMD, AMD Stream SDK User Guide v 2.0, 2009.
- [11] Intel, Intel Threading Building Blocks 2.2: http://www.threadingbuildingblocks.org.
- [12] A. Ghuloum et al., "Future-Proof Data Parallel Algorithms and Software on Intel Multi-Core Architecture," Intel Technology Journal 11, 4, 333-348, Nov 2007.
- [13] D. Tarditi, S. Puri and J. Oglesby, "Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses," in Proc. of the 2006 ASPLOS, October 2006.
- [14] Intel RapidMind: http://software.intel.com/en-us/articles/rapidmind.
- [15] PeakStream, "PeakStream Stream Platform API C++ Programming Guide v 1.0," May 2007.
- [16] PGI, PGI Accelerator Compilers: http://www.pgroup.com/resources/accel.htm.
- [17] CAPS, HMPP Workbench: http://www.caps-entreprise.com/hmpp.html.
- [18] HPC Project, Par4All: http://www.par4all.org.
- [19] J. A. Stratton, S. S. Stone and W-m. W. Hwu, "MCUDA: An Efficient Implementation of CUDA Kernels for Multi-Core CPUs," in Proc. of the 2008 Workshop on Languages and Compilers for Parallel Computing, 2008.
- [20] G. Diamos et al., "GPUocelot: A Binary Translator Framework for GPGPU": http://code.google.com/p/gpuocelot.
- [21] S.-W. Liao et al., "Data and Computation Transformations for Brook Streaming Applications on Multiprocessors," in Proc. of the 4th Conference on CGO, March 2006.
- [22] A. Munshi, "OpenCL: Parallel Computing on the GPU and CPU," in ACM SIGGRAPH 2008.
- [23] K. O'Brien et al., "Supporting OpenMP on Cell," International Journal on Parallel Programming, 36, 289-311, 2008.
- [24] M. D. Linderman et al., "Merge: A Programming Model for Heterogeneous Multi-core Systems," in Proc. of the 2008 ASPLOS, March 2008.
- [25] B. Bai et al., "Learning to Rank with (a lot of) Word Features," in Special Issue: Learning to Rank for Information Retrieval, Information Retrieval, 2009.
- [26] Intel MKL: http://software.intel.com/en-us/intel-mkl.
- [27] NVIDIA, CUBLAS Library 1.0: http://developer.download.nvidia.com/compute/cuda/1_0/CUBLAS_Library_1.0.pdf.
- [28] T. Kosar, "A New Paradigm in Data Intensive Computing: Stork and the Data-Aware Schedulers," in Proc. of Challenges of Large Applications in Distributed Environments, 2006.
- [29] J. Bent et al., "Coordination of Data Movement with Computation Scheduling on a Cluster," in Proc. of Challenges of Large Applications in Distributed Environments, 2005.
- [30] G. Khanna, "A Data-Locality Aware Mapping and Scheduling Framework for Data-Intensive Computing," MS Thesis, Dept. of Computer Science and Engineering, The Ohio State University, 2008.
- [31] NVIDIA, "CUDA SDK Code Examples": http://www.nvidia.com/object/cuda_get.html.
- [32] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. of the Berkeley Symposium on Math. Stat. and Prob., pp. 281-297.
- [33] S. P. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory 28 (2): pp. 129-137.
- [34] C. Augonnet and R. Namyst, "A unified runtime system for heterogeneous multicore architectures," in Proc. of HPPC'08, Las Palmas de Gran Canaria, Spain, August 2008.
- [35] C. Augonnet et al., "StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures," in Proc. of the 15th International Euro-Par Conference, Delft, The Netherlands, August 2009.
- [36] The University of Florida Sparse Matrix Collection: http://www.cise.ufl.edu/research/sparse/matrices/.
Index Terms
- Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory