ABSTRACT
In this paper, we describe a runtime that automatically enhances the performance of applications running on heterogeneous platforms consisting of a multi-core CPU and a throughput-oriented many-core GPU. The CPU and GPU are connected by a non-coherent interconnect such as PCI-E and therefore do not share memory. Heterogeneous platforms available today, such as [9], are of this type. Our goal is to enable the programmer to seamlessly use such a system without rewriting the application and with minimal knowledge of the underlying architectural details. Assuming that applications perform function calls to computational kernels with available CPU and GPU implementations, our runtime achieves this goal by automatically scheduling the kernels and managing data placement. In particular, it intercepts function calls to well-known computational kernels and schedules them on the CPU or the GPU based on their argument size and location. To improve performance, it defers all data transfers between the CPU and the GPU until necessary. By managing data placement transparently to the programmer, it provides a unified memory view despite the underlying separate memory sub-systems.
We experimentally evaluate our runtime on a heterogeneous platform consisting of a 2.5GHz quad-core Xeon CPU and an NVIDIA C870 GPU. Given array sorting, parallel reduction, dense and sparse matrix operations, and ranking as computational kernels, we use our runtime to automatically retarget SSI [25], K-means [32] and two synthetic applications to the above platform with no code changes. We find that, in most cases, performance improves if the computation is moved to the data, and not vice versa. For instance, even if a particular instance of a kernel is slower on the GPU than on the CPU, the overall application may be faster if the kernel is scheduled on the GPU anyway, especially if the kernel's data is already located in GPU memory due to prior scheduling decisions. Our results show that data-aware CPU/GPU scheduling improves performance by up to 25% over the best data-agnostic scheduling on the same platform.
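The data-aware policy described above can be illustrated with a minimal sketch. The class and function names below (`Buffer`, `Runtime`, the `kernels` table) are hypothetical stand-ins, not the paper's actual API; the sketch assumes a cost model in which moving data across the interconnect costs time proportional to the argument size, while dispatching to a device that already holds the data costs nothing.

```python
# Hypothetical sketch of data-aware kernel scheduling with deferred transfers.
# All names here are illustrative; the paper's runtime intercepts real library
# calls, whereas this sketch dispatches through an explicit call() method.

class Buffer:
    """Tracks which device currently holds the authoritative copy of the data."""
    def __init__(self, data):
        self.data = data
        self.location = "cpu"   # data starts in host memory


class Runtime:
    def __init__(self, kernels):
        # kernels maps a kernel name to a dict of per-device implementations,
        # e.g. {"sort": {"cpu": cpu_sort, "gpu": gpu_sort}}.
        self.kernels = kernels

    def _transfer_cost(self, buf, device, nbytes):
        # Zero cost if the data is already resident on the target device;
        # otherwise proportional to the argument size (PCI-E transfer).
        return 0 if buf.location == device else nbytes

    def call(self, name, buf):
        nbytes = len(buf.data)  # stand-in for the real byte count
        # Data-aware policy: move the computation to the data. Ties favor
        # the CPU, so cold data stays on the host until a transfer pays off.
        device = min(("cpu", "gpu"),
                     key=lambda d: self._transfer_cost(buf, d, nbytes))
        # The transfer (modeled here as a location update) happens only now,
        # when a kernel actually needs the data on the chosen device.
        buf.location = device
        return self.kernels[name][device](buf.data)
```

A toy use: with `sorted` standing in for both implementations, `Runtime({"sort": {"cpu": sorted, "gpu": sorted}})` runs the first call on the CPU (where the data starts), but once a buffer's location is GPU-resident, subsequent kernels on that buffer stay on the GPU even if a single instance would be faster on the CPU, which is the effect the abstract describes.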
REFERENCES
- [1] K. Fatahalian et al., "Sequoia: Programming the memory hierarchy," in Proc. of the 2006 ACM/IEEE Conference on Supercomputing, Tampa, FL.
- [2] T. J. Knight et al., "Compilation for Explicitly Managed Memory Hierarchies," in Proc. of PPoPP 2007, San Jose, CA.
- [3] G. Diamos and S. Yalamanchili, "Harmony: an execution model and runtime for heterogeneous many core systems," in Proc. of HPDC 2008, New York, NY.
- [4] C. Luk, S. Hong and H. Kim, "Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping," in Proc. of MICRO 2009, New York, NY.
- [5] B. Saha et al., "Programming model for a heterogeneous x86 platform," in Proc. of PLDI 2009, Dublin, Ireland.
- [6] I. Gelado et al., "CUBA: An Architecture for Efficient CPU/Co-processor Data Communication," in Proc. of ICS'08, Island of Kos, Greece.
- [7] M. Becchi, S. Cadambi and S. T. Chakradhar, "Enabling Legacy Applications on Heterogeneous Platforms," in Proc. of HotPar 2010, Berkeley, CA, June 2010.
- [8] NVIDIA, CUDA documentation: http://www.nvidia.com/object/cuda_develop.html.
- [9] Supermicro, "Shattering the 1U Server Performance Record," white paper: http://www.supermicro.com/products/nfo/files/GPU/GPU_White_Paper.pdf.
- [10] AMD, AMD Stream SDK User Guide v 2.0, 2009.
- [11] Intel, Intel Threading Building Blocks 2.2: http://www.threadingbuildingblocks.org.
- [12] A. Ghuloum et al., "Future-Proof Data Parallel Algorithms and Software on Intel Multi-Core Architecture," Intel Technology Journal 11, 4, 333-348, Nov 2007.
- [13] D. Tarditi, S. Puri and J. Oglesby, "Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses," in Proc. of the 2006 ASPLOS, October 2006.
- [14] Intel RapidMind: http://software.intel.com/en-us/articles/rapidmind.
- [15] PeakStream, "PeakStream Stream Platform API C++ Programming Guide v 1.0," May 2007.
- [16] PGI, PGI Accelerator Compilers: http://www.pgroup.com/resources/accel.htm.
- [17] CAPS, HMPP Workbench: http://www.caps-entreprise.com/hmpp.html.
- [18] HPC Project, Par4All: http://www.par4all.org.
- [19] J. A. Stratton, S. S. Stone and W-m. W. Hwu, "MCUDA: An Efficient Implementation of CUDA Kernels for Multi-Core CPUs," in Proc. of the 2008 Workshop on Languages and Compilers for Parallel Computing, 2008.
- [20] G. Diamos et al., "GPUocelot: A Binary Translator Framework for GPGPU": http://code.google.com/p/gpuocelot.
- [21] S.-W. Liao et al., "Data and Computation Transformations for Brook Streaming Applications on Multiprocessors," in Proc. of the 4th Conference on CGO, March 2006.
- [22] A. Munshi, "OpenCL: Parallel Computing on the GPU and CPU," in ACM SIGGRAPH 2008.
- [23] K. O'Brien et al., "Supporting OpenMP on Cell," International Journal on Parallel Programming, 36, 289-311, 2008.
- [24] M. D. Linderman et al., "Merge: A Programming Model for Heterogeneous Multi-core Systems," in Proc. of the 2008 ASPLOS, March 2008.
- [25] B. Bai et al., "Learning to Rank with (a lot of) Word Features," in Special Issue: Learning to Rank for Information Retrieval, Information Retrieval, 2009.
- [26] Intel MKL: http://software.intel.com/en-us/intel-mkl.
- [27] NVIDIA, CUBLAS Library 1.0: http://developer.download.nvidia.com/compute/cuda/1_0/CUBLAS_Library_1.0.pdf.
- [28] T. Kosar, "A New Paradigm in Data Intensive Computing: Stork and the Data-Aware Schedulers," in Proc. of Challenges of Large Applications in Distributed Environments, 2006.
- [29] J. Bent et al., "Coordination of Data Movement with Computation Scheduling on a Cluster," in Proc. of Challenges of Large Applications in Distributed Environments, 2005.
- [30] G. Khanna, "A Data-Locality Aware Mapping and Scheduling Framework for Data-Intensive Computing," MS Thesis, Dept. of Computer Science and Engineering, The Ohio State University, 2008.
- [31] NVIDIA, "CUDA SDK Code Examples": http://www.nvidia.com/object/cuda_get.html.
- [32] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. of the Berkeley Symposium on Math. Stat. and Prob., pp. 281-297.
- [33] S. P. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory 28 (2): pp. 129-137.
- [34] C. Augonnet and R. Namyst, "A unified runtime system for heterogeneous multicore architectures," in Proc. of HPPC'08, Las Palmas de Gran Canaria, Spain, August 2008.
- [35] C. Augonnet et al., "StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures," in Proc. of the 15th International Euro-Par Conference, Delft, The Netherlands, August 2009.
- [36] The University of Florida Sparse Matrix Collection: http://www.cise.ufl.edu/research/sparse/matrices/.
Index Terms
- Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory