DOI: 10.1145/1810479.1810498

research-article

Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory

Published: 13 June 2010

ABSTRACT

In this paper, we describe a runtime to automatically enhance the performance of applications running on heterogeneous platforms consisting of a multi-core (CPU) and a throughput-oriented many-core (GPU). The CPU and GPU are connected by a non-coherent interconnect such as PCI-E, and as such do not have shared memory. Heterogeneous platforms available today such as [9] are of this type. Our goal is to enable the programmer to seamlessly use such a system without rewriting the application and with minimal knowledge of the underlying architectural details. Assuming that applications perform function calls to computational kernels with available CPU and GPU implementations, our runtime achieves this goal by automatically scheduling the kernels and managing data placement. In particular, it intercepts function calls to well-known computational kernels and schedules them on CPU or GPU based on their argument size and location. To improve performance, it defers all data transfers between the CPU and the GPU until necessary. By managing data placement transparently to the programmer, it provides a unified memory view despite the underlying separate memory sub-systems.
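The scheduling idea sketched above, choosing a device from the sizes and current locations of a kernel's arguments, and deferring transfers until the choice is made, can be illustrated with a minimal sketch. This is not the paper's implementation; the names (`Buffer`, `schedule_kernel`) and the tie-breaking size threshold are illustrative assumptions.

```python
class Buffer:
    """Tracks where one kernel argument's data currently resides."""
    def __init__(self, size_bytes, location="CPU"):
        self.size = size_bytes
        self.location = location  # "CPU" or "GPU"

def schedule_kernel(args, gpu_threshold=1 << 20):
    """Pick a device for a kernel call from argument size and location.

    Prefer the device that already holds more of the data; on a tie,
    fall back to a simple size heuristic (large inputs go to the GPU).
    Returns the chosen device and how many arguments had to be moved.
    """
    gpu_bytes = sum(a.size for a in args if a.location == "GPU")
    cpu_bytes = sum(a.size for a in args if a.location == "CPU")
    if gpu_bytes > cpu_bytes:
        device = "GPU"
    elif cpu_bytes > gpu_bytes:
        device = "CPU"
    else:
        device = "GPU" if (gpu_bytes + cpu_bytes) >= gpu_threshold else "CPU"
    # Transfers are deferred to this point: only arguments the chosen
    # device lacks are copied, just before the kernel runs.
    moved = [a for a in args if a.location != device]
    for a in moved:
        a.location = device
    return device, len(moved)
```

A call with one large GPU-resident array and one small CPU-resident array would run on the GPU and move only the small argument, mirroring the "move computation to the data" behavior described above.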

We experimentally evaluate our runtime on a heterogeneous platform consisting of a 2.5GHz quad-core Xeon CPU and an NVIDIA C870 GPU. Given array sorting, parallel reduction, dense and sparse matrix operations, and ranking as computational kernels, we use our runtime to automatically retarget SSI [25], K-means [32] and two synthetic applications to the above platform with no code changes. We find that, in most cases, performance improves if the computation is moved to the data, and not vice versa. For instance, even if a particular instance of a kernel is slower on the GPU than on the CPU, the overall application may be faster if the kernel is scheduled on the GPU anyway, especially if the kernel data is already located in GPU memory due to prior decisions. Our results show that data-aware CPU/GPU scheduling improves performance by up to 25% over the best data-agnostic scheduling on the same platform.
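The "slower kernel can still win" effect can be made concrete with a small cost model. The numbers below are illustrative assumptions, not measurements from the paper: a kernel instance that runs slower on the GPU is still the better choice once the PCI-E transfer cost for non-resident data is charged to the CPU alternative.

```python
def total_time_ms(kernel_ms, bytes_to_copy, pcie_gb_per_s=3.0):
    """Kernel time plus PCI-E transfer time for data not already resident.

    pcie_gb_per_s is an assumed effective link bandwidth in GB/s;
    3.0 GB/s = 3e6 bytes per millisecond.
    """
    transfer_ms = bytes_to_copy / (pcie_gb_per_s * 1e6)
    return kernel_ms + transfer_ms

data_bytes = 256 * 1024 * 1024  # a 256 MB input, assumed resident on the GPU

on_gpu = total_time_ms(kernel_ms=12.0, bytes_to_copy=0)           # data in place
on_cpu = total_time_ms(kernel_ms=8.0, bytes_to_copy=data_bytes)   # must copy first

assert on_gpu < on_cpu  # the data-aware choice wins despite the slower kernel
```

Under these assumed numbers the CPU kernel is 4 ms faster in isolation, but copying 256 MB over the link costs roughly 90 ms, so running on the GPU wins overall.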

References

  1. K. Fatahalian et al., "Sequoia: Programming the memory hierarchy," in Proc. of the 2006 ACM/IEEE Conference on Supercomputing, Tampa, FL.
  2. T. J. Knight et al., "Compilation for Explicitly Managed Memory Hierarchies," in Proc. of PPoPP 2007, San Jose, CA.
  3. G. Diamos and S. Yalamanchili, "Harmony: an execution model and runtime for heterogeneous many core systems," in Proc. of HPDC 2008, New York, NY.
  4. C. Luk, S. Hong and H. Kim, "Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping," in Proc. of MICRO 2009, New York, NY.
  5. B. Saha et al., "Programming model for a heterogeneous x86 platform," in Proc. of PLDI 2009, Dublin, Ireland.
  6. I. Gelado et al., "CUBA: An Architecture for Efficient CPU/Co-processor Data Communication," in Proc. of ICS'08, Island of Kos, Greece.
  7. M. Becchi, S. Cadambi and S. T. Chakradhar, "Enabling Legacy Applications on Heterogeneous Platforms," in Proc. of HotPar 2010, Berkeley, CA, June 2010.
  8. CUDA documentation: http://www.nvidia.com/object/cuda_develop.html
  9. "Shattering the 1U Server Performance Record," http://www.supermicro.com/products/nfo/files/GPU/GPU_White_Paper.pdf
  10. AMD, AMD Stream SDK User Guide v 2.0, 2009.
  11. Intel, Intel Threading Building Blocks 2.2: http://www.threadingbuildingblocks.org
  12. A. Ghuloum et al., "Future-Proof Data Parallel Algorithms and Software on Intel Multi-Core Architecture," Intel Technology Journal 11, 4, 333-348, Nov 2007.
  13. D. Tarditi, S. Puri and J. Oglesby, "Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses," in Proc. of the 2006 ASPLOS, October 2006.
  14. Intel RapidMind, http://software.intel.com/en-us/articles/rapidmind
  15. Peakstream, "Peakstream Stream Platform API C++ Programming Guide v 1.0," May 2007.
  16. PGI, PGI Accelerator Compilers, http://www.pgroup.com/resources/accel.htm
  17. CAPS, HMPP Workbench, http://www.caps-entreprise.com/hmpp.html
  18. HPC Project, Par4All, http://www.par4all.org
  19. J. A. Stratton, S. S. Stone and W-m. W. Hwu, "MCUDA: An Efficient Implementation of CUDA Kernels for Multi-Core CPUs," in Proc. of the 2008 Workshop on Languages and Compilers for Parallel Computing, 2008.
  20. G. Diamos et al., "GPUocelot -- A Binary Translator Framework for GPGPU," http://code.google.com/p/gpuocelot
  21. S.-W. Liao et al., "Data and Computation Transformations for Brook Streaming Applications on Multiprocessors," in Proc. of the 4th Conference on CGO, March 2006.
  22. A. Munshi, "OpenCL: Parallel Computing on the GPU and CPU," in ACM SIGGRAPH 2008.
  23. K. O'Brien et al., "Supporting OpenMP on Cell," International Journal of Parallel Programming, 36, 289--311, 2008.
  24. M. D. Linderman et al., "Merge: A Programming Model for Heterogeneous Multi-core Systems," in Proc. of the 2008 ASPLOS, March 2008.
  25. B. Bai et al., "Learning to Rank with (a lot of) word features," Information Retrieval, Special Issue: Learning to Rank for Information Retrieval, 2009.
  26. Intel MKL: http://software.intel.com/en-us/intel-mkl
  27. CUBLAS Library documentation: http://developer.download.nvidia.com/compute/cuda/1_0/CUBLAS_Library_1.0.pdf
  28. T. Kosar, "A New Paradigm in Data Intensive Computing: Stork and the Data-Aware Schedulers," in Proc. of Challenges of Large Applications in Distributed Environments, 2006.
  29. J. Bent et al., "Coordination of Data Movement with Computation Scheduling on a Cluster," in Proc. of Challenges of Large Applications in Distributed Environments, 2005.
  30. G. Khanna, "A Data-Locality Aware Mapping and Scheduling Framework for Data-Intensive Computing," MS Thesis, Dept. of Computer Science and Engineering, The Ohio State University, 2008.
  31. NVIDIA, "CUDA SDK Code Examples," http://www.nvidia.com/object/cuda_get.html
  32. J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. of the Berkeley Symposium on Math. Stat. and Prob., pp. 281--297.
  33. S. P. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory 28 (2): pp. 129--137.
  34. C. Augonnet and R. Namyst, "A unified runtime system for heterogeneous multicore architectures," in Proc. of HPPC'08, Las Palmas de Gran Canaria, Spain, August 2008.
  35. C. Augonnet et al., "StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures," in Proc. of the 15th International Euro-Par Conference, Delft, The Netherlands, August 2009.
  36. http://www.cise.ufl.edu/research/sparse/matrices/

Published in

SPAA '10: Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
June 2010, 378 pages
ISBN: 9781450300797
DOI: 10.1145/1810479

Copyright © 2010 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 447 of 1,461 submissions (31%).
