ABSTRACT
Many processors today integrate a CPU and a GPU on the same die, allowing them to share resources such as physical memory and lowering the cost of CPU-GPU communication. As a consequence, programmers can effectively utilize both the CPU and the GPU to execute a single application. This paper presents novel adaptive scheduling techniques for integrated CPU-GPU processors. We present two online profiling-based scheduling algorithms: naïve and asymmetric. Our asymmetric scheduling algorithm uses low-overhead online profiling to automatically partition the work of data-parallel kernels between the CPU and GPU without input from application developers. It profiles the CPU and GPU in a way that does not penalize GPU-centric workloads, which run significantly faster on the GPU. It adapts to application characteristics by addressing: 1) load imbalance due to irregularity caused by, for example, data-dependent control flow; 2) varying amounts of work across kernel calls; and 3) multiple kernels with different characteristics. Unlike many existing approaches, which primarily target NVIDIA discrete GPUs, our scheduling algorithm requires no offline processing.
We evaluate our asymmetric scheduling algorithm on a desktop system with an Intel 4th generation Core processor, using a set of sixteen regular and irregular workloads from diverse application areas. On average, our asymmetric scheduling algorithm achieves throughput within 3.2% of that of a perfect oracle that always chooses the ideal work partitioning between the CPU and GPU. These results underscore the feasibility of online profile-based heterogeneous scheduling on integrated CPU-GPU processors.
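The core idea of profiling-based partitioning can be illustrated with a minimal sketch: briefly time a small chunk of the kernel on each device, then split the remaining iterations in proportion to the measured throughputs. This is a hypothetical illustration, not the paper's actual algorithm; the function names, the fixed profiling fractions, and the proportional-split rule are all assumptions made for clarity.

```python
import time

def profile_rate(run_chunk, items):
    """Time a device on a small profiling chunk; return throughput (items/sec)."""
    start = time.perf_counter()
    run_chunk(items)
    elapsed = time.perf_counter() - start
    return items / elapsed if elapsed > 0 else float("inf")

def split_work(total_items, cpu_rate, gpu_rate):
    """Partition remaining iterations in proportion to measured throughput."""
    cpu_share = int(total_items * cpu_rate / (cpu_rate + gpu_rate))
    return cpu_share, total_items - cpu_share

def schedule(total_items, run_cpu_chunk, run_gpu_chunk,
             cpu_profile_frac=0.05, gpu_profile_frac=0.20):
    # Asymmetric profiling (assumed here): give the GPU a larger profiling
    # chunk so GPU-centric workloads are not held back by a slow CPU probe.
    cpu_items = max(1, int(total_items * cpu_profile_frac))
    gpu_items = max(1, int(total_items * gpu_profile_frac))
    cpu_rate = profile_rate(run_cpu_chunk, cpu_items)
    gpu_rate = profile_rate(run_gpu_chunk, gpu_items)
    remaining = total_items - cpu_items - gpu_items
    return split_work(remaining, cpu_rate, gpu_rate)
```

For example, if profiling measures the GPU running three times faster than the CPU, `split_work(100, 1.0, 3.0)` assigns 25 iterations to the CPU and 75 to the GPU. A real implementation would also re-profile across kernel calls to adapt to varying work per call, as the abstract describes.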