ABSTRACT
Many processors today integrate a CPU and a GPU on the same die, allowing them to share resources such as physical memory and lowering the cost of CPU-GPU communication. As a consequence, programmers can effectively utilize both the CPU and the GPU to execute a single application. This paper presents novel adaptive scheduling techniques for integrated CPU-GPU processors. We present two online profiling-based scheduling algorithms: naïve and asymmetric. Our asymmetric scheduling algorithm uses low-overhead online profiling to automatically partition the work of data-parallel kernels between the CPU and GPU without input from application developers. It profiles the CPU and GPU in a way that does not penalize GPU-centric workloads, which run significantly faster on the GPU. It adapts to application characteristics by addressing: 1) load imbalance due to irregularity caused by, for example, data-dependent control flow; 2) varying amounts of work across kernel calls; and 3) multiple kernels with different characteristics. Unlike many existing approaches, which primarily target NVIDIA discrete GPUs, our scheduling algorithm requires no offline processing.
We evaluate our asymmetric scheduling algorithm on a desktop system with an Intel 4th generation Core processor, using a set of sixteen regular and irregular workloads from diverse application areas. On average, our asymmetric scheduling algorithm achieves throughput within 3.2% of that of a perfect oracle that always chooses the ideal work partitioning between the CPU and GPU. These results underscore the feasibility of online profile-based heterogeneous scheduling on integrated CPU-GPU processors.
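The core idea of profiling-based partitioning can be illustrated with a minimal sketch: briefly time a small chunk of the kernel on each device, then split the remaining iterations in proportion to the measured throughputs. This is a hypothetical illustration, not the paper's actual algorithm; the function names, the fixed profiling fractions, and the proportional-split rule are all assumptions made for clarity.

```python
import time

def profile_rate(run_chunk, items):
    """Time a device on a small profiling chunk; return throughput (items/sec)."""
    start = time.perf_counter()
    run_chunk(items)
    elapsed = time.perf_counter() - start
    return items / elapsed if elapsed > 0 else float("inf")

def split_work(total_items, cpu_rate, gpu_rate):
    """Partition remaining iterations in proportion to measured throughput."""
    cpu_share = int(total_items * cpu_rate / (cpu_rate + gpu_rate))
    return cpu_share, total_items - cpu_share

def schedule(total_items, run_cpu_chunk, run_gpu_chunk,
             cpu_profile_frac=0.05, gpu_profile_frac=0.20):
    # Asymmetric profiling (assumed here): give the GPU a larger profiling
    # chunk so GPU-centric workloads are not held back by a slow CPU probe.
    cpu_items = max(1, int(total_items * cpu_profile_frac))
    gpu_items = max(1, int(total_items * gpu_profile_frac))
    cpu_rate = profile_rate(run_cpu_chunk, cpu_items)
    gpu_rate = profile_rate(run_gpu_chunk, gpu_items)
    remaining = total_items - cpu_items - gpu_items
    return split_work(remaining, cpu_rate, gpu_rate)
```

For example, if profiling measures the GPU running three times faster than the CPU, `split_work(100, 1.0, 3.0)` assigns 25 iterations to the CPU and 75 to the GPU. A real implementation would also re-profile across kernel calls to adapt to varying work per call, as the abstract describes.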