Adaptive heterogeneous scheduling for integrated GPUs

Published: 24 August 2014

ABSTRACT

Many processors today integrate a CPU and a GPU on the same die, which allows them to share resources such as physical memory and lowers the cost of CPU-GPU communication. As a consequence, programmers can effectively use both the CPU and the GPU to execute a single application. This paper presents novel adaptive scheduling techniques for integrated CPU-GPU processors. We present two online profiling-based scheduling algorithms: naïve and asymmetric. Our asymmetric scheduling algorithm uses low-overhead online profiling to automatically partition the work of data-parallel kernels between the CPU and the GPU without input from application developers. It profiles the CPU and the GPU in a way that does not penalize GPU-centric workloads that run significantly faster on the GPU. It adapts to application characteristics by addressing: 1) load imbalance due to irregularity caused by, for example, data-dependent control flow; 2) different amounts of work on each kernel call; and 3) multiple kernels with different characteristics. Unlike many existing approaches, which primarily target NVIDIA discrete GPUs, our scheduling algorithm requires no offline processing.
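To make the profiling-and-partitioning idea concrete, the sketch below shows one way an online profiling scheduler could split a data-parallel kernel's iteration space between the CPU and GPU. It is a minimal C++ illustration, not the authors' implementation: the run_on_cpu/run_on_gpu range runners, the probe sizes, and the specific asymmetric choice of a larger GPU probe are assumptions made for the example.

#include <chrono>
#include <cstddef>

using KernelRange = void (*)(std::size_t, std::size_t);

// Time one probe run and return its throughput in iterations per second.
double measure_rate(KernelRange run, std::size_t begin, std::size_t end) {
    auto t0 = std::chrono::steady_clock::now();
    run(begin, end);
    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    return static_cast<double>(end - begin) / seconds;
}

// Partition one kernel invocation of n iterations between the CPU and GPU.
void schedule_kernel(std::size_t n, KernelRange run_on_cpu, KernelRange run_on_gpu) {
    // Profile a small slice on each device. The asymmetric probe sizes are an
    // illustrative assumption: giving the GPU the larger slice keeps a slow
    // CPU probe from delaying kernels that run much faster on the GPU.
    std::size_t cpu_probe = n / 64;
    std::size_t gpu_probe = n / 8;
    double cpu_rate = measure_rate(run_on_cpu, 0, cpu_probe);
    double gpu_rate = measure_rate(run_on_gpu, cpu_probe, cpu_probe + gpu_probe);

    // Split the remaining iterations in proportion to the measured rates.
    std::size_t done = cpu_probe + gpu_probe;
    std::size_t remaining = n - done;
    auto cpu_count = static_cast<std::size_t>(remaining * cpu_rate / (cpu_rate + gpu_rate));

    // A real runtime would execute these two halves concurrently, for example the
    // CPU half on worker threads while the GPU half is enqueued asynchronously,
    // and would re-profile on later calls to adapt to changing kernel behavior.
    run_on_cpu(done, done + cpu_count);
    run_on_gpu(done + cpu_count, n);
}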

We evaluate our asymmetric scheduling algorithm on a desktop system with an Intel 4th generation Core processor, using a set of sixteen regular and irregular workloads from diverse application areas. On average, our asymmetric scheduling algorithm performs within 3.2% of the maximum throughput achievable with a perfect CPU-and-GPU oracle that always chooses the ideal work partitioning between the CPU and GPU. These results underscore the feasibility of online profile-based heterogeneous scheduling on integrated CPU-GPU processors.


Published in

PACT '14: Proceedings of the 23rd international conference on Parallel architectures and compilation
August 2014
514 pages
ISBN: 978-1-4503-2809-8
DOI: 10.1145/2628071

Copyright © 2014 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 24 August 2014


        Acceptance Rates

PACT '14 paper acceptance rate: 54 of 144 submissions, 38%. Overall acceptance rate: 121 of 471 submissions, 26%.

