DOI: 10.1145/2493123.2462911
Research article

A preemption-based runtime to efficiently schedule multi-process applications on heterogeneous clusters with GPUs

Published: 17 June 2013

ABSTRACT

In the last few years, thanks to their computational power, their steadily improving programmability, and their wide adoption in both the research community and industry, GPUs have become part of HPC clusters (for example, the US Titan and Stampede and the Chinese Tianhe-1A supercomputers). As a result, widely used open-source cluster resource managers (e.g., SLURM and TORQUE) have recently been extended with GPU support. These systems, however, provide only simple scheduling mechanisms that often result in resource underutilization and, consequently, in suboptimal performance. In this paper, we propose a runtime system that can be integrated with existing cluster resource managers to enable more efficient use of heterogeneous clusters with GPUs. Unlike previous work, we focus on multi-process GPU applications that include synchronization (for example, hybrid MPI-CUDA applications). We discuss the limitations and inefficiencies of existing scheduling and resource-sharing schemes in the presence of synchronization, and we show that preemption is an effective mechanism for efficiently scheduling hybrid MPI-CUDA applications. We validate our runtime on a variety of benchmark programs with different computation and communication patterns.
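To illustrate the class of applications the abstract refers to, the following is a minimal sketch (not taken from the paper) of a typical hybrid MPI-CUDA pattern: each rank alternates a GPU kernel with a collective synchronization, so a rank whose kernel is delayed on a shared or busy GPU stalls every other rank at the collective. The kernel name, buffer size, and iteration count below are illustrative assumptions, not the authors' benchmarks.

/* Minimal sketch of a hybrid MPI-CUDA application with synchronization.
 * Kernel and variable names are illustrative only. Compile with nvcc and
 * an MPI wrapper, e.g.: nvcc -ccbin mpicxx sketch.cu -o sketch */
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void scale(float *buf, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= factor;              /* illustrative per-element work */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaMemset(d_buf, 0, n * sizeof(float));

    float local = 0.0f, global = 0.0f;
    for (int iter = 0; iter < 10; ++iter) {
        scale<<<(n + 255) / 256, 256>>>(d_buf, n, 1.01f);  /* GPU phase */
        cudaDeviceSynchronize();                           /* wait for the GPU */
        /* Collective phase: no rank proceeds until every rank's kernel has
         * finished, so GPU queuing delays on one node hold up the whole job. */
        MPI_Allreduce(&local, &global, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

With simple space- or time-sharing of the GPU and no preemption, the kernel of one rank can sit behind another job's long-running kernel; because of the Allreduce, that single delay becomes a whole-application delay, which is the inefficiency the proposed preemption-based runtime targets.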



Published in

HPDC '13: Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing
June 2013, 276 pages
ISBN: 9781450319102
DOI: 10.1145/2493123
General Chairs: Manish Parashar, Jon Weissman
Program Chairs: Dick Epema, Renato Figueiredo
      Copyright © 2013 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 17 June 2013


      Qualifiers

      • research-article

      Acceptance Rates

HPDC '13 paper acceptance rate: 20 of 131 submissions (15%). Overall acceptance rate: 166 of 966 submissions (17%).

