ABSTRACT
In the last few years, thanks to their computational power, their progressively increasing programmability, and their wide adoption in both the research community and industry, GPUs have become part of HPC clusters (for example, the US Titan and Stampede and the Chinese Tianhe-1A supercomputers). As a result, widely used open-source cluster resource managers (e.g. SLURM and TORQUE) have recently been extended with GPU support capabilities. These systems, however, provide simple scheduling mechanisms that often result in resource underutilization and, consequently, in suboptimal performance. In this paper, we propose a runtime system that can be integrated with existing cluster resource managers to enable a more efficient use of heterogeneous clusters with GPUs. Unlike previous work, we focus on multi-process GPU applications that include synchronization (for example, hybrid MPI-CUDA applications). We discuss the limitations and inefficiencies of existing scheduling and resource sharing schemes in the presence of synchronization. We show that preemption is an effective mechanism to allow efficient scheduling of hybrid MPI-CUDA applications. We validate our runtime on a variety of benchmark programs with different computation and communication patterns.
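The abstract's central claim, that run-to-completion GPU sharing penalizes synchronizing applications and that preemption recovers the lost time, can be illustrated with a toy single-GPU model. This is a hypothetical sketch for intuition only, not the paper's runtime: application A submits one long kernel, application B submits short kernels each followed by an MPI-style synchronization with a remote peer (modeled as free), and the `slice_len` parameter and workload sizes are assumptions.

```python
def finish_times(slice_len=None):
    """Toy model of one GPU shared by two applications.

    App A enqueues a single 10-unit kernel; app B enqueues three 1-unit
    kernels, each followed by an MPI-style synchronization with a peer
    process on another node (modeled here as taking zero time).

    slice_len=None -> run-to-completion (no preemption)
    slice_len=1    -> round-robin with 1-unit preemption slices
    Returns the completion time of each application.
    """
    remaining = {"A": 10, "B": 3}   # GPU work units left per application
    queue = ["A", "B"]              # A was enqueued first
    t, done = 0, {}
    while queue:
        app = queue.pop(0)
        run = remaining[app] if slice_len is None else min(slice_len, remaining[app])
        t += run
        remaining[app] -= run
        if remaining[app] == 0:
            done[app] = t           # B's remote peers stop idling here
        else:
            queue.append(app)       # preempted: back of the queue
    return done
```

Without preemption, B (and the peer processes blocked on its synchronizations) waits behind A's entire kernel and finishes at t=13; with 1-unit slices, B finishes at t=6 while A still finishes at t=13. The synchronization-bound application's remote peers idle far less, at no cost to A's completion time.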
A preemption-based runtime to efficiently schedule multi-process applications on heterogeneous clusters with GPUs
HPDC '13: Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing