Interference-driven resource management for GPU-based heterogeneous clusters

ABSTRACT
GPU-based clusters are increasingly being deployed in HPC environments to accelerate a variety of scientific applications. Despite their growing popularity, the GPU devices themselves are under-utilized even for many computationally intensive jobs. This stems from the fact that the typical GPU usage model is one in which a host processor periodically offloads computationally intensive portions of an application to the coprocessor. Since some portions of code cannot be offloaded to the GPU (for example, code performing network communication in MPI applications), this usage model results in periods of time when the GPU is idle. GPUs could be time-shared across jobs to "fill" these idle periods, but unlike CPU resources such as the cache, the effects of sharing the GPU are not well understood. Specifically, two jobs that time-share a single GPU will experience resource contention and interfere with each other. The resulting slowdown could lead to missed job deadlines. Current cluster managers do not support GPU sharing; instead, they dedicate GPUs to a job for the job's lifetime.
In this paper, we present a framework to predict and handle interference when two or more jobs time-share GPUs in HPC clusters. Our framework consists of an analysis model and a dynamic interference detection and response mechanism that detects excessive interference and restarts the interfering jobs on different nodes. We implement our framework in Torque, an open-source cluster manager, and, using real workloads on an HPC cluster, show that interference-aware two-job colocation (although our method is applicable to colocating more than two jobs) improves GPU utilization by 25%, reduces a job's waiting time in the queue by 39%, and improves job latencies by around 20%.
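The detection-and-response idea described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the `Job` fields, the slowdown ratio, and the 1.3x threshold are all assumptions chosen for the example. The premise is that the cluster manager knows (or has profiled) each job's solo runtime, compares it with the runtime observed under GPU sharing, and flags jobs whose slowdown is excessive so they can be restarted on different nodes.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    solo_time: float       # runtime measured with a dedicated GPU (baseline)
    observed_time: float   # runtime observed (or projected) while time-sharing a GPU

def excessive_interference(job: Job, max_slowdown: float = 1.3) -> bool:
    """True if the colocated job runs more than max_slowdown times slower
    than its solo baseline (hypothetical threshold; a real system might
    derive it from job deadlines)."""
    return job.observed_time / job.solo_time > max_slowdown

def jobs_to_restart(colocated: list[Job], max_slowdown: float = 1.3) -> list[str]:
    """Select jobs whose measured interference exceeds the threshold;
    the scheduler would relaunch these on different nodes."""
    return [j.name for j in colocated if excessive_interference(j, max_slowdown)]

# Example: a 1.18x slowdown is tolerated, a 1.5x slowdown triggers a restart.
jobs = [Job("jobA", 100.0, 118.0), Job("jobB", 80.0, 120.0)]
print(jobs_to_restart(jobs))  # -> ['jobB']
```

In practice the "observed" runtime would come from online monitoring rather than a completed run, and the response (restart on a different node) is what distinguishes this scheme from simply refusing to colocate; the analysis model predicts likely-compatible pairs up front, and the detector catches the mispredictions.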