ABSTRACT
OpenCL has been designed to achieve functional portability across multi-core devices from different vendors. However, the lack of a single cross-target optimizing compiler severely limits the performance portability of OpenCL programs: programmers must manually tune applications for each specific device, which defeats effective portability. We target a compiler transformation specific to data-parallel languages, thread coarsening, and show that it can improve performance across different GPU devices. We then address the problem of selecting the best value for the coarsening-factor parameter, i.e., deciding how many threads to merge together. We show experimentally that this is a hard problem: good configurations are difficult to find, and naive coarsening in fact leads to substantial slowdowns. We propose a solution based on a machine-learning model that predicts the best coarsening factor from static features of the kernel function. The model automatically specializes to each of the architectures considered. We evaluate our approach on 17 benchmarks on four devices: two Nvidia GPUs and two different generations of AMD GPUs. Using our technique, we achieve average speedups between 1.11x and 1.33x.
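To make the transformation concrete, the following is a minimal plain-C sketch of what factor-2 thread coarsening does to a data-parallel kernel. The function names and the vector-scaling workload are illustrative assumptions, not the paper's implementation; `gid` stands in for an OpenCL work-item's `get_global_id(0)`.

```c
#include <stddef.h>

/* Baseline "kernel": each work-item gid computes one output element.
 * In OpenCL this body would run once per work-item in the NDRange. */
static void vec_scale(const float *in, float *out, size_t gid) {
    out[gid] = 2.0f * in[gid];
}

/* The same kernel coarsened by a factor of 2: each work-item now
 * performs the work of two original work-items, so the launch needs
 * only half as many work-items. Larger factors merge more threads,
 * trading parallelism for fewer redundant instructions per element. */
static void vec_scale_coarse2(const float *in, float *out, size_t gid) {
    out[2 * gid]     = 2.0f * in[2 * gid];
    out[2 * gid + 1] = 2.0f * in[2 * gid + 1];
}
```

Both versions compute identical results; whether the coarsened one is faster depends on the kernel and the target GPU, which is exactly why the coarsening factor is treated as a tuning parameter.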
Automatic optimization of thread-coarsening for graphics processors