DOI: 10.1145/2628071.2628087
research-article

Automatic optimization of thread-coarsening for graphics processors

Published: 24 August 2014

ABSTRACT

OpenCL has been designed to achieve functional portability across multi-core devices from different vendors. However, the lack of a single cross-target optimizing compiler severely limits the performance portability of OpenCL programs. Programmers need to manually tune applications for each specific device, which prevents effective portability. We target a compiler transformation specific to data-parallel languages, thread coarsening, and show that it can improve performance across different GPU devices. We then address the problem of selecting the best value for the coarsening factor parameter, i.e., deciding how many threads to merge together. We show experimentally that this is a hard problem to solve: good configurations are difficult to find, and naive coarsening in fact leads to substantial slowdowns. We propose a solution based on a machine-learning model that predicts the best coarsening factor using static features of the kernel function. The model automatically specializes to the different architectures considered. We evaluate our approach on 17 benchmarks on four devices: two Nvidia GPUs and two different generations of AMD GPUs. Using our technique, we achieve speedups between 1.11X and 1.33X on average.
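To make the idea of merging threads concrete, the following is a minimal sketch in plain Python that emulates a one-dimensional data-parallel launch. It is not the paper's compiler pass: the helper names (`run_kernel`, `scale_kernel`, `coarsen`) are hypothetical, and the contiguous mapping from a merged thread to the original thread ids is an assumption chosen for simplicity. It only illustrates that a kernel coarsened by a factor of 4 and launched with a quarter of the threads computes the same result as the original.

```python
def run_kernel(kernel, num_threads, *args):
    """Emulate a 1-D data-parallel launch: call the kernel once per thread id."""
    for tid in range(num_threads):
        kernel(tid, *args)


def scale_kernel(tid, src, dst):
    """Original kernel: each thread handles exactly one element."""
    dst[tid] = 2.0 * src[tid]


def coarsen(kernel, factor):
    """Return a kernel in which one thread does the work of `factor` original threads.

    Assumption: merged thread `tid` covers the contiguous original ids
    tid*factor .. tid*factor + factor - 1.
    """
    def coarsened(tid, *args):
        for k in range(factor):
            kernel(tid * factor + k, *args)
    return coarsened


n = 16                      # problem size; assumed divisible by the factor
src = [float(i) for i in range(n)]

# Baseline launch: n threads, one element each.
baseline = [0.0] * n
run_kernel(scale_kernel, n, src, baseline)

# Coarsened launch: n // factor threads, `factor` elements each.
factor = 4
merged = [0.0] * n
run_kernel(coarsen(scale_kernel, factor), n // factor, src, merged)

assert merged == baseline
```

The coarsening factor is the `factor` argument here. The emulation says nothing about performance; on a real GPU the best value depends on the kernel and the device, which is the selection problem the abstract describes.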


    • Published in

      PACT '14: Proceedings of the 23rd international conference on Parallel architectures and compilation
      August 2014
      514 pages
      ISBN:9781450328098
      DOI:10.1145/2628071

      Copyright © 2014 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      PACT '14 paper acceptance rate: 54 of 144 submissions, 38%. Overall acceptance rate: 121 of 471 submissions, 26%.
