Abstract
GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a major challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures well enough to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exhaustively exploring the design space without fully understanding the performance characteristics of their applications.
To provide insights into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (we call this the memory warp parallelism) from the number of running threads and the memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests and, from that, the overall execution time of a program. Comparing the model's estimates against actual execution times on several GPUs, the geometric mean of the absolute error of our model is 5.4% on micro-benchmarks and 13.3% on GPU computing applications. All the applications are written in the CUDA programming language.
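Since the abstract only outlines the model, the sketch below makes its structure concrete: it computes memory warp parallelism (MWP) from memory latency, request spacing, peak bandwidth, and the warp count, pairs it with a computation-side analogue (CWP), and selects one of three cost formulas. This is a minimal sketch under our reading of the abstract; the parameter names, default values, and the exact case formulas are illustrative assumptions, not the paper's published equations.

```python
# A minimal sketch of the analytical model outlined in the abstract.
# All parameters and the three-case cost structure are illustrative
# assumptions, not the paper's exact notation or constants.

def estimate_exec_cycles(n_warps, comp_cycles, mem_cycles, mem_latency,
                         departure_delay, mwp_peak_bw, n_mem_insts,
                         n_reps=1):
    """Estimate execution cycles for one streaming multiprocessor."""
    # Memory warp parallelism (MWP): how many warps can have memory
    # requests in flight during one memory waiting period, bounded by
    # latency/request spacing, peak bandwidth, and the warp count.
    mwp = min(mem_latency / departure_delay, mwp_peak_bw, n_warps)

    # Computation warp parallelism (CWP): how many warps' computation
    # fits into one warp's computation-plus-memory waiting period.
    cwp = min((mem_cycles + comp_cycles) / comp_cycles, n_warps)

    comp_per_mem = comp_cycles / n_mem_insts
    if mwp == n_warps and cwp == n_warps:
        # Enough warps to overlap both memory and computation.
        cycles = mem_cycles + comp_cycles + comp_per_mem * (mwp - 1)
    elif cwp >= mwp:
        # Memory-bound: warps access memory in groups of MWP, so the
        # memory waiting periods serialize in n_warps / mwp batches.
        cycles = mem_cycles * (n_warps / mwp) + comp_per_mem * (mwp - 1)
    else:
        # Computation-bound: a single memory period is fully hidden
        # behind the serialized computation of all warps.
        cycles = mem_latency + comp_cycles * n_warps
    return cycles * n_reps


if __name__ == "__main__":
    # Hypothetical memory-bound kernel: bandwidth caps MWP well below
    # the number of available warps.
    print(estimate_exec_cycles(n_warps=24, comp_cycles=100,
                               mem_cycles=400, mem_latency=420,
                               departure_delay=10, mwp_peak_bw=2,
                               n_mem_insts=2, n_reps=100))
```

The point to notice is the min() terms: once bandwidth or request spacing caps MWP below the number of running warps, additional warps add whole memory waiting periods instead of hiding them, which is why the memory-bound case charges n_warps / mwp serialized memory periods.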