Abstract
GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a major challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures well enough to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exhaustively exploring the design space without fully understanding the performance characteristics of their applications.
To provide insights into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (we call this the memory warp parallelism) from the number of running threads and the memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests and, from that, the overall execution time of a program. Comparing the model's estimates against actual execution times on several GPUs, the geometric mean of the absolute error of our model is 5.4% on micro-benchmarks and 13.3% on GPU computing applications. All the applications are written in the CUDA programming language.
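Since the abstract only outlines the model, the sketch below makes its structure concrete: it computes memory warp parallelism (MWP) from memory latency, request spacing, peak bandwidth, and the warp count, pairs it with a computation-side analogue (CWP), and selects one of three cost formulas. This is a minimal sketch under our reading of the abstract; the parameter names, default values, and the exact case formulas are illustrative assumptions, not the paper's published equations.

```python
# A minimal sketch of the analytical model outlined in the abstract.
# All parameters and the three-case cost structure are illustrative
# assumptions, not the paper's exact notation or constants.

def estimate_exec_cycles(n_warps, comp_cycles, mem_cycles, mem_latency,
                         departure_delay, mwp_peak_bw, n_mem_insts,
                         n_reps=1):
    """Estimate execution cycles for one streaming multiprocessor."""
    # Memory warp parallelism (MWP): how many warps can have memory
    # requests in flight during one memory waiting period, bounded by
    # latency/request spacing, peak bandwidth, and the warp count.
    mwp = min(mem_latency / departure_delay, mwp_peak_bw, n_warps)

    # Computation warp parallelism (CWP): how many warps' computation
    # fits into one warp's computation-plus-memory waiting period.
    cwp = min((mem_cycles + comp_cycles) / comp_cycles, n_warps)

    comp_per_mem = comp_cycles / n_mem_insts
    if mwp == n_warps and cwp == n_warps:
        # Enough warps to overlap both memory and computation.
        cycles = mem_cycles + comp_cycles + comp_per_mem * (mwp - 1)
    elif cwp >= mwp:
        # Memory-bound: warps access memory in groups of MWP, so the
        # memory waiting periods serialize in n_warps / mwp batches.
        cycles = mem_cycles * (n_warps / mwp) + comp_per_mem * (mwp - 1)
    else:
        # Computation-bound: a single memory period is fully hidden
        # behind the serialized computation of all warps.
        cycles = mem_latency + comp_cycles * n_warps
    return cycles * n_reps


if __name__ == "__main__":
    # Hypothetical memory-bound kernel: bandwidth caps MWP well below
    # the number of available warps.
    print(estimate_exec_cycles(n_warps=24, comp_cycles=100,
                               mem_cycles=400, mem_latency=420,
                               departure_delay=10, mwp_peak_bw=2,
                               n_mem_insts=2, n_reps=100))
```

The point to notice is the min() terms: once bandwidth or request spacing caps MWP below the number of running warps, additional warps add whole memory waiting periods instead of hiding them, which is why the memory-bound case charges n_warps / mwp serialized memory periods.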