
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

Published: 20 June 2009

Abstract

GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exhaustively exploring the design space without fully understanding the performance characteristics of their applications.

To provide insights into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (which we call memory warp parallelism) by considering the number of running threads and the memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests and thereby the overall execution time of a program. Comparisons between the model's estimates and actual execution times on several GPUs show that the geometric mean of the absolute error of our model is 5.4% on micro-benchmarks and 13.3% on GPU computing applications. All the applications are written in the CUDA programming language.
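The abstract describes the model only at a high level: derive the memory warp parallelism (MWP) from the number of running warps and the available memory bandwidth, then use MWP to price memory requests and, from that, the total execution time. The Python sketch below illustrates one plausible shape for such a model; the parameter names, the compute warp parallelism (CWP) term, and the two-case cost formula are simplifying assumptions for illustration, not the paper's published equations.

```python
# A minimal, illustrative sketch of an MWP-style cost model. All names and
# the exact case split are hypothetical simplifications, not the paper's
# published equations.

def estimate_exec_cycles(n_warps, mem_latency, departure_delay,
                         comp_cycles, mem_cycles, n_mem_insts,
                         mwp_peak_bw):
    """Estimate per-SM execution cycles for one batch of warps.

    n_warps:          warps concurrently running on one SM
    mem_latency:      round-trip latency of one memory request (cycles)
    departure_delay:  cycles between consecutive memory warp departures
    comp_cycles:      computation cycles per warp
    mem_cycles:       memory waiting cycles per warp
    n_mem_insts:      memory instructions per warp
    mwp_peak_bw:      bandwidth-limited cap on concurrent memory warps
    """
    # Memory warp parallelism: how many memory requests can be in flight,
    # bounded by latency hiding, peak bandwidth, and the running warps.
    mwp = min(mem_latency / departure_delay, mwp_peak_bw, n_warps)

    # Compute warp parallelism: how many warps' computation periods fit
    # into one memory waiting period.
    cwp = min((mem_cycles + comp_cycles) / comp_cycles, n_warps)

    if cwp >= mwp:
        # Memory-bound: requests drain in groups of MWP, so the memory
        # cost dominates and computation largely overlaps with it.
        return mem_cycles * n_warps / mwp + (comp_cycles / n_mem_insts) * (mwp - 1)
    # Compute-bound: computation from other warps hides all but one
    # exposed memory latency.
    return comp_cycles * n_warps + mem_latency


# Example: 24 warps and a 420-cycle memory latency, with bandwidth loose
# enough that latency hiding is the binding constraint on MWP.
print(estimate_exec_cycles(n_warps=24, mem_latency=420, departure_delay=10,
                           comp_cycles=100, mem_cycles=420, n_mem_insts=2,
                           mwp_peak_bw=32.0))
```

The case split encodes the abstract's core idea: the cost of memory requests depends on how many of them can be in flight at once, so the same kernel is priced differently depending on whether memory or computation is the bottleneck.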



• Published in

ACM SIGARCH Computer Architecture News, Volume 37, Issue 3 (June 2009), 495 pages. ISSN: 0163-5964. DOI: 10.1145/1555815.

Also in ISCA '09: Proceedings of the 36th Annual International Symposium on Computer Architecture (June 2009), 510 pages. ISBN: 978-1-60558-526-0. DOI: 10.1145/1555754.

Copyright © 2009 ACM.


Publisher

Association for Computing Machinery, New York, NY, United States.

