DOI: 10.1145/2063384.2063400

CudaDMA: optimizing GPU memory bandwidth via warp specialization

Published: 12 November 2011

ABSTRACT

As the computational power of GPUs continues to scale with Moore's Law, an increasing number of applications are becoming limited by memory bandwidth. We propose an approach for programming GPUs with tightly-coupled specialized DMA warps for performing memory transfers between on-chip and off-chip memories. Separate DMA warps improve memory bandwidth utilization by better exploiting available memory-level parallelism and by leveraging efficient inter-warp producer-consumer synchronization mechanisms. DMA warps also improve programmer productivity by decoupling the need for thread array shapes to match data layout. To illustrate the benefits of this approach, we present an extensible API, CudaDMA, that encapsulates synchronization and common sequential and strided data transfer patterns. Using CudaDMA, we demonstrate speedup of up to 1.37x on representative synthetic microbenchmarks, and 1.15x-3.2x on several kernels from scientific applications written in CUDA running on NVIDIA Fermi GPUs.
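The warp-specialization pattern the abstract describes can be illustrated with a small kernel. The sketch below is not the CudaDMA library API; it is a hand-written example of the underlying idea, using PTX named barriers (`bar.sync`/`bar.arrive`) for the inter-warp producer-consumer synchronization. All sizes and names here are illustrative assumptions: one DMA warp stages tiles of `x` into shared memory while three compute warps consume them, so the compute warps' shape is decoupled from the data-transfer pattern.

```cuda
// Sketch of warp specialization with DMA warps (illustrative, not the
// CudaDMA API). Assumes n is a multiple of TILE for brevity.

#define COMPUTE_THREADS 96                      // three compute warps (assumed)
#define DMA_THREADS     32                      // one DMA warp (assumed)
#define TILE            COMPUTE_THREADS
#define TOTAL           (COMPUTE_THREADS + DMA_THREADS)

// Named barriers: bar.sync blocks until `count` threads have arrived at
// barrier `name`; bar.arrive signals arrival without blocking.
__device__ void named_barrier_sync(int name, int count) {
    asm volatile("bar.sync %0, %1;" :: "r"(name), "r"(count));
}
__device__ void named_barrier_arrive(int name, int count) {
    asm volatile("bar.arrive %0, %1;" :: "r"(name), "r"(count));
}

__global__ void saxpy_warp_specialized(const float *x, float *y,
                                       float a, int n) {
    __shared__ float buf[TILE];
    const bool is_dma = (threadIdx.x >= COMPUTE_THREADS);

    for (int base = blockIdx.x * TILE; base < n; base += gridDim.x * TILE) {
        if (is_dma) {
            // DMA warp: wait until compute warps request the next tile
            // (barrier 0), stage it, then signal it is ready (barrier 1).
            named_barrier_sync(0, TOTAL);
            for (int j = threadIdx.x - COMPUTE_THREADS; j < TILE;
                 j += DMA_THREADS)
                buf[j] = x[base + j];
            __threadfence_block();              // make stores visible first
            named_barrier_arrive(1, TOTAL);
        } else {
            // Compute warps: request the tile without blocking (barrier 0),
            // then wait until the DMA warp has staged it (barrier 1).
            named_barrier_arrive(0, TOTAL);
            named_barrier_sync(1, TOTAL);
            int i = base + threadIdx.x;
            y[i] = a * buf[threadIdx.x] + y[i];
        }
    }
}
```

Because the branch splits the thread block at a warp boundary, DMA and compute warps execute disjoint code without intra-warp divergence, and the non-blocking `bar.arrive` lets compute warps overlap independent work with the transfer.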


  • Published in

    SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2011
    866 pages
    ISBN:9781450307710
    DOI:10.1145/2063384

    Copyright © 2011 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Qualifiers

    • research-article

    Acceptance Rates

SC '11 Paper Acceptance Rate: 74 of 352 submissions, 21%. Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%.
