ABSTRACT
As the computational power of GPUs continues to scale with Moore's Law, an increasing number of applications are becoming limited by memory bandwidth. We propose an approach for programming GPUs with tightly-coupled specialized DMA warps for performing memory transfers between on-chip and off-chip memories. Separate DMA warps improve memory bandwidth utilization by better exploiting available memory-level parallelism and by leveraging efficient inter-warp producer-consumer synchronization mechanisms. DMA warps also improve programmer productivity by decoupling thread array shapes from data layout. To illustrate the benefits of this approach, we present an extensible API, CudaDMA, that encapsulates synchronization and common sequential and strided data transfer patterns. Using CudaDMA, we demonstrate speedups of up to 1.37x on representative synthetic microbenchmarks, and 1.15x-3.2x on several kernels from scientific applications written in CUDA running on NVIDIA Fermi GPUs.
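The producer-consumer structure described above can be made concrete with a small sketch. This is not the CudaDMA library itself, only an illustration of the underlying pattern: one warp is specialized as a DMA warp that stages tiles into shared memory while the remaining compute warps consume them, with the inter-warp handshake built from PTX named barriers (`bar.sync`/`bar.arrive`). All identifiers, sizes, and the toy computation are assumptions for illustration.

```cuda
#include <cuda_runtime.h>

#define COMPUTE_THREADS 96   // three compute warps
#define DMA_THREADS     32   // one specialized DMA warp
#define TILE            96   // elements staged per iteration

// Launch with blockDim.x == COMPUTE_THREADS + DMA_THREADS.
__global__ void dma_warp_kernel(const float *in, float *out, int tiles) {
  __shared__ float buf[TILE];
  const bool is_dma = threadIdx.x >= COMPUTE_THREADS;
  const int total = COMPUTE_THREADS + DMA_THREADS;

  for (int t = 0; t < tiles; ++t) {
    if (is_dma) {
      // DMA warp: wait until compute warps signal the buffer is free,
      // refill it, then signal that the data is ready.
      asm volatile("bar.sync 1, %0;" :: "r"(total));
      int lane = threadIdx.x - COMPUTE_THREADS;
      for (int i = lane; i < TILE; i += DMA_THREADS)
        buf[i] = in[t * TILE + i];
      asm volatile("bar.arrive 2, %0;" :: "r"(total));
    } else {
      // Compute warps: signal the buffer is free, wait for the fill,
      // then consume the staged tile (a placeholder computation here).
      asm volatile("bar.arrive 1, %0;" :: "r"(total));
      asm volatile("bar.sync 2, %0;" :: "r"(total));
      out[t * TILE + threadIdx.x] = 2.0f * buf[threadIdx.x];
    }
  }
}
```

Because the DMA warp blocks only on named barriers rather than on `__syncthreads()`, its outstanding loads overlap with computation in the other warps, which is the memory-level-parallelism benefit the abstract claims; CudaDMA wraps this handshake and the transfer loops behind an object-style API.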