skip to main content
research-article

StreamScan: fast scan algorithms for GPUs without global barrier synchronization

Authors Info & Claims
Published:23 February 2013Publication History
Skip Abstract Section

Abstract

Scan (also known as prefix sum) is a very useful primitive for various important parallel algorithms, such as sort, BFS, SpMV, compaction and so on. Current state of the art of GPU based scan implementation consists of three consecutive Reduce-Scan-Scan phases. This approach requires at least two global barriers and 3N (N is the problem size) global memory accesses. In this paper we propose StreamScan, a novel approach to implement scan on GPUs with only one computation phase. The main idea is to restrict synchronization to only adjacent workgroups, and thereby eliminating global barrier synchronization completely. The new approach requires only 2N global memory accesses and just one kernel invocation. On top of this we propose two important op-timizations to further boost performance speedups, namely thread grouping to eliminate unnecessary local barriers, and register optimization to expand the on chip problem size. We designed an auto-tuning framework to search the parameter space automatically to generate highly optimized codes for both AMD and Nvidia GPUs. We implemented our technique with OpenCL. Compared with previous fast scan implementations, experimental results not only show promising performance speedups, but also reveal dramatic different optimization tradeoffs between Nvidia and AMD GPU platforms.

References

  1. Blelloch, G. E. Scans as Primitive Parallel Operations. IEEE Trans. Comput, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Blelloch, G.E. Prefix Sums and Their Applications. Synthesis of Parallel Algorithms, 1990.Google ScholarGoogle Scholar
  3. Iverson, K. E. A Programming Language. AIEE-IRE '62, 1962. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Kogge, P. M. and Stone, H. S. A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations. IEEE Trans. Comput, 1973. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Brent, R. P. and Kung, H. T. A Regular Layout for Parallel Adders. IEEE Trans. Comput, 1982. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. D. Merrill, and A. Grimshaw Revisiting Sorting for GPGPU Stream Architectures. Tech. Rep. CS2010-03, Department of Computer Science, University of Virginia, 2010.Google ScholarGoogle Scholar
  7. Cederman, D. and Tsigas, P. On Sorting and Load Balancingon GPUs. SIGARCH Comput. Archit. News, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. D. Merrill and A. Grimshaw. High Performance and Scalable Radix Sorting: A case study of implementing dynamic parallelism for GPU computing. Parallel Processing Letters, 2011.Google ScholarGoogle Scholar
  9. Cederman, D., and Tsigas, P. GPU-Quicksort: A practical Quicksort algorithm for graphics processors. Journal of Experimental Algorithmics. 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. H. Peters, O. Schulz-Hildebrandt, and N. Luttenberger. Fast in-place sorting with cuda based on bitonic sort. PPAM 09: Proceedings of the International Conference on Parallel Processing and Applied Mathematics, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. N. Satish, M. Harris, and M. Garland. Designing Efficient Sorting Algorithms for Manycore GPUs. In Proc. Int'l Symposium on Parallel and Distributed Processing (IPDPS), 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU Graph Traversal. Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming ( PPoPP '12), 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. D. Merrill. Allocation-oriented Algorithm Design with Application to GPU Computing. PhD thesis, University of Virginia, 2011.Google ScholarGoogle Scholar
  14. Nan Zhang. A Novel Parallel Scan for Multicore Processors and Its Application in Sparse Matrix-Vector Multiplication. Parallel and Distributed Systems, IEEE Transactions, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Markus Billeter, Ola Olsson and Ulf Assarsson. Efficient stream compaction on wide SIMD many-core architectures, Proceedings of the Conference on High Performance Graphics, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Vibhav Vineet, Pawan Harish, Suryakant Patidar, and P. J. Narayanan. Fast minimum spanning tree for large graphs on the GPU. In Proc. of High Performance Graphics, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Zheng Wei, Joseph JaJa. Optimization of linked list prefix computations on multithreaded GPUs using CUDA. In Parallel & Distributed Processing (IPDPS), 2010.Google ScholarGoogle Scholar
  18. Mark Harris. State of the Art in GPU Data-Parallel Algorithm Primitives. Tech. Rep. GPU Technology Conference, 2010.Google ScholarGoogle Scholar
  19. M. Harris, J. Owens, S. Sengupta, Y. Zhang, and A. Davidson. CUDPP: CUDA Data Parallel Primitives Library. http://gpgpu.org/developer/cudpp.Google ScholarGoogle Scholar
  20. D. Merrill. Parallel Scan for Stream Architectures. Technical Report CS2009-14, Department of Computer Science, University of Virginia, 2009.Google ScholarGoogle Scholar
  21. Y. Dotsenko, N. K. Govindaraju, P.-P. Sloan, C. Boyd, and J. Manferdelli. Fast Scan Algorithms on Graphics Processors. In ICS'08: Proc. 22nd Annual International Conference on Supercomputing, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. NVIDIA Corporation. Nvidia Cuda C Programming Guide. http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf, 2012.Google ScholarGoogle Scholar
  23. S. Sengupta, M. Harris, and M. Garland. Efficient Parallel Scan Algorithms for GPUs. NVIDIA Tech. Rep, 2008.Google ScholarGoogle Scholar
  24. Jens Breitbart. Static GPU threads and an Improved Scan Algorithm. In Euro-Par 2010 Work-shop Proceedings, Lecture Notes in Computer Science, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Sengupta S., Lefohn A. and Owens, J. A work-efficient step-efficient prefix-sum algorithm. Proceedings of the Workshop on Edge Computing Using New Commodity Architectures, 2006.Google ScholarGoogle Scholar
  26. S. Xiao and W. chun Feng. Inter-block GPU Communication via Fast Barrier Synchronization. In Parallel & Distributed Processing (IPDPS), 2010.Google ScholarGoogle Scholar
  27. Horn D. Stream Reduction Operations for GPGPU Applications. In GPU Gems 2, Pharr M., (Ed.).Addison Wesley, ch. 36, pp. 573--589, 2005.Google ScholarGoogle Scholar
  28. Harris M., Sengupta S. and Owens J. D. Parallel Prefix sum (scan) with CUDA. In GPU Gems 3, Nguyen H., (Ed.).Addison Wesley, ch. 39, 2007.Google ScholarGoogle Scholar
  29. Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, and Shengen Yan. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. Euro-Par 2012 Parallel Processing, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Khronos OpenCL Working Group, The OpenCL Specification Version: 1.2, 2012.Google ScholarGoogle Scholar

Index Terms

  1. StreamScan: fast scan algorithms for GPUs without global barrier synchronization

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM SIGPLAN Notices
        ACM SIGPLAN Notices  Volume 48, Issue 8
        PPoPP '13
        August 2013
        309 pages
        ISSN:0362-1340
        EISSN:1558-1160
        DOI:10.1145/2517327
        Issue’s Table of Contents
        • cover image ACM Conferences
          PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
          February 2013
          332 pages
          ISBN:9781450319225
          DOI:10.1145/2442516

        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 23 February 2013

        Check for updates

        Qualifiers

        • research-article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader