Abstract
Scan (also known as prefix sum) is a very useful primitive for various important parallel algorithms, such as sort, BFS, SpMV, compaction and so on. Current state of the art of GPU based scan implementation consists of three consecutive Reduce-Scan-Scan phases. This approach requires at least two global barriers and 3N (N is the problem size) global memory accesses. In this paper we propose StreamScan, a novel approach to implement scan on GPUs with only one computation phase. The main idea is to restrict synchronization to only adjacent workgroups, and thereby eliminating global barrier synchronization completely. The new approach requires only 2N global memory accesses and just one kernel invocation. On top of this we propose two important op-timizations to further boost performance speedups, namely thread grouping to eliminate unnecessary local barriers, and register optimization to expand the on chip problem size. We designed an auto-tuning framework to search the parameter space automatically to generate highly optimized codes for both AMD and Nvidia GPUs. We implemented our technique with OpenCL. Compared with previous fast scan implementations, experimental results not only show promising performance speedups, but also reveal dramatic different optimization tradeoffs between Nvidia and AMD GPU platforms.
- Blelloch, G. E. Scans as Primitive Parallel Operations. IEEE Trans. Comput, 1989. Google ScholarDigital Library
- Blelloch, G.E. Prefix Sums and Their Applications. Synthesis of Parallel Algorithms, 1990.Google Scholar
- Iverson, K. E. A Programming Language. AIEE-IRE '62, 1962. Google ScholarDigital Library
- Kogge, P. M. and Stone, H. S. A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations. IEEE Trans. Comput, 1973. Google ScholarDigital Library
- Brent, R. P. and Kung, H. T. A Regular Layout for Parallel Adders. IEEE Trans. Comput, 1982. Google ScholarDigital Library
- D. Merrill, and A. Grimshaw Revisiting Sorting for GPGPU Stream Architectures. Tech. Rep. CS2010-03, Department of Computer Science, University of Virginia, 2010.Google Scholar
- Cederman, D. and Tsigas, P. On Sorting and Load Balancingon GPUs. SIGARCH Comput. Archit. News, 2008. Google ScholarDigital Library
- D. Merrill and A. Grimshaw. High Performance and Scalable Radix Sorting: A case study of implementing dynamic parallelism for GPU computing. Parallel Processing Letters, 2011.Google Scholar
- Cederman, D., and Tsigas, P. GPU-Quicksort: A practical Quicksort algorithm for graphics processors. Journal of Experimental Algorithmics. 2009. Google ScholarDigital Library
- H. Peters, O. Schulz-Hildebrandt, and N. Luttenberger. Fast in-place sorting with cuda based on bitonic sort. PPAM 09: Proceedings of the International Conference on Parallel Processing and Applied Mathematics, 2009. Google ScholarDigital Library
- N. Satish, M. Harris, and M. Garland. Designing Efficient Sorting Algorithms for Manycore GPUs. In Proc. Int'l Symposium on Parallel and Distributed Processing (IPDPS), 2009. Google ScholarDigital Library
- D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU Graph Traversal. Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming ( PPoPP '12), 2012. Google ScholarDigital Library
- D. Merrill. Allocation-oriented Algorithm Design with Application to GPU Computing. PhD thesis, University of Virginia, 2011.Google Scholar
- Nan Zhang. A Novel Parallel Scan for Multicore Processors and Its Application in Sparse Matrix-Vector Multiplication. Parallel and Distributed Systems, IEEE Transactions, 2012. Google ScholarDigital Library
- Markus Billeter, Ola Olsson and Ulf Assarsson. Efficient stream compaction on wide SIMD many-core architectures, Proceedings of the Conference on High Performance Graphics, 2009. Google ScholarDigital Library
- Vibhav Vineet, Pawan Harish, Suryakant Patidar, and P. J. Narayanan. Fast minimum spanning tree for large graphs on the GPU. In Proc. of High Performance Graphics, 2009. Google ScholarDigital Library
- Zheng Wei, Joseph JaJa. Optimization of linked list prefix computations on multithreaded GPUs using CUDA. In Parallel & Distributed Processing (IPDPS), 2010.Google Scholar
- Mark Harris. State of the Art in GPU Data-Parallel Algorithm Primitives. Tech. Rep. GPU Technology Conference, 2010.Google Scholar
- M. Harris, J. Owens, S. Sengupta, Y. Zhang, and A. Davidson. CUDPP: CUDA Data Parallel Primitives Library. http://gpgpu.org/developer/cudpp.Google Scholar
- D. Merrill. Parallel Scan for Stream Architectures. Technical Report CS2009-14, Department of Computer Science, University of Virginia, 2009.Google Scholar
- Y. Dotsenko, N. K. Govindaraju, P.-P. Sloan, C. Boyd, and J. Manferdelli. Fast Scan Algorithms on Graphics Processors. In ICS'08: Proc. 22nd Annual International Conference on Supercomputing, 2008. Google ScholarDigital Library
- NVIDIA Corporation. Nvidia Cuda C Programming Guide. http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf, 2012.Google Scholar
- S. Sengupta, M. Harris, and M. Garland. Efficient Parallel Scan Algorithms for GPUs. NVIDIA Tech. Rep, 2008.Google Scholar
- Jens Breitbart. Static GPU threads and an Improved Scan Algorithm. In Euro-Par 2010 Work-shop Proceedings, Lecture Notes in Computer Science, 2010. Google ScholarDigital Library
- Sengupta S., Lefohn A. and Owens, J. A work-efficient step-efficient prefix-sum algorithm. Proceedings of the Workshop on Edge Computing Using New Commodity Architectures, 2006.Google Scholar
- S. Xiao and W. chun Feng. Inter-block GPU Communication via Fast Barrier Synchronization. In Parallel & Distributed Processing (IPDPS), 2010.Google Scholar
- Horn D. Stream Reduction Operations for GPGPU Applications. In GPU Gems 2, Pharr M., (Ed.).Addison Wesley, ch. 36, pp. 573--589, 2005.Google Scholar
- Harris M., Sengupta S. and Owens J. D. Parallel Prefix sum (scan) with CUDA. In GPU Gems 3, Nguyen H., (Ed.).Addison Wesley, ch. 39, 2007.Google Scholar
- Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, and Shengen Yan. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. Euro-Par 2012 Parallel Processing, 2012. Google ScholarDigital Library
- Khronos OpenCL Working Group, The OpenCL Specification Version: 1.2, 2012.Google Scholar
Index Terms
- StreamScan: fast scan algorithms for GPUs without global barrier synchronization
Recommendations
StreamScan: fast scan algorithms for GPUs without global barrier synchronization
PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programmingScan (also known as prefix sum) is a very useful primitive for various important parallel algorithms, such as sort, BFS, SpMV, compaction and so on. Current state of the art of GPU based scan implementation consists of three consecutive Reduce-Scan-Scan ...
A Cross-Platform SpMV Framework on Many-Core Architectures
Sparse Matrix-Vector multiplication (SpMV) is a key operation in engineering and scientific computing. Although the previous work has shown impressive progress in optimizing SpMV on many-core architectures, load imbalance and high memory bandwidth ...
yaSpMV: yet another SpMV framework on GPUs
PPoPP '14: Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programmingSpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although the previous work ...
Comments