research-article

StreamScan: fast scan algorithms for GPUs without global barrier synchronization

Authors:
Shengen Yan

Institute of Software, Chinese Academy of Sciences, Beijing, China

Institute of Software, Chinese Academy of Sciences, Beijing, China
View Profile

,
Guoping Long

Institute of Software, Chinese Academy of Sciences, Beijing, China

Institute of Software, Chinese Academy of Sciences, Beijing, China
View Profile

,
Yunquan Zhang

Institute of Software, Chinese Academy of Sciences, Beijing, China

Institute of Software, Chinese Academy of Sciences, Beijing, China
View Profile

Authors Info & Claims

ACM SIGPLAN Notices Volume 48 Issue 8August 2013pp 229–238https://doi.org/10.1145/2517327.2442539

Published:23 February 2013Publication History

ACM SIGPLAN Notices

Abstract

Scan (also known as prefix sum) is a very useful primitive for various important parallel algorithms, such as sort, BFS, SpMV, compaction and so on. Current state of the art of GPU based scan implementation consists of three consecutive Reduce-Scan-Scan phases. This approach requires at least two global barriers and 3N (N is the problem size) global memory accesses. In this paper we propose StreamScan, a novel approach to implement scan on GPUs with only one computation phase. The main idea is to restrict synchronization to only adjacent workgroups, and thereby eliminating global barrier synchronization completely. The new approach requires only 2N global memory accesses and just one kernel invocation. On top of this we propose two important op-timizations to further boost performance speedups, namely thread grouping to eliminate unnecessary local barriers, and register optimization to expand the on chip problem size. We designed an auto-tuning framework to search the parameter space automatically to generate highly optimized codes for both AMD and Nvidia GPUs. We implemented our technique with OpenCL. Compared with previous fast scan implementations, experimental results not only show promising performance speedups, but also reveal dramatic different optimization tradeoffs between Nvidia and AMD GPU platforms.

References

Blelloch, G. E. Scans as Primitive Parallel Operations. IEEE Trans. Comput, 1989. Google ScholarDigital Library
Blelloch, G.E. Prefix Sums and Their Applications. Synthesis of Parallel Algorithms, 1990.Google Scholar
Iverson, K. E. A Programming Language. AIEE-IRE '62, 1962. Google ScholarDigital Library
Kogge, P. M. and Stone, H. S. A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations. IEEE Trans. Comput, 1973. Google ScholarDigital Library
Brent, R. P. and Kung, H. T. A Regular Layout for Parallel Adders. IEEE Trans. Comput, 1982. Google ScholarDigital Library
D. Merrill, and A. Grimshaw Revisiting Sorting for GPGPU Stream Architectures. Tech. Rep. CS2010-03, Department of Computer Science, University of Virginia, 2010.Google Scholar
Cederman, D. and Tsigas, P. On Sorting and Load Balancingon GPUs. SIGARCH Comput. Archit. News, 2008. Google ScholarDigital Library
D. Merrill and A. Grimshaw. High Performance and Scalable Radix Sorting: A case study of implementing dynamic parallelism for GPU computing. Parallel Processing Letters, 2011.Google Scholar
Cederman, D., and Tsigas, P. GPU-Quicksort: A practical Quicksort algorithm for graphics processors. Journal of Experimental Algorithmics. 2009. Google ScholarDigital Library
H. Peters, O. Schulz-Hildebrandt, and N. Luttenberger. Fast in-place sorting with cuda based on bitonic sort. PPAM 09: Proceedings of the International Conference on Parallel Processing and Applied Mathematics, 2009. Google ScholarDigital Library
N. Satish, M. Harris, and M. Garland. Designing Efficient Sorting Algorithms for Manycore GPUs. In Proc. Int'l Symposium on Parallel and Distributed Processing (IPDPS), 2009. Google ScholarDigital Library
D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU Graph Traversal. Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming ( PPoPP '12), 2012. Google ScholarDigital Library
D. Merrill. Allocation-oriented Algorithm Design with Application to GPU Computing. PhD thesis, University of Virginia, 2011.Google Scholar
Nan Zhang. A Novel Parallel Scan for Multicore Processors and Its Application in Sparse Matrix-Vector Multiplication. Parallel and Distributed Systems, IEEE Transactions, 2012. Google ScholarDigital Library
Markus Billeter, Ola Olsson and Ulf Assarsson. Efficient stream compaction on wide SIMD many-core architectures, Proceedings of the Conference on High Performance Graphics, 2009. Google ScholarDigital Library
Vibhav Vineet, Pawan Harish, Suryakant Patidar, and P. J. Narayanan. Fast minimum spanning tree for large graphs on the GPU. In Proc. of High Performance Graphics, 2009. Google ScholarDigital Library
Zheng Wei, Joseph JaJa. Optimization of linked list prefix computations on multithreaded GPUs using CUDA. In Parallel & Distributed Processing (IPDPS), 2010.Google Scholar
Mark Harris. State of the Art in GPU Data-Parallel Algorithm Primitives. Tech. Rep. GPU Technology Conference, 2010.Google Scholar
M. Harris, J. Owens, S. Sengupta, Y. Zhang, and A. Davidson. CUDPP: CUDA Data Parallel Primitives Library. http://gpgpu.org/developer/cudpp.Google Scholar
D. Merrill. Parallel Scan for Stream Architectures. Technical Report CS2009-14, Department of Computer Science, University of Virginia, 2009.Google Scholar
Y. Dotsenko, N. K. Govindaraju, P.-P. Sloan, C. Boyd, and J. Manferdelli. Fast Scan Algorithms on Graphics Processors. In ICS'08: Proc. 22nd Annual International Conference on Supercomputing, 2008. Google ScholarDigital Library
NVIDIA Corporation. Nvidia Cuda C Programming Guide. http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf, 2012.Google Scholar
S. Sengupta, M. Harris, and M. Garland. Efficient Parallel Scan Algorithms for GPUs. NVIDIA Tech. Rep, 2008.Google Scholar
Jens Breitbart. Static GPU threads and an Improved Scan Algorithm. In Euro-Par 2010 Work-shop Proceedings, Lecture Notes in Computer Science, 2010. Google ScholarDigital Library
Sengupta S., Lefohn A. and Owens, J. A work-efficient step-efficient prefix-sum algorithm. Proceedings of the Workshop on Edge Computing Using New Commodity Architectures, 2006.Google Scholar
S. Xiao and W. chun Feng. Inter-block GPU Communication via Fast Barrier Synchronization. In Parallel & Distributed Processing (IPDPS), 2010.Google Scholar
Horn D. Stream Reduction Operations for GPGPU Applications. In GPU Gems 2, Pharr M., (Ed.).Addison Wesley, ch. 36, pp. 573--589, 2005.Google Scholar
Harris M., Sengupta S. and Owens J. D. Parallel Prefix sum (scan) with CUDA. In GPU Gems 3, Nguyen H., (Ed.).Addison Wesley, ch. 39, 2007.Google Scholar
Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, and Shengen Yan. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. Euro-Par 2012 Parallel Processing, 2012. Google ScholarDigital Library
Khronos OpenCL Working Group, The OpenCL Specification Version: 1.2, 2012.Google Scholar

Index Terms

StreamScan: fast scan algorithms for GPUs without global barrier synchronization
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages

Recommendations

StreamScan: fast scan algorithms for GPUs without global barrier synchronization
PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming

Scan (also known as prefix sum) is a very useful primitive for various important parallel algorithms, such as sort, BFS, SpMV, compaction and so on. Current state of the art of GPU based scan implementation consists of three consecutive Reduce-Scan-Scan ...
Read More
A Cross-Platform SpMV Framework on Many-Core Architectures

Sparse Matrix-Vector multiplication (SpMV) is a key operation in engineering and scientific computing. Although the previous work has shown impressive progress in optimizing SpMV on many-core architectures, load imbalance and high memory bandwidth ...
Read More
yaSpMV: yet another SpMV framework on GPUs
PPoPP '14: Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming

SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although the previous work ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM SIGPLAN Notices Volume 48, Issue 8
PPoPP '13
August 2013
309 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2517327
Issue’s Table of Contents
PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
February 2013
332 pages
ISBN:9781450319225
DOI:10.1145/2442516
General Chairs:
Alex Nicolau
University of California, Irvine, USA
,
Xiaowei Shen
IBM Research, China
,
Program Chairs:
Saman Amarasinghe
Massachusetts Institute of Technology, USA
,
Richard Vuduc
Georgia Institute of Technology, USA
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 February 2013
Check for updates
Author Tags
cuda
gpu
opencl
parallel algorithms
prefix-sum
scan
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 75
  Total Citations
  View Citations
- 1,224
  Total Downloads
- Downloads (Last 12 months)67
- Downloads (Last 6 weeks)16
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

StreamScan: fast scan algorithms for GPUs without global barrier synchronization

ACM SIGPLAN Notices

Abstract

References

Cited By

Index Terms

Recommendations

StreamScan: fast scan algorithms for GPUs without global barrier synchronization

A Cross-Platform SpMV Framework on Many-Core Architectures

yaSpMV: yet another SpMV framework on GPUs