ABSTRACT
Prefix sums are key building blocks in the implementation of many concurrent software applications, and recently much work has gone into efficiently implementing prefix sums to run on massively parallel graphics processing units (GPUs). Because they lie at the heart of many GPU-accelerated applications, the correctness of prefix sum implementations is of prime importance.
We introduce a novel abstraction, the interval of summations, that allows scalable reasoning about implementations of prefix sums. We present this abstraction as a monoid, and prove a soundness and completeness result showing that a generic sequential prefix sum implementation is correct for an array of length $n$ if and only if it computes the correct result for a specific test case when instantiated with the interval of summations monoid. This allows correctness to be established by running a single test where the input and result require O(n lg(n)) space. This improves upon an existing result by Sheeran where the input requires O(n lg(n)) space and the result O(n2 \lg(n)) space, and is more feasible for large n than a method by Voigtlaender that uses O(n) space for the input and result but requires running O(n2) tests. We then extend our abstraction and results to the context of data-parallel programs, developing an automated verification method for GPU implementations of prefix sums. Our method uses static verification to prove that a generic prefix sum implementation is data race-free, after which functional correctness of the implementation can be determined by running a single test case under the interval of summations abstraction.
We present an experimental evaluation using four different prefix sum algorithms, showing that our method is highly automatic, scales to large thread counts, and significantly outperforms Voigtlaender's method when applied to large arrays.
Supplemental Material
- A. Betts, N. Chong, A. F. Donaldson, S. Qadeer, and P. Thomson. GPUVerify: a verifier for GPU kernels. In phOOPSLA, pages 113--132, 2012. Google ScholarDigital Library
- M. Billeter, O. Olsson, and U. Assarsson. Efficient stream compaction on wide SIMD many-core architectures. In phHPG, pages 159--166, 2009. Google ScholarDigital Library
- G. E. Blelloch. Scans as primitive parallel operations. phIEEE Trans. Comput., 38 (11): 1526--1538, 1989. Google ScholarDigital Library
- G. E. Blelloch. Prefix sums and their applications. In J. H. Reif, editor, phSynthesis of Parallel Algorithms. Morgan Kaufmann, 1990.Google Scholar
- R. P. Brent and H.-T. Kung. A regular layout for parallel adders. phIEEE Trans. Computers, 31 (3): 260--264, 1982. Google ScholarDigital Library
- C. Cadar, D. Dunbar, and D. R. Engler. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In phOSDI, pages 209--224, 2008. Google ScholarDigital Library
- S. Che et al. Rodinia: A benchmark suite for heterogeneous computing. In phWorkload Characterization, pages 44--54, 2009. Google ScholarDigital Library
- N. Chong, A. F. Donaldson, P. H. Kelly, J. Ketema, and S. Qadeer. Barrier invariants: A shared state abstraction for the analysis of data-dependent GPU kernels. In phOOPSLA, pages 605--622, 2013. Google ScholarDigital Library
- P. Collingbourne, , C. Cadar, and P. H. J. Kelly. Symbolic testing of OpenCL code. In phHVC'11, pages 203--218, 2012. Google ScholarDigital Library
- P. Collingbourne, A. F. Donaldson, J. Ketema, and S. Qadeer. Interleaving and lock-step semantics for analysis and verification of GPU kernels. In phESOP, pages 270--289, 2013. Google ScholarDigital Library
- P. Cousot and R. Cousot. Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints. In phPOPL, pages 238--252, 1977. Google ScholarDigital Library
- A. Danalis et al. The scalable heterogeneous computing (SHOC) benchmark suite. In phGPGPU 2010, pages 63--74, 2010. Google ScholarDigital Library
- io\uglu et al.(1990)E\ugecio\uglu, Gallopoulos, and Koç}PolynomialsO. E\ugecio\uglu, E. Gallopoulos, and C. Koç. A parallel method for fast and practical high-order Newton interpolation. phBIT Numerical Mathematics, 30: 268--288, 1990. Google ScholarDigital Library
- W. Gropp, E. Lusk, and A. Skjellum. phUsing MPI: Portable Parallel Programming with the Message Passing Interface. MIT Press, 2nd edition, 1999. Google ScholarDigital Library
- A. Habermaier and A. Knapp. On the correctness of the SIMT execution model of GPUs. In phESOP, pages 316--335, 2012. Google ScholarDigital Library
- M. Harris, S. Sengupta, and J. Owens. Parallel prefix sum (scan) with CUDA. In H. Nguyen, editor, phGPU Gems 3. Addison-Wesley, 2007.Google Scholar
- R. Hinze. An algebra of scans. In phMPC, pages 186--210, 2004.Google Scholar
- 'c(2013)}BYTECODEM. Huisman and M. Mihel\vcić. Specification and verification of GPGPU programs using permission-based separation logic. In phBYTECODE, 2013.Google Scholar
- 012)}OpenCLKhronos OpenCL Working Group. The OpenCL specification, version 1.2, 2012.Google Scholar
- D. E. Knuth. phThe Art of Computer Programming, volume 3. Addison-Wesley, 2nd edition, 1998. Google ScholarDigital Library
- P. M. Kogge and H. S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations. phIEEE Trans. Computers, C-22 (8): 786--793, 1973. Google ScholarDigital Library
- R. E. Ladner and M. J. Fischer. Parallel prefix computation. phJ. ACM, 27 (4): 831--838, 1980. Google ScholarDigital Library
- A. Leung, M. Gupta, Y. Agarwal, et al. Verifying GPU kernels by test amplification. In phPLDI, pages 383--394, 2012. Google ScholarDigital Library
- G. Li and G. Gopalakrishnan. Scalable SMT-based verification of GPU kernel functions. In phFSE, pages 187--196, 2010. Google ScholarDigital Library
- G. Li, P. Li, G. Sawaya, G. Gopalakrishnan, I. Ghosh, and S. P. Rajan. GKLEE: concolic verification and test generation for GPUs. In phPPoPP, pages 215--224, 2012. Google ScholarDigital Library
- D. Merrill and A. Grimshaw. Parallel scan for stream architectures. Technical Report CX2009--14, Department of Computer Science, University of Virginia, 2009.Google Scholar
- 012)}CUDANVIDIA. CUDA C programming guide, version 5.0, 2012.Google Scholar
- B. C. Pierce. phTypes and Programming Languages. MIT Press, 2002. Google ScholarDigital Library
- (2006)}DBLP:conf/pvm/SandersT06P. Sanders and J. L. Tr\"aff. Parallel prefix (scan) algorithms for MPI. In phPVM/MPI, pages 22--29, 2006.Google Scholar
- N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In phIPDPS, pages 1--10, 2009. Google ScholarDigital Library
- S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In phGH, pages 97--106, 2007. Google ScholarDigital Library
- I. Sergeev. On the complexity of parallel prefix circuits. Technical Report TR13-041, Electronic Colloquium on Computational Complexity, 2013.Google Scholar
- M. Sheeran. Functional and dynamic programming in the design of parallel prefix networks. phJ. of Funct. Program., 21 (1): 59--114, 2011. Google ScholarDigital Library
- J. Sklansky. Conditional-sum addition logic. phIRE Trans. Electronic Computers, EC-9: 226--231, 1960.Google Scholar
- B. So, A. M. Ghuloum, and Y. Wu. Optimizing data parallel operations on many-core platforms. In phIn First Workshop on Software Tools for Multi-Core Systems (STMCS), 2006.Google Scholar
- H. S. Stone. Parallel processing with the perfect shuffle. phIEEE Trans. Comput., 20 (2): 153--161, 1971. Google ScholarDigital Library
- J. Stratton et al. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Technical Report IMPACT-12-01, University of Illinois at Urbana-Champaign, 2012.Google Scholar
- er(2008)}Voigtlaender08J. Voigtl\"ander. Much ado about two (pearl): a pearl on parallel prefix computation. In phPOPL, pages 29--35, 2008. Google ScholarDigital Library
Index Terms
- A sound and complete abstraction for reasoning about parallel prefix sums
Recommendations
Higher-order and tuple-based massively-parallel prefix sums
PLDI '16Prefix sums are an important parallel primitive, especially in massively-parallel programs. This paper discusses two orthogonal generalizations thereof, which we call higher-order and tuple-based prefix sums. Moreover, it describes and evaluates SAM, a ...
A sound and complete abstraction for reasoning about parallel prefix sums
POPL '14Prefix sums are key building blocks in the implementation of many concurrent software applications, and recently much work has gone into efficiently implementing prefix sums to run on massively parallel graphics processing units (GPUs). Because they lie ...
Abstraction and Refinement: Towards Scalable and Exact Verification of Neural Networks
As a new programming paradigm, deep neural networks (DNNs) have been increasingly deployed in practice, but the lack of robustness hinders their applications in safety-critical domains. While there are techniques for verifying DNNs with formal guarantees, ...
Comments