skip to main content
10.1145/2535838.2535882acmconferencesArticle/Chapter ViewAbstractPublication PagespoplConference Proceedingsconference-collections
research-article

A sound and complete abstraction for reasoning about parallel prefix sums

Published:08 January 2014Publication History

ABSTRACT

Prefix sums are key building blocks in the implementation of many concurrent software applications, and recently much work has gone into efficiently implementing prefix sums to run on massively parallel graphics processing units (GPUs). Because they lie at the heart of many GPU-accelerated applications, the correctness of prefix sum implementations is of prime importance.

We introduce a novel abstraction, the interval of summations, that allows scalable reasoning about implementations of prefix sums. We present this abstraction as a monoid, and prove a soundness and completeness result showing that a generic sequential prefix sum implementation is correct for an array of length $n$ if and only if it computes the correct result for a specific test case when instantiated with the interval of summations monoid. This allows correctness to be established by running a single test where the input and result require O(n lg(n)) space. This improves upon an existing result by Sheeran where the input requires O(n lg(n)) space and the result O(n2 \lg(n)) space, and is more feasible for large n than a method by Voigtlaender that uses O(n) space for the input and result but requires running O(n2) tests. We then extend our abstraction and results to the context of data-parallel programs, developing an automated verification method for GPU implementations of prefix sums. Our method uses static verification to prove that a generic prefix sum implementation is data race-free, after which functional correctness of the implementation can be determined by running a single test case under the interval of summations abstraction.

We present an experimental evaluation using four different prefix sum algorithms, showing that our method is highly automatic, scales to large thread counts, and significantly outperforms Voigtlaender's method when applied to large arrays.

Skip Supplemental Material Section

Supplemental Material

d2_left_t9.mp4

mp4

252.7 MB

References

  1. A. Betts, N. Chong, A. F. Donaldson, S. Qadeer, and P. Thomson. GPUVerify: a verifier for GPU kernels. In phOOPSLA, pages 113--132, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. M. Billeter, O. Olsson, and U. Assarsson. Efficient stream compaction on wide SIMD many-core architectures. In phHPG, pages 159--166, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. G. E. Blelloch. Scans as primitive parallel operations. phIEEE Trans. Comput., 38 (11): 1526--1538, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. G. E. Blelloch. Prefix sums and their applications. In J. H. Reif, editor, phSynthesis of Parallel Algorithms. Morgan Kaufmann, 1990.Google ScholarGoogle Scholar
  5. R. P. Brent and H.-T. Kung. A regular layout for parallel adders. phIEEE Trans. Computers, 31 (3): 260--264, 1982. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. C. Cadar, D. Dunbar, and D. R. Engler. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In phOSDI, pages 209--224, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S. Che et al. Rodinia: A benchmark suite for heterogeneous computing. In phWorkload Characterization, pages 44--54, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. N. Chong, A. F. Donaldson, P. H. Kelly, J. Ketema, and S. Qadeer. Barrier invariants: A shared state abstraction for the analysis of data-dependent GPU kernels. In phOOPSLA, pages 605--622, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. P. Collingbourne, , C. Cadar, and P. H. J. Kelly. Symbolic testing of OpenCL code. In phHVC'11, pages 203--218, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. P. Collingbourne, A. F. Donaldson, J. Ketema, and S. Qadeer. Interleaving and lock-step semantics for analysis and verification of GPU kernels. In phESOP, pages 270--289, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. P. Cousot and R. Cousot. Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints. In phPOPL, pages 238--252, 1977. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. A. Danalis et al. The scalable heterogeneous computing (SHOC) benchmark suite. In phGPGPU 2010, pages 63--74, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. io\uglu et al.(1990)E\ugecio\uglu, Gallopoulos, and Koç}PolynomialsO. E\ugecio\uglu, E. Gallopoulos, and C. Koç. A parallel method for fast and practical high-order Newton interpolation. phBIT Numerical Mathematics, 30: 268--288, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. W. Gropp, E. Lusk, and A. Skjellum. phUsing MPI: Portable Parallel Programming with the Message Passing Interface. MIT Press, 2nd edition, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. A. Habermaier and A. Knapp. On the correctness of the SIMT execution model of GPUs. In phESOP, pages 316--335, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. M. Harris, S. Sengupta, and J. Owens. Parallel prefix sum (scan) with CUDA. In H. Nguyen, editor, phGPU Gems 3. Addison-Wesley, 2007.Google ScholarGoogle Scholar
  17. R. Hinze. An algebra of scans. In phMPC, pages 186--210, 2004.Google ScholarGoogle Scholar
  18. 'c(2013)}BYTECODEM. Huisman and M. Mihel\vcić. Specification and verification of GPGPU programs using permission-based separation logic. In phBYTECODE, 2013.Google ScholarGoogle Scholar
  19. 012)}OpenCLKhronos OpenCL Working Group. The OpenCL specification, version 1.2, 2012.Google ScholarGoogle Scholar
  20. D. E. Knuth. phThe Art of Computer Programming, volume 3. Addison-Wesley, 2nd edition, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. P. M. Kogge and H. S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations. phIEEE Trans. Computers, C-22 (8): 786--793, 1973. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. R. E. Ladner and M. J. Fischer. Parallel prefix computation. phJ. ACM, 27 (4): 831--838, 1980. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. A. Leung, M. Gupta, Y. Agarwal, et al. Verifying GPU kernels by test amplification. In phPLDI, pages 383--394, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. G. Li and G. Gopalakrishnan. Scalable SMT-based verification of GPU kernel functions. In phFSE, pages 187--196, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. G. Li, P. Li, G. Sawaya, G. Gopalakrishnan, I. Ghosh, and S. P. Rajan. GKLEE: concolic verification and test generation for GPUs. In phPPoPP, pages 215--224, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. D. Merrill and A. Grimshaw. Parallel scan for stream architectures. Technical Report CX2009--14, Department of Computer Science, University of Virginia, 2009.Google ScholarGoogle Scholar
  27. 012)}CUDANVIDIA. CUDA C programming guide, version 5.0, 2012.Google ScholarGoogle Scholar
  28. B. C. Pierce. phTypes and Programming Languages. MIT Press, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. (2006)}DBLP:conf/pvm/SandersT06P. Sanders and J. L. Tr\"aff. Parallel prefix (scan) algorithms for MPI. In phPVM/MPI, pages 22--29, 2006.Google ScholarGoogle Scholar
  30. N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In phIPDPS, pages 1--10, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In phGH, pages 97--106, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. I. Sergeev. On the complexity of parallel prefix circuits. Technical Report TR13-041, Electronic Colloquium on Computational Complexity, 2013.Google ScholarGoogle Scholar
  33. M. Sheeran. Functional and dynamic programming in the design of parallel prefix networks. phJ. of Funct. Program., 21 (1): 59--114, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. J. Sklansky. Conditional-sum addition logic. phIRE Trans. Electronic Computers, EC-9: 226--231, 1960.Google ScholarGoogle Scholar
  35. B. So, A. M. Ghuloum, and Y. Wu. Optimizing data parallel operations on many-core platforms. In phIn First Workshop on Software Tools for Multi-Core Systems (STMCS), 2006.Google ScholarGoogle Scholar
  36. H. S. Stone. Parallel processing with the perfect shuffle. phIEEE Trans. Comput., 20 (2): 153--161, 1971. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. J. Stratton et al. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Technical Report IMPACT-12-01, University of Illinois at Urbana-Champaign, 2012.Google ScholarGoogle Scholar
  38. er(2008)}Voigtlaender08J. Voigtl\"ander. Much ado about two (pearl): a pearl on parallel prefix computation. In phPOPL, pages 29--35, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. A sound and complete abstraction for reasoning about parallel prefix sums

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        POPL '14: Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
        January 2014
        702 pages
        ISBN:9781450325448
        DOI:10.1145/2535838

        Copyright © 2014 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 8 January 2014

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        POPL '14 Paper Acceptance Rate51of220submissions,23%Overall Acceptance Rate824of4,130submissions,20%

        Upcoming Conference

        POPL '25

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader