skip to main content
10.1145/2737924.2737962acmconferencesArticle/Chapter ViewAbstractPublication PagespldiConference Proceedingsconference-collections
research-article
Public Access

Verification of producer-consumer synchronization in GPU programs

Published:03 June 2015Publication History

ABSTRACT

Previous efforts to formally verify code written for GPUs have focused solely on kernels written within the traditional data-parallel GPU programming model. No previous work has considered the higher performance, but more complex, warp-specialized kernels based on producer-consumer named barriers available on current hardware. In this work we present the first formal operational semantics for named barriers and define what it means for a warp-specialized kernel to be correct. We give algorithms for verifying the correctness of warp-specialized kernels and prove that they are both sound and complete for the most common class of warp-specialized programs. We also present WEFT, a verification tool for checking warp-specialized code. Using WEFT, we discover several non-trivial bugs in production warp-specialized kernels.

References

  1. Parallel thread execution ISA version 4.1. http://docs.nvidia. com/cuda/parallel-thread-execution/index.html, 2014.Google ScholarGoogle Scholar
  2. A. Aiken and D. Gay. Barrier inference. In POPL, pages 342–354, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. J. Alglave, M. Batty, A. F. Donaldson, G. Gopalakrishnan, J. Ketema, D. Poetzl, T. Sorensen, and J. Wickerson. GPU concurrency: Weak behaviours and programming assumptions. In ASPLOS, pages 577– 591, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. E. Bardsley and A. F. Donaldson. Warps and atomics: Beyond barrier synchronization in the verification of GPU kernels. In NFM, pages 230–245, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. E. Bardsley, A. Betts, N. Chong, P. Collingbourne, P. Deligiannis, A. F. Donaldson, J. Ketema, D. Liew, and S. Qadeer. Engineering a static verification tool for GPU kernels. In CAV, pages 226–242, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. M. Bauer, H. Cook, and B. Khailany. CudaDMA: optimizing GPU memory bandwidth via warp specialization. SC ’11, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. M. Bauer, S. Treichler, and A. Aiken. Singe: Leveraging warp specialization for high performance on GPUs. PPoPP ’14, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. Betts, N. Chong, A. F. Donaldson, S. Qadeer, and P. Thomson. GPUVerify: a verifier for GPU kernels. In OOPSLA, pages 113–132, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. J. H. Chen, A. Choudhary, B. de Supinski, M. DeVries, E. R. Hawkes, S. Klasky, W. K. Liao, K. L. Ma, J. Mellor-Crummey, N. Podhorszki, R. Sankaran, S. Shende, and C. S. Yoo. Terascale direct numerical simulations of turbulent combustion using S3D. Computational Science and Discovery, page 015001, 2009.Google ScholarGoogle Scholar
  10. W. Chiang, G. Gopalakrishnan, G. Li, and Z. Rakamaric. Formal analysis of GPU programs with atomics via conflict-directed delaybounding. In NFM, pages 213–228, 2013.Google ScholarGoogle Scholar
  11. N. Chong, A. F. Donaldson, P. H. J. Kelly, J. Ketema, and S. Qadeer. Barrier invariants: a shared state abstraction for the analysis of datadependent GPU kernels. In OOPSLA, pages 605–622, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. N. Chong, A. F. Donaldson, and J. Ketema. A sound and complete abstraction for reasoning about parallel prefix sums. In POPL, pages 397–410, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. P. Collingbourne, C. Cadar, and P. H. J. Kelly. Symbolic crosschecking of floating-point and SIMD code. In EuroSys, pages 315–328, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco. Gpus and the future of parallel computing. IEEE Micro, 31(5):7–17, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Khronos. The OpenCL Specification, Version 1.0. The Khronos OpenCL Working Group, December 2008.Google ScholarGoogle Scholar
  16. L. Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558–565, 1978. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. D.-K. Le, W.-N. Chin, and Y. M. Teo. Verification of static and dynamic barrier synchronization using bounded permissions. In ICFEM, pages 231–248, 2013.Google ScholarGoogle ScholarCross RefCross Ref
  18. A. Leung, M. Gupta, Y. Agarwal, R. Gupta, R. Jhala, and S. Lerner. Verifying GPU kernels by test amplification. In PLDI, pages 383–394, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. G. Li and G. Gopalakrishnan. Scalable smt-based verification of GPU kernel functions. In FSE, pages 187–196, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. G. Li, P. Li, G. Sawaya, G. Gopalakrishnan, I. Ghosh, and S. P. Rajan. GKLEE: concolic verification and test generation for GPUs. In PPOPP, pages 215–224, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. P. Li, G. Li, and G. Gopalakrishnan. Parametric flows: automated behavior equivalencing for symbolic analysis of races in CUDA programs. In SC, page 29, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. NVIDIA. CUDA programming guide 6.0. http://developer. download.nvidia.com/compute/DevZone/docs/html/C/doc/ CUDA_C_Programming_Guide.pdf, August 2014.Google ScholarGoogle Scholar
  23. Oracle. Class phaser. https://docs.oracle.com/javase/7/ docs/api/java/util/concurrent/Phaser.html, 2014.Google ScholarGoogle Scholar
  24. S. G. Parker, J. Bigler, A. Dietrich, H. Friedrich, J. Hoberock, D. Luebke, D. McAllister, M. McGuire, K. Morley, A. Robison, and M. Stich. OptiX: A general purpose ray tracing engine. ACM Transactions on Graphics, August 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. G. Ramalingam. Context-sensitive synchronization-sensitive analysis is undecidable. ACM Trans. Program. Lang. Syst., 22(2):416–430, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Top500. Top 500 supercomputers. http://www.top500.org, 2014.Google ScholarGoogle Scholar

Index Terms

  1. Verification of producer-consumer synchronization in GPU programs

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            PLDI '15: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation
            June 2015
            630 pages
            ISBN:9781450334686
            DOI:10.1145/2737924
            • cover image ACM SIGPLAN Notices
              ACM SIGPLAN Notices  Volume 50, Issue 6
              PLDI '15
              June 2015
              630 pages
              ISSN:0362-1340
              EISSN:1558-1160
              DOI:10.1145/2813885
              • Editor:
              • Andy Gill
              Issue’s Table of Contents

            Copyright © 2015 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 3 June 2015

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • research-article

            Acceptance Rates

            Overall Acceptance Rate406of2,067submissions,20%

            Upcoming Conference

            PLDI '24

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader