ABSTRACT
Concurrency is pervasive and perplexing, particularly on graphics processing units (GPUs). Current specifications of languages and hardware are inconclusive; thus programmers often rely on folklore assumptions when writing software.
To remedy this state of affairs, we conducted a large empirical study of the concurrent behaviour of deployed GPUs. Armed with litmus tests (i.e. short concurrent programs), we questioned the assumptions in programming guides and vendor documentation about the guarantees provided by hardware. We developed a tool to generate thousands of litmus tests and run them under stressful workloads. We observed a litany of previously elusive weak behaviours, and exposed folklore beliefs about GPU programming---often supported by official tutorials---as false.
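To make the idea of a weak behaviour concrete, the sketch below (ours, not the paper's tooling) encodes the classic message-passing litmus test: thread 0 writes `data` then `flag`; thread 1 reads `flag` then `data`. Enumerating every sequentially consistent interleaving shows that the outcome "flag seen set, but data still stale" can never arise under sequential consistency, so observing it on hardware witnesses a weak behaviour.

```python
from itertools import chain

# Message-passing (MP) litmus test:
#   T0: W data=1 ; W flag=1        T1: R flag ; R data
# Weak outcome: T1 sees flag==1 but data==0.
T0 = [("W", "data", 1), ("W", "flag", 1)]
T1 = [("R", "flag"), ("R", "data")]

def interleavings(a, b):
    # All schedules that keep each thread's own program order.
    if not a:
        yield list(b); return
    if not b:
        yield list(a); return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    # Execute one schedule against a single shared memory.
    mem = {"data": 0, "flag": 0}
    regs = []
    for op in schedule:
        if op[0] == "W":
            mem[op[1]] = op[2]
        else:
            regs.append(mem[op[1]])
    return tuple(regs)  # (r_flag, r_data), in T1's program order

sc_outcomes = {run(s) for s in interleavings(T0, T1)}
print(sorted(sc_outcomes))  # [(0, 0), (0, 1), (1, 1)]
# (1, 0) is absent: under sequential consistency, seeing the flag set
# guarantees the data write is visible. A litmus-testing campaign looks
# for exactly such forbidden-under-SC outcomes on real hardware.
```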
As a way forward, we propose a model of Nvidia GPU hardware, which correctly models every behaviour witnessed in our experiments. The model is a variant of SPARC Relaxed Memory Order (RMO), structured following the GPU concurrency hierarchy.
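The axiomatic flavour of an RMO-style model can be illustrated with a toy check (ours, for intuition only; event names and the relation encoding are not the paper's): an execution is forbidden when the program order the model preserves, together with the reads-from and from-read relations, forms a cycle. For the weak message-passing outcome (flag observed set, data observed stale), SC preserves all of program order and forbids the outcome, while an RMO-like model, which preserves only dependency- or fence-ordered pairs, allows it.

```python
# Events of the weak MP outcome:
#   a: T0 writes data=1   b: T0 writes flag=1
#   c: T1 reads flag=1    d: T1 reads data=0 (stale)
po = {("a", "b"), ("c", "d")}  # per-thread program order
rf = {("b", "c")}              # reads-from: c reads the value b wrote
fr = {("d", "a")}              # from-read: d reads the value a overwrites

def has_cycle(edges):
    # Cycle detection over a relation given as a set of edges.
    graph = {}
    for s, t in edges:
        graph.setdefault(s, []).append(t)
    def reaches(src, dst, seen):
        return any(t == dst or (t not in seen and reaches(t, dst, seen | {t}))
                   for t in graph.get(src, []))
    return any(reaches(n, n, set()) for n in graph)

# SC preserves all of po: the cycle a -> b -> c -> d -> a forbids the outcome.
sc_forbidden = has_cycle(po | rf | fr)
# An RMO-like model preserves only dependency/fence-ordered program-order
# pairs; the MP test has neither, so no po edge survives.
rmo_forbidden = has_cycle(rf | fr)
print(sc_forbidden, rmo_forbidden)  # True False
```

The paper's actual model additionally structures such preserved orderings along the GPU concurrency hierarchy (warp, CTA, kernel), which this flat sketch does not attempt to capture.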
GPU Concurrency: Weak Behaviours and Programming Assumptions. In ASPLOS '15.