2017 | OriginalPaper | Book chapter

GPUDrano: Detecting Uncoalesced Accesses in GPU Programs

Authors: Rajeev Alur, Joseph Devietti, Omar S. Navarro Leija, Nimit Singhania

Published in: Computer Aided Verification

Publisher: Springer International Publishing

Abstract

Graphics Processing Units (GPUs) have become widespread and popular over the past decade. Fully utilizing the parallel compute and memory resources that GPUs present remains a significant challenge, however. In this paper, we describe GPUDrano: a scalable static analysis that detects uncoalesced global memory accesses in CUDA programs. Uncoalesced global memory accesses arise when a GPU program accesses DRAM in an ill-structured way, increasing latency and energy consumption. We formalize the GPUDrano static analysis and compare it empirically against a dynamic analysis to demonstrate that false positives are rare for most programs. We implement GPUDrano in LLVM and show that it can run on GPU programs of over a thousand lines of code. GPUDrano finds 133 of the 143 uncoalesced static memory accesses in the popular Rodinia GPU benchmark suite, demonstrating the precision of our implementation. Fixing these bugs leads to real performance improvements of up to 25%.
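To make "ill-structured" access concrete, the Python sketch below models a single warp under assumed parameters: 32 threads per warp, 128-byte memory segments, and 4-byte elements. Real coalescing rules differ across GPU generations. `segments_touched` counts how many segments one warp-wide load touches; a unit-stride access `a[tid]` needs one, while a strided access such as `a[32*tid]` (for example, walking a column of a row-major matrix) needs one per thread. The second half is a toy static classifier loosely in the spirit of the analysis described above. The abstract values `IND`/`LIN`/`TOP`, the tuple expression encoding, and all function names are hypothetical illustrations, not GPUDrano's actual abstract domain or implementation.

```python
"""Toy models of global-memory coalescing for one warp (illustrative assumptions only)."""

WARP_SIZE = 32        # threads per warp (assumed)
SEGMENT_BYTES = 128   # memory segment size in bytes (assumed; varies by GPU)


def segments_touched(index_of, elem_bytes=4, base_addr=0):
    """Count distinct segments touched when thread t accesses element index_of(t)."""
    segments = set()
    for tid in range(WARP_SIZE):
        addr = base_addr + index_of(tid) * elem_bytes
        segments.add(addr // SEGMENT_BYTES)
    return len(segments)


# Hypothetical abstract values for a tiny static classifier over index expressions:
#   "IND" - same value in every thread of the warp
#   "LIN" - tid + (thread-independent offset), i.e. unit stride
#   "TOP" - anything else (may be uncoalesced)
# Expressions are tuples: ("tid",), ("const", c), ("add", e1, e2), ("mul", e1, e2).
def classify(expr):
    kind = expr[0]
    if kind == "tid":
        return "LIN"
    if kind == "const":
        return "IND"
    left, right = classify(expr[1]), classify(expr[2])
    if kind == "add":
        if "TOP" in (left, right):
            return "TOP"
        if left == right == "IND":
            return "IND"
        if left == right == "LIN":
            return "TOP"  # tid + tid == 2*tid, i.e. stride 2
        return "LIN"      # LIN + IND keeps unit stride
    if kind == "mul":
        # Conservative: only IND * IND is tracked precisely; e.g. 1 * tid
        # is reported as TOP, which would be a false positive.
        return "IND" if left == right == "IND" else "TOP"
    raise ValueError(f"unknown expression kind: {kind}")


def flag_uncoalesced(expr):
    """Report an access a[expr] unless its index is uniform or unit-stride."""
    return classify(expr) == "TOP"


if __name__ == "__main__":
    print(segments_touched(lambda t: t))       # a[tid]
    print(segments_touched(lambda t: 2 * t))   # a[2*tid]
    print(segments_touched(lambda t: 32 * t))  # a[32*tid]
    print(flag_uncoalesced(("mul", ("const", 32), ("tid",))))
```

Under these assumptions the four lines print 1, 2, 32 and True. Note that the toy classifier accepts `tid + c` as coalesced even though a misaligned offset can cost one extra segment (`segments_touched(lambda t: t + 1)` returns 2); treating that bounded overhead as acceptable is a deliberate simplification in this sketch.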

Metadata
DOI
https://doi.org/10.1007/978-3-319-63387-9_25