Published in: The Journal of Supercomputing 1/2014

1 July 2014

Concurrent warp execution: improving performance of GPU-likely SIMD architecture by increasing resource utilization

Authors: Hong Jun Choi, Dong Oh Son, Jong Myon Kim, Cheol Hong Kim


Abstract

Exploiting hardware parallelism is essential to improving the performance of computing systems. Single instruction, multiple data (SIMD) architectures are widely used to maximize throughput by exploiting this parallelism. Unfortunately, branch divergence caused by branch instructions leaves computational resources underutilized, which degrades SIMD performance. The graphics processing unit (GPU) is a representative parallel architecture based on SIMD. In recent computing systems, GPUs can run general-purpose applications as well as graphics applications with the help of convenient APIs. However, unlike graphics applications, general-purpose applications contain many branch instructions, so branch divergence causes serious performance degradation on GPUs. In this paper, we propose the concurrent warp execution (CWE) technique, which reduces this degradation when GPUs execute general-purpose applications by increasing resource utilization. CWE selects co-warps so that more threads in a warp are active, allowing combined warps to execute concurrently. According to our simulation results, the proposed architecture provides a significant performance improvement (5.85% over post-dominator reconvergence (PDOM), 91% over dynamic warp formation (DWF)) with little hardware overhead.
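
The effect of branch divergence on lane utilization, and the general idea of issuing warps together, can be illustrated with a small Python sketch. It is a toy model and not the authors' mechanism: the abstract does not say how co-warps are selected, so the rule used here (two warps may share an issue slot only when their active-lane masks do not overlap), the greedy packing order, and the names WARP_SIZE, can_combine, issue_slots, and average_utilization are all assumptions made for illustration.

```python
from typing import List

WARP_SIZE = 32  # assumed warp width; the abstract does not state one


def can_combine(mask_a: int, mask_b: int) -> bool:
    """Assumed rule: two masks fit in one issue slot if no lane is active in both."""
    return (mask_a & mask_b) == 0


def issue_slots(masks: List[int], combine: bool) -> List[int]:
    """Return the lane mask occupied in each issue slot.

    With combine=False every diverged warp fragment takes its own slot.
    With combine=True fragments are packed greedily into the first slot
    whose active lanes do not conflict (a hypothetical policy).
    """
    slots: List[int] = []
    for mask in masks:
        if combine:
            for i, occupied in enumerate(slots):
                if can_combine(occupied, mask):
                    slots[i] = occupied | mask
                    break
            else:
                slots.append(mask)
        else:
            slots.append(mask)
    return slots


def average_utilization(slots: List[int]) -> float:
    """Fraction of SIMD lanes doing useful work, averaged over all slots."""
    active = sum(bin(s).count("1") for s in slots)
    return active / (len(slots) * WARP_SIZE)


# Four diverged fragments: two half-warps and two quarter-warps.
fragments = [0x0000FFFF, 0xFFFF0000, 0x000000FF, 0x0000FF00]
serial = issue_slots(fragments, combine=False)
packed = issue_slots(fragments, combine=True)
print(len(serial), average_utilization(serial))
print(len(packed), average_utilization(packed))
```

In this toy model the four fragments need four issue slots when executed one after another but only two when lane-disjoint fragments share a slot, so average lane utilization doubles. Real hardware would also have to handle register file access and instruction fetch for the combined warps, which the sketch ignores.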

Metadata
Title
Concurrent warp execution: improving performance of GPU-likely SIMD architecture by increasing resource utilization
Authors
Hong Jun Choi
Dong Oh Son
Jong Myon Kim
Cheol Hong Kim
Publication date
1 July 2014
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 1/2014
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-014-1155-4
