The Journal of Supercomputing

Published: 1 July 2014

Concurrent warp execution: improving performance of GPU-likely SIMD architecture by increasing resource utilization

Authors: Hong Jun Choi, Dong Oh Son, Jong Myon Kim, Cheol Hong Kim

Published in: The Journal of Supercomputing | Issue 1/2014

Abstract

Hardware parallelism should be exploited to improve the performance of computing systems. Single instruction, multiple data (SIMD) architectures are widely used to maximize the throughput of computing systems by exploiting hardware parallelism. Unfortunately, branch divergence caused by branch instructions leaves computational resources underutilized, which degrades the performance of SIMD architectures. The graphics processing unit (GPU) is a representative parallel architecture based on SIMD. In recent computing systems, GPUs can process general-purpose applications as well as graphics applications with the help of convenient APIs. However, unlike graphics applications, general-purpose applications include many branch instructions, so branch divergence causes serious performance degradation on GPUs. In this paper, we propose the concurrent warp execution (CWE) technique, which reduces the performance degradation of GPUs executing general-purpose applications by increasing resource utilization. The proposed CWE selects co-warps to activate more threads in a warp, so that the combined warps execute concurrently. According to our simulation results, the proposed architecture provides a significant performance improvement (5.85% over PDOM, 91% over DWF) with little hardware overhead.
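The effect of branch divergence on lane utilization can be illustrated with a toy model. The sketch below is not the CWE hardware described in the paper: the warp width of 8, the helper names, and in particular the pairing rule (two warps at the same instruction may share an issue slot when no lane is active in both) are assumptions made purely for illustration, since the abstract does not say how co-warps are selected.

```python
# Toy model of SIMD lane utilization under branch divergence.
# Assumption: illustrative only; this is not the paper's CWE mechanism.

WARP_SIZE = 8  # hypothetical warp width, kept small for readability


def split_on_branch(predicates):
    """Return the (taken, not_taken) active masks of one warp at a branch."""
    taken = [bool(p) for p in predicates]
    not_taken = [not p for p in taken]
    return taken, not_taken


def utilization(masks):
    """Fraction of lanes doing useful work over a sequence of issue slots."""
    if not masks:
        return 0.0
    active = sum(sum(mask) for mask in masks)
    return active / (len(masks) * len(masks[0]))


def can_share_slot(mask_a, mask_b):
    """Assumed pairing rule: no lane may be active in both masks."""
    return not any(a and b for a, b in zip(mask_a, mask_b))


def merge(mask_a, mask_b):
    """Combine two compatible masks into one issue slot."""
    return [a or b for a, b in zip(mask_a, mask_b)]


# Warp 0: even lanes take the branch. Warp 1: odd lanes take it.
w0_taken, w0_not = split_on_branch([i % 2 == 0 for i in range(WARP_SIZE)])
w1_taken, w1_not = split_on_branch([i % 2 == 1 for i in range(WARP_SIZE)])

# Baseline: each warp serializes its two branch paths (four issue slots).
baseline = [w0_taken, w0_not, w1_taken, w1_not]

# Combined: paths with non-overlapping lanes share a slot (two issue slots).
combined = []
for a, b in [(w0_taken, w1_taken), (w0_not, w1_not)]:
    combined.extend([merge(a, b)] if can_share_slot(a, b) else [a, b])

print("baseline utilization:", utilization(baseline))
print("combined utilization:", utilization(combined))
```

In this hand-picked case every divergent path keeps only half of the lanes busy, so the serialized baseline reaches 50% utilization, while pairing the complementary masks fills every lane. Real workloads rarely diverge this symmetrically, so the gain in the model is an upper bound for this pattern rather than a prediction.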

Title: Concurrent warp execution: improving performance of GPU-likely SIMD architecture by increasing resource utilization
Authors: Hong Jun Choi, Dong Oh Son, Jong Myon Kim, Cheol Hong Kim
Publication date: 1 July 2014
Publisher: Springer US
Published in: The Journal of Supercomputing, Issue 1/2014
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-014-1155-4
