The Journal of Supercomputing

Published: 25-01-2019

Application-aware NoC management in GPUs multitasking

Authors: Zhen Xu, Xia Zhao, Zhiying Wang, Canqun Yang

Published in: The Journal of Supercomputing | Issue 8/2019

Abstract

Current network-on-chip (NoC) designs in GPUs are agnostic to application requirements, which wastes performance in GPU multitasking. We observe that applications can generally be classified as either network-sensitive or network-insensitive. We propose application-aware NoC (AA-NoC) management to better exploit these application characteristics. AA-NoC consists of topology-aware streaming multiprocessor (SM) mapping and adaptive virtual channel (VC) management. The topology-aware SM mapping is implemented in the concurrent thread array scheduler, and the adaptive VC management relies on lightweight online profiling that incurs only limited hardware overhead. Compared to a traditional application-agnostic NoC, the evaluation results show that AA-NoC improves system throughput (STP) and average normalized turnaround time (ANTT) by 19.7% and 20.9%, respectively.
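For readers unfamiliar with the two metrics, the sketch below computes STP and ANTT using their commonly used multiprogram definitions: STP sums each application's normalized progress (isolated execution time divided by co-run execution time), and ANTT averages each application's slowdown (co-run time divided by isolated time). The function names and the cycle counts are hypothetical illustrations, not values or code from the paper, and the paper's exact measurement setup is not described in the abstract.

```python
from typing import Sequence


def stp(isolated_cycles: Sequence[float], corun_cycles: Sequence[float]) -> float:
    """System throughput: sum of per-application normalized progress.

    Higher is better; the maximum equals the number of applications.
    """
    return sum(iso / co for iso, co in zip(isolated_cycles, corun_cycles))


def antt(isolated_cycles: Sequence[float], corun_cycles: Sequence[float]) -> float:
    """Average normalized turnaround time: mean per-application slowdown.

    Lower is better; the minimum is 1.0 (no slowdown from sharing).
    """
    n = len(isolated_cycles)
    return sum(co / iso for iso, co in zip(isolated_cycles, corun_cycles)) / n


if __name__ == "__main__":
    # Hypothetical cycle counts for two co-running applications (assumed values).
    isolated = [100.0, 200.0]  # each application running alone on the GPU
    corun = [200.0, 250.0]     # the same applications sharing the GPU
    print(f"STP  = {stp(isolated, corun):.3f}")
    print(f"ANTT = {antt(isolated, corun):.3f}")
```

Because STP is a higher-is-better metric and ANTT is a lower-is-better metric, an improvement in ANTT presumably corresponds to a reduction in its value.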


Title: Application-aware NoC management in GPUs multitasking
Authors: Zhen Xu, Xia Zhao, Zhiying Wang, Canqun Yang
Publication date: 25-01-2019
Publisher: Springer US
Published in: The Journal of Supercomputing / Issue 8/2019
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-018-2694-x
