The Journal of Supercomputing

1 August 2014

A run-time optimization approach for reducing data movements using locality-aware searching

Authors: Liang Li, Endong Wang, Xingjun Zhang, Kang Yan, Tao Ju, Xiaoshe Dong

Published in: The Journal of Supercomputing | Issue 2/2014

Abstract

The CPU–GPU communication bottleneck limits the performance gains of GPU applications in heterogeneous GPGPU systems and is usually addressed through data reuse optimization. This paper analyzes data reuse through a DAG abstraction and derives rules showing that run-time data reuse optimization can effectively relieve the bottleneck. Based on these rules, the paper proposes a run-time optimization framework for data reuse, called R-Tracker. R-Tracker uses a locality-aware searching approach to handle reuses. It implements the data reuse optimization at low cost and also carries out the searching, the data transfers, and the GPU computation concurrently. R-Tracker relaxes the constraints required by compiler-based approaches and thus achieves a better reuse effect. The experimental results show that R-Tracker improves performance by 1.77–16.42% over the compiler-based approach OpenMPC and by 1.40–8.39% over CGCM in single-node execution, and by 48.78–60% over CGCM in multi-node execution.
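The general idea of run-time data reuse can be illustrated with a small simulation. The sketch below is not R-Tracker and does not reproduce its locality-aware searching; it only assumes a hypothetical `ReuseTracker` class that remembers which named buffers still have a valid copy on the device and skips the host-to-device copy for those. All names (`ReuseTracker`, `launch`, `host_write`, `run_demo`) are invented for illustration, and the "transfer" is just a counter.

```python
class ReuseTracker:
    """Toy run-time tracker (hypothetical, not the paper's R-Tracker).

    Skips host-to-device copies for buffers whose device copy is still valid.
    """

    def __init__(self):
        self.valid_on_device = set()  # buffers with an up-to-date device copy
        self.transfers = 0            # host-to-device copies actually issued

    def launch(self, reads, writes):
        """Simulate one GPU kernel that reads and writes named buffers."""
        for name in reads:
            if name not in self.valid_on_device:
                self.transfers += 1   # stands in for a real copy to the GPU
                self.valid_on_device.add(name)
        # Kernel outputs are produced on the device, so they are valid there.
        self.valid_on_device.update(writes)

    def host_write(self, name):
        """The CPU modified the buffer, so the device copy becomes stale."""
        self.valid_on_device.discard(name)


def run_demo():
    tracker = ReuseTracker()
    tracker.launch(reads=["a", "b"], writes=["c"])  # copies a and b
    tracker.launch(reads=["a", "c"], writes=["d"])  # reuses a and c
    tracker.host_write("a")                         # CPU updates a
    tracker.launch(reads=["a", "d"], writes=["e"])  # copies a again, reuses d
    return tracker.transfers
```

In this toy run the three kernels read six buffers in total, but only three copies are issued, because the tracker decides at run time which device copies are still usable. A real framework would additionally have to overlap this bookkeeping with transfers and computation, as the abstract describes.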

Title: A run-time optimization approach for reducing data movements using locality-aware searching
Authors: Liang Li, Endong Wang, Xingjun Zhang, Kang Yan, Tao Ju, Xiaoshe Dong
Publication date: 1 August 2014
Publisher: Springer US
Published in: The Journal of Supercomputing / Issue 2/2014
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-014-1186-x