Published in: The Journal of Supercomputing 2/2014

1 August 2014

A run-time optimization approach for reducing data movements using locality-aware searching

Authors: Liang Li, Endong Wang, Xingjun Zhang, Kang Yan, Tao Ju, Xiaoshe Dong


Abstract

The CPU–GPU communication bottleneck limits the performance gains of GPU applications in heterogeneous GPGPU systems and is usually addressed by data reuse optimization. This paper analyzes data reuse through a DAG abstraction and derives rules showing that run-time data reuse optimization can effectively relieve the bottleneck. Based on these rules, the paper proposes a run-time optimization framework for data reuse called R-Tracker. R-Tracker uses a locality-aware searching approach to handle reuses. It not only implements data reuse optimization at low cost but also performs the searching, the data transfers, and the GPU computation concurrently. R-Tracker relaxes the constraints required by compiler-based approaches and thus achieves a better reuse effect. The experimental results show that R-Tracker improves performance by 1.77–16.42% over the compiler-based approach OpenMPC and by 1.40–8.39% over CGCM in single-node execution, and by 48.78–60% over CGCM in multi-node execution.
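The abstract does not describe R-Tracker's internals, so the following is only a minimal sketch of the general idea behind run-time data reuse: before each kernel launch, a tracker checks whether a buffer already has a valid copy on the GPU and issues a transfer only when it does not. The class `ReuseTracker`, its methods, and the buffer names are hypothetical, and the sketch models neither the locality-aware search nor the overlap of searching, transfers, and computation.

```python
class ReuseTracker:
    """Toy run-time tracker of which buffers have a valid device copy.

    Hypothetical illustration only; not R-Tracker's actual design.
    """

    def __init__(self):
        self._valid = {}     # buffer name -> True if the device copy is current
        self.transfers = 0   # host-to-device copies that had to be issued
        self.reuses = 0      # copies avoided because the device copy was valid

    def before_kernel(self, buffers):
        """Call before launching a kernel that reads `buffers`."""
        for name in buffers:
            if self._valid.get(name, False):
                self.reuses += 1
            else:
                # A real runtime would start a host-to-device copy here.
                self.transfers += 1
                self._valid[name] = True

    def host_write(self, name):
        """The CPU modified `name`, so any device copy is now stale."""
        if name in self._valid:
            self._valid[name] = False


tracker = ReuseTracker()
tracker.before_kernel(["a", "b"])  # first use: both buffers are copied
tracker.before_kernel(["a", "b"])  # second use: both copies are reused
tracker.host_write("a")            # CPU update invalidates the copy of "a"
tracker.before_kernel(["a", "c"])  # "a" is re-copied, "c" is copied
print(tracker.transfers, tracker.reuses)
```

Because the decision is made at run time from the observed reads and writes, no static proof that a buffer is unmodified between kernels is needed, which is the kind of constraint the abstract says compiler-based approaches impose.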

Metadata
Title
A run-time optimization approach for reducing data movements using locality-aware searching
Authors
Liang Li
Endong Wang
Xingjun Zhang
Kang Yan
Tao Ju
Xiaoshe Dong
Publication date
1 August 2014
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 2/2014
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-014-1186-x
