
International Journal of Parallel Programming

Published: 01-12-2014

Reducing Communication Overhead in Multi-GPU Hybrid Solver for 2D Laplace’s Equation

Authors: Michał Czapiński, Chris Thompson, Stuart Barnes

Published in: International Journal of Parallel Programming | Issue 6/2014


Abstract

The possibility of porting algorithms to graphics processing units (GPUs) has attracted significant interest among researchers. The natural next step is to employ multiple GPUs, but communication overhead may limit further performance improvement. In this paper, we investigate techniques for reducing this overhead on hybrid CPU–GPU platforms, including careful data layout and use of GPU memory spaces, as well as non-blocking communication. In addition, we propose an accurate automatic load-balancing technique for heterogeneous environments. We validate our approach on a hybrid Jacobi solver for 2D Laplace’s Equation. Experiments carried out on various graphics hardware and types of connectivity confirm that the proposed data layout allows our fastest CUDA kernels to reach the analytical limit for memory bandwidth (up to 106 GB/s on an NVIDIA GTX 480), and that non-blocking communication significantly reduces overhead, allowing for almost linear speed-up even when communication is carried out over relatively slow networks.
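The abstract names the main ingredients of the solver: Jacobi iteration for Laplace’s equation, a domain split across several devices, halo exchange between neighbouring subdomains, and a load-balanced partition. The Python sketch below illustrates how these pieces fit together on a single machine. It assumes a uniform grid with fixed (Dirichlet) boundary values, a split into horizontal strips of rows, and hypothetical per-device relative speeds. The helper names (`jacobi_sweep`, `split_rows`, `solve_decomposed`) are illustrative only; the proportional split is a simple stand-in, not the load-balancing technique proposed in the paper, and the strips are plain Python lists rather than GPU buffers.

```python
# Minimal sketch: Jacobi iteration for the 2D Laplace equation on a grid split
# into horizontal strips, one strip per (simulated) device, with halo rows
# exchanged after every sweep. Device speeds are hypothetical inputs.


def jacobi_sweep(u):
    """One Jacobi sweep: each interior point becomes the mean of its four
    neighbours. First/last rows and columns (fixed boundary values or halo
    rows) are copied unchanged."""
    rows, cols = len(u), len(u[0])
    new = [row[:] for row in u]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
    return new


def split_rows(n_rows, speeds):
    """Give each device a share of n_rows proportional to its relative speed.
    Largest-remainder rounding makes the counts sum exactly to n_rows.
    (Illustrative stand-in, not the paper's load-balancing method.)"""
    total = sum(speeds)
    exact = [n_rows * s / total for s in speeds]
    counts = [int(x) for x in exact]
    leftover = n_rows - sum(counts)
    # Hand the remaining rows to the devices with the largest fractional parts.
    order = sorted(range(len(speeds)), key=lambda k: exact[k] - counts[k], reverse=True)
    for k in order[:leftover]:
        counts[k] += 1
    return counts


def solve_decomposed(grid, speeds, iterations):
    """Run Jacobi sweeps with the interior rows split into strips."""
    counts = [c for c in split_rows(len(grid) - 2, speeds) if c > 0]
    # Each strip stores its owned rows plus one halo row above and one below.
    strips, start = [], 1
    for c in counts:
        strips.append([row[:] for row in grid[start - 1:start + c + 1]])
        start += c
    for _ in range(iterations):
        # Compute phase: every strip updates its owned rows independently.
        strips = [jacobi_sweep(s) for s in strips]
        # Exchange phase: copy each neighbour's edge row into the halo.
        for k in range(len(strips) - 1):
            strips[k][-1] = strips[k + 1][1][:]   # bottom halo of strip k
            strips[k + 1][0] = strips[k][-2][:]   # top halo of strip k + 1
    # Reassemble: top boundary, owned rows of every strip, bottom boundary.
    result = [grid[0][:]]
    for s in strips:
        result.extend(row[:] for row in s[1:-1])
    result.append(grid[-1][:])
    return result


if __name__ == "__main__":
    n = 6
    grid = [[0.0] * n for _ in range(n)]
    grid[0] = [100.0] * n  # hot top edge; the other edges are held at 0
    reference = grid
    for _ in range(50):
        reference = jacobi_sweep(reference)
    split = solve_decomposed(grid, speeds=[1.0, 2.0], iterations=50)
    print(split == reference)  # True: strips reproduce the single-domain result
```

Running the example prints `True`, because the halo exchange gives each strip exactly the neighbour values a single-domain sweep would use. A non-blocking variant would typically update each strip's edge rows first, start the halo transfers, and compute the remaining rows while those transfers are in flight; the sketch keeps the compute and exchange phases sequential for clarity.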



Compute Capability defines the hardware configuration of the GPU, e.g., the amount of shared memory, the number of registers, and the presence of implicit caching.


Title: Reducing Communication Overhead in Multi-GPU Hybrid Solver for 2D Laplace’s Equation
Authors: Michał Czapiński, Chris Thompson, Stuart Barnes
Publication date: 01-12-2014
Publisher: Springer US
Published in: International Journal of Parallel Programming / Issue 6/2014
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI: https://doi.org/10.1007/s10766-013-0293-2


Other articles of this Issue 6/2014

Design patterns percolating to parallel programming framework implementation

Performance Optimization of Video Coding Process on Multi-Core Platform Using Gop Level Parallelism

An Algorithm Template for Domain-Based Parallel Irregular Algorithms

VORD: A Versatile On-the-fly Race Detection Tool in OpenMP Programs

An Efficient Scalable Runtime System for Macro Data Flow Processing Using S-Net

A Scalable Farm Skeleton for Hybrid Parallel and Distributed Programming
