Published in: International Journal of Parallel Programming 6/2014

01-12-2014

Reducing Communication Overhead in Multi-GPU Hybrid Solver for 2D Laplace’s Equation

Authors: Michał Czapiński, Chris Thompson, Stuart Barnes


Abstract

The possibility of porting algorithms to graphics processing units (GPUs) has attracted significant interest among researchers. The natural next step is to employ multiple GPUs, but communication overhead may limit further performance improvement. In this paper, we investigate techniques for reducing this overhead on hybrid CPU–GPU platforms, including careful data layout, use of the GPU memory spaces, and non-blocking communication. In addition, we propose an accurate automatic load balancing technique for heterogeneous environments. We validate our approach on a hybrid Jacobi solver for 2D Laplace’s Equation. Experiments carried out on various graphics hardware and types of connectivity confirm that the proposed data layout allows our fastest CUDA kernels to reach the analytical limit for memory bandwidth (up to 106 GB/s on NVIDIA GTX 480), and that non-blocking communication significantly reduces overhead, allowing almost linear speed-up even when communication is carried out over relatively slow networks.
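To make the source of the communication overhead concrete, the following is a minimal sketch of a Jacobi iteration for 2D Laplace’s Equation in plain Python. It is not the authors' CUDA implementation. The function names (`jacobi_step`, `solve`, `solve_split`) and the two-block row split are assumptions chosen for illustration. The split version mimics, inside a single process, two devices that each own a block of rows and must exchange one boundary ("halo") row with their neighbour after every sweep; that per-iteration exchange is the communication a multi-GPU solver has to hide or reduce.

```python
def jacobi_step(u):
    """One Jacobi sweep: every interior point becomes the mean of its 4 neighbours."""
    rows, cols = len(u), len(u[0])
    new = [row[:] for row in u]  # first/last rows and columns are left unchanged
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1])
    return new


def solve(u, iterations):
    """Reference solver: repeated sweeps over the whole grid."""
    for _ in range(iterations):
        u = jacobi_step(u)
    return u


def solve_split(u, iterations, cut):
    """Same computation with the grid split into two row blocks (hypothetical layout).

    'top' owns rows 0..cut-1 and keeps a halo copy of row cut;
    'bottom' owns rows cut..end and keeps a halo copy of row cut-1.
    Requires 1 <= cut <= len(u) - 1.
    """
    top = [row[:] for row in u[:cut + 1]]
    bottom = [row[:] for row in u[cut - 1:]]
    for _ in range(iterations):
        # Each block sweeps independently; its halo row acts as a fixed boundary.
        top = jacobi_step(top)
        bottom = jacobi_step(bottom)
        # Halo exchange: each block receives its neighbour's freshly updated edge row.
        top[-1] = bottom[1][:]
        bottom[0] = top[-2][:]
    # Drop the halo rows and stitch the owned rows back together.
    return top[:-1] + bottom[1:]


# Example: 6x6 grid, top edge held at 1.0, all other boundaries at 0.0.
n = 6
grid = [[0.0] * n for _ in range(n)]
grid[0] = [1.0] * n

full = solve(grid, 50)
split = solve_split(grid, 50, 3)
print(full == split)
```

In this sketch only one row per neighbour crosses the block boundary in each iteration, while the work inside each block grows with the number of rows it owns. On real hardware the exchange can in principle be started asynchronously and overlapped with the update of points that do not depend on the halo, which is the general idea behind the non-blocking communication mentioned in the abstract.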


Footnotes
1. Compute Capability defines the hardware configuration of the GPU, e.g. amount of shared memory, registers, presence of implicit caching, etc.
 
Metadata

Title: Reducing Communication Overhead in Multi-GPU Hybrid Solver for 2D Laplace’s Equation
Authors: Michał Czapiński, Chris Thompson, Stuart Barnes
Publication date: 01-12-2014
Publisher: Springer US
Published in: International Journal of Parallel Programming, Issue 6/2014
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI: https://doi.org/10.1007/s10766-013-0293-2
